DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Status
Claims 1-20 are pending for examination in the amendment filed 10/28/2025. Claims 1, 10-15, and 18-20 have been amended.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/19/2025 has been entered.
Priority
Acknowledgment is made of Applicant’s claim to the benefit of provisional application 63/388,091, filed 07/11/2022.
Response to Arguments
Applicant’s arguments, filed 10/28/2025, with respect to claim 1 have been considered but are moot because the new ground of rejection, necessitated by the newly added amendments, does not rely on the combination of references applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 7-11, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chakraborty (US11971955B1) in view of Kokkinos (EP3171297A1).
Regarding claim 1, Chakraborty teaches a computer-implemented method for training a machine learning model ([col. 10 ln. 33-38] In this section, the neural network architecture, RFM module and training methods of SPOT are described. A one-shot object detection network is created by extending DETR, an end-to-end detection model composed of a backbone (typically a convolutional residual network), followed by a Transformer Encoder-Decoder), the method comprising:
receiving a training data set that includes at least one image, text referring to at least one object included in the at least one image, and at least one bounding box annotation associated with the at least one object ([col. 2 ln. 30-35] For example, human annotators are typically provided annotation interfaces where the annotator draws a bounding box around specific objects of interest and provides a label for those objects. These annotated images may be used to train an object detection model to detect the objects that were labeled in the training data); and
performing, based on the training data set, one or more operations to generate a trained machine learning model to segment images based on text ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. [col. 14 ln. 27-49] The original training image 306 may have multiple classes of objects. For example, a first class of object may have first bounding box coordinates and a corresponding object class label 304a. Similarly, a second class of object may have second bounding box coordinates and a second label, and a third class of object may have third bounding box coordinates and a third label... In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads),
wherein the one or more operations to generate the trained machine learning model include minimizing a loss function ([col. 2-3 ln. 64-3] Generally, in machine learning models, such as neural networks, after initialization, annotated training data may be used to generate a cost or “loss” function that describes the difference between expected output of the machine learning model and actual output. The parameters (e.g., weights and/or biases) of the machine learning model may be updated to minimize (or maximize) the cost).
Chakraborty does not teach a loss function that comprises both a multiple instance learning loss term and an energy loss term.
Kokkinos, in the same field of endeavor of image segmentation, teaches a loss function that comprises both a multiple instance learning loss term and an energy loss term ([0018] According to an aspect of the invention, we introduce a training objective that explicitly encodes the uncertainty of human boundary annotations in a Multiple Instance Learning framework (MIL) and incorporate it as a loss for the training of a HED-style DCNN for boundary detection. [0082] As illustrated in Fig. 3 we now have additional processing streams, for semantic segmentation, for region proposals and object mapping. [0085] The module CRF implements fully connected conditional random field modelling and outputs semantic segmentation results, by integrating the classifier scores with some image-based measure of affinity. [0089] For these additional streams we use loss functions adapted to the task at hand, guiding the learning of the fully-connected layers so as to optimize a task-specific performance measure. We use the softmax loss function, as is common in all recent works on semantic segmentation).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Kokkinos to use a loss function that comprises multiple instance learning loss and energy loss because "Unlike most recent multi-task architectures, in our work the flow of information across these different tasks is explicitly controlled and exploited through a recurrent architecture. For this we introduce a reference layer that conglomerates at every stage the results of the different tasks and then updates the solutions to the individual tasks in the light of the previous solutions. This recurrent architecture allows the synergy between the tasks to be exploited and leads to increasingly accurate solutions" [Kokkinos 0010] and "guarantees an improvement of the (training) loss at every stage of processing." [Kokkinos 0105].
Regarding claim 7, Chakraborty and Kokkinos teach the method of claim 1. Chakraborty further teaches wherein the machine learning model comprises a transformer model ([col. 7 ln. 32-35] Although, in various examples described herein, the architecture of the example-based object detector 114 may include a robust feature mapping module and/or a transformer model).
Regarding claim 8, Chakraborty and Kokkinos teach the method of claim 1. Chakraborty further teaches wherein the text referring to the at least one object comprises one or more natural language expressions ([col. 12 ln. 44-46] Additionally, the multiple query objects detected may be labeled with their respective class labels (e.g., “vegetarian logo,” “warning logo,” etc.) and confidence scores).
Regarding claim 9, Chakraborty and Kokkinos teach the method of claim 1. Chakraborty further teaches processing a first image and a first text using the machine learning model ([col. 11 ln. 49-65] FIG. 2A is a block diagram illustrating an example-based annotation system, in accordance with various aspects of the present disclosure. As shown, a user may upload configurations (action 1) using a user interface (UI). The configurations may include target images (e.g., those images in which objects represented in query images are to be detected) and query images (e.g., examples images including classes of objects to be detected in the target images)… The user-uploaded query images may include class labels labeling the classes of the respective objects-of-interest depicted in the user-uploaded query images. [col. 12 ln. 18-19] At action 2, query-target pairs may be generated for input into the example-based object detector 114. [col. 7 ln. 32-37] Although, in various examples described herein, the architecture of the example-based object detector 114 may include a robust feature mapping module and/or a transformer model, in some other examples, other machine learning algorithms may be used to locate representations of an object represented in a query image within a target image) to generate a segmentation mask indicating one or more objects in the first image that are referred to by the first text ([col. 13 ln. 11-16] The example-based object detection system 106 may generate results 216. The results 216 may localize the object-of-interest in images in which depictions of the object-of-interest appear (in whole or in part). As previously described, the results 216 may include bounding box 218 localizations and/or segmentation masks).
Regarding claim 10, Chakraborty and Kokkinos teach the method of claim 1. Kokkinos teaches wherein the energy loss term comprises a conditional random field loss term ([0082] As illustrated in Fig. 3 we now have additional processing streams, for semantic segmentation, for region proposals and object mapping. [0085] The module CRF implements fully connected conditional random field modelling and outputs semantic segmentation results, by integrating the classifier scores with some image-based measure of affinity. [0089] For these additional streams we use loss functions adapted to the task at hand, guiding the learning of the fully-connected layers so as to optimize a task-specific performance measure. We use the softmax loss function, as is common in all recent works on semantic segmentation).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Kokkinos to use conditional random field loss to "output semantic segmentation results" [Kokkinos 0085] because "the fusion of multiple tasks and use recurrent processing to exploit the flow of information across them…guarantees an improvement of the (training) loss at every stage of processing." [Kokkinos 0105].
Regarding claim 11, Chakraborty teaches one or more non-transitory computer-readable media storing instructions that, when executed by at least one processor ([col. 25 ln. 7-11] Also, any logic or application described herein that comprises software or code can be embodied in any non-transitory computer-readable medium or memory for use by or in connection with an instruction execution system such as a processing component in a computer system), cause the at least one processor to perform the steps of:
receiving a training data set that includes at least one image, text referring to at least one object included in the at least one image, and at least one bounding box annotation associated with the at least one object ([col. 2 ln. 30-35] For example, human annotators are typically provided annotation interfaces where the annotator draws a bounding box around specific objects of interest and provides a label for those objects. These annotated images may be used to train an object detection model to detect the objects that were labeled in the training data); and
performing, based on the training data set, one or more operations to generate a trained machine learning model to segment images based on text ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. [col. 14 ln. 27-49] The original training image 306 may have multiple classes of objects. For example, a first class of object may have first bounding box coordinates and a corresponding object class label 304a. Similarly, a second class of object may have second bounding box coordinates and a second label, and a third class of object may have third bounding box coordinates and a third label... In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads),
wherein the one or more operations to generate the trained machine learning model include minimizing a loss function ([col. 2-3 ln. 64-3] Generally, in machine learning models, such as neural networks, after initialization, annotated training data may be used to generate a cost or “loss” function that describes the difference between expected output of the machine learning model and actual output. The parameters (e.g., weights and/or biases) of the machine learning model may be updated to minimize (or maximize) the cost).
Chakraborty does not teach a loss function that comprises both a multiple instance learning loss term and an energy loss term.
Kokkinos, in the same field of endeavor of image segmentation, teaches a loss function that comprises both a multiple instance learning loss term and an energy loss term ([0018] According to an aspect of the invention, we introduce a training objective that explicitly encodes the uncertainty of human boundary annotations in a Multiple Instance Learning framework (MIL) and incorporate it as a loss for the training of a HED-style DCNN for boundary detection. [0082] As illustrated in Fig. 3 we now have additional processing streams, for semantic segmentation, for region proposals and object mapping. [0085] The module CRF implements fully connected conditional random field modelling and outputs semantic segmentation results, by integrating the classifier scores with some image-based measure of affinity. [0089] For these additional streams we use loss functions adapted to the task at hand, guiding the learning of the fully-connected layers so as to optimize a task-specific performance measure. We use the softmax loss function, as is common in all recent works on semantic segmentation).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Kokkinos to use a loss function that comprises multiple instance learning loss and energy loss because "Unlike most recent multi-task architectures, in our work the flow of information across these different tasks is explicitly controlled and exploited through a recurrent architecture. For this we introduce a reference layer that conglomerates at every stage the results of the different tasks and then updates the solutions to the individual tasks in the light of the previous solutions. This recurrent architecture allows the synergy between the tasks to be exploited and leads to increasingly accurate solutions" [Kokkinos 0010].
Regarding claim 17, Chakraborty and Kokkinos teach the media of claim 11. Chakraborty further teaches wherein the instructions, when executed by the at least one processor, further cause the at least one processor ([col. 25 ln. 7-11] Also, any logic or application described herein that comprises software or code can be embodied in any non-transitory computer-readable medium or memory for use by or in connection with an instruction execution system such as a processing component in a computer system) to perform the step of generating the at least one bounding box annotation by performing one or more object detection operations based on the at least one image ([col. 3 ln. 58-61] Additionally, an object detection head of the example-based object detection system may output a bounding box surrounding the depiction of the object-of-interest present in the target image).
Regarding claim 18, Chakraborty and Kokkinos teach the media of claim 11. Chakraborty further teaches wherein the instructions, when executed by the at least one processor, further cause the at least one processor ([col. 25 ln. 7-11] Also, any logic or application described herein that comprises software or code can be embodied in any non-transitory computer-readable medium or memory for use by or in connection with an instruction execution system such as a processing component in a computer system) to perform the step of processing a first image and a first text using the trained machine learning model ([col. 11 ln. 49-65] FIG. 2A is a block diagram illustrating an example-based annotation system, in accordance with various aspects of the present disclosure. As shown, a user may upload configurations (action 1) using a user interface (UI). The configurations may include target images (e.g., those images in which objects represented in query images are to be detected) and query images (e.g., examples images including classes of objects to be detected in the target images)… The user-uploaded query images may include class labels labeling the classes of the respective objects-of-interest depicted in the user-uploaded query images. [col. 12 ln. 18-19] At action 2, query-target pairs may be generated for input into the example-based object detector 114. [col. 7 ln. 32-37] Although, in various examples described herein, the architecture of the example-based object detector 114 may include a robust feature mapping module and/or a transformer model, in some other examples, other machine learning algorithms may be used to locate representations of an object represented in a query image within a target image. [col. 14 ln. 45-51] In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads) to generate a segmentation mask indicating one or more objects in the first image that are referred to by the first text ([col. 13 ln. 11-16] The example-based object detection system 106 may generate results 216. The results 216 may localize the object-of-interest in images in which depictions of the object-of-interest appear (in whole or in part). As previously described, the results 216 may include bounding box 218 localizations and/or segmentation masks).
Regarding claim 19, Chakraborty and Kokkinos teach the media of claim 18. Chakraborty further teaches wherein the segmentation mask indicates one or more pixels in the first image that are associated with the one or more objects ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image).
Regarding claim 20, Chakraborty teaches a system (Fig. 1), comprising: one or more memories storing instructions; and one or more processors that are coupled to the one or more memories and, when executing the instructions ([col. 25 ln. 7-11] Also, any logic or application described herein that comprises software or code can be embodied in any non-transitory computer-readable medium or memory for use by or in connection with an instruction execution system such as a processing component in a computer system), are configured to:
receive a training data set that includes at least one image, text referring to at least one object included in the at least one image, and at least one bounding box annotation associated with the at least one object ([col. 2 ln. 30-35] For example, human annotators are typically provided annotation interfaces where the annotator draws a bounding box around specific objects of interest and provides a label for those objects. These annotated images may be used to train an object detection model to detect the objects that were labeled in the training data); and
perform, based on the training data set, one or more operations to generate a trained machine learning model to segment images based on text ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. [col. 14 ln. 27-49] The original training image 306 may have multiple classes of objects. For example, a first class of object may have first bounding box coordinates and a corresponding object class label 304a. Similarly, a second class of object may have second bounding box coordinates and a second label, and a third class of object may have third bounding box coordinates and a third label... In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads),
wherein the one or more operations to generate the trained machine learning model include minimizing a loss function ([col. 2-3 ln. 64-3] Generally, in machine learning models, such as neural networks, after initialization, annotated training data may be used to generate a cost or “loss” function that describes the difference between expected output of the machine learning model and actual output. The parameters (e.g., weights and/or biases) of the machine learning model may be updated to minimize (or maximize) the cost).
Chakraborty does not teach a loss function that comprises both a multiple instance learning loss term and an energy loss term.
Kokkinos, in the same field of endeavor of image segmentation, teaches a loss function that comprises both a multiple instance learning loss term and an energy loss term ([0018] According to an aspect of the invention, we introduce a training objective that explicitly encodes the uncertainty of human boundary annotations in a Multiple Instance Learning framework (MIL) and incorporate it as a loss for the training of a HED-style DCNN for boundary detection. [0082] As illustrated in Fig. 3 we now have additional processing streams, for semantic segmentation, for region proposals and object mapping. [0085] The module CRF implements fully connected conditional random field modelling and outputs semantic segmentation results, by integrating the classifier scores with some image-based measure of affinity. [0089] For these additional streams we use loss functions adapted to the task at hand, guiding the learning of the fully-connected layers so as to optimize a task-specific performance measure. We use the softmax loss function, as is common in all recent works on semantic segmentation).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chakraborty with the teachings of Kokkinos to use a loss function that comprises multiple instance learning loss and energy loss because "Unlike most recent multi-task architectures, in our work the flow of information across these different tasks is explicitly controlled and exploited through a recurrent architecture. For this we introduce a reference layer that conglomerates at every stage the results of the different tasks and then updates the solutions to the individual tasks in the light of the previous solutions. This recurrent architecture allows the synergy between the tasks to be exploited and leads to increasingly accurate solutions" [Kokkinos 0010].
Claims 2-6 and 12-16 are rejected under 35 U.S.C. 103 as being unpatentable over Chakraborty in view of Kokkinos, and further in view of Lin (US20180267997A1).
Regarding claim 2, Chakraborty and Kokkinos teach the method of claim 1. Chakraborty further teaches wherein the machine learning model comprises: an image encoder that generates a feature map based on an image ([col. 3 ln. 48-58] In some examples, one or more backbone networks (e.g., convolutional neural networks (CNNs)) may be used to generate feature embeddings representing the depictions of the objects in the query image and the target image. These embeddings may be input into a transformer encoder (e.g., along with positional embeddings describing a spatial position of various objects in the target image and/or query image). As described in further detail below, a robust feature mapping module may determine whether the object-of-interest of the query image is represented, in whole or in part, in the target image);
and an image segmentation model (object detection model) that generates a mask based on the feature map ([col. 3 ln. 61-67] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. As described in further detail below, the output of the example-based object detection system may be used in an object detection context, and/or may be used to automatically annotate images for training a high-precision object detection model (e.g., for a particular set of object classes)).
Chakraborty does not teach a text encoder that encodes text to generate one or more text embeddings.
Lin, in the same field of endeavor of embedded learning, teaches a text encoder that encodes text to generate one or more text embeddings ([0033] After obtaining word vector representations for each tag, an encoding scheme for the set of user-provided tags (w.sub.1, w.sub.2, . . . , w.sub.n) associated with a given image is calculated. [0034] This encoding scheme is referred to herein as a “soft topic.” [0038] A convolutional neural network then is employed to map the image feature vector and the soft topic feature vector into a common embedding space ε. More specifically, each image I is passed through a residual network and the penultimate layer is extracted and used as image feature vector v. An exemplary embedding network 500 is shown in FIG. 5).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to generate a mask based on text embeddings and a feature map (Chakraborty discusses natural language text embedding but does not apply it to the model) because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
Regarding claim 3, Chakraborty, Kokkinos, and Lin teach the method of claim 2. Lin teaches wherein the machine learning model further comprises a text adaptor that adapts the one or more text embeddings to generate one or more refined text embeddings ([0026] Further, the image embedding system 104 includes a soft topic feature vector (or weighted word vector) generating component 114. The soft topic feature vector generating component 114 is configured to generate a word vector representation for each of a plurality of keyword tags associated with an image, and calculate a weighted average of the generated word vector representations to generate a soft topic feature (or weighted word) vector).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to generate refined text embeddings because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
Regarding claim 4, Chakraborty, Kokkinos, and Lin teach the method of claim 3. Lin teaches wherein the machine learning model further comprises a concatenation module that concatenates the one or more refined text embeddings and the feature map ([0029] A schematic diagram illustrating an exemplary overall embedding learning framework 300 in accordance with implementations of the present disclosure is shown in FIG. 3. The framework 300 is generally configured to create image feature vectors from visual features computed from images, create soft topic feature (weighted word) vectors from keyword tags associated with images, and to align the image feature vectors and the soft topic feature vectors in a common embedding space utilizing embedding learning).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to concatenate the text embeddings and the feature map to "[utilize] the aligned vectors, [so that] a relevancy score is computed…for each of the keyword tags as it pertains to the subject image. Once trained, the framework described herein can be utilized to automatically associate keyword tags with additional input images and to rank the relevance of images with respect to queried keywords based upon associated relevancy scores" [Lin 0005].
Regarding claim 5, Chakraborty, Kokkinos, and Lin teach the method of claim 3. Chakraborty further teaches wherein the machine learning model further comprises: a second convolution layer (Fig. 7A) that projects the mask to one channel to generate a segmentation mask ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image).
Lin teaches a first convolution layer that fuses a concatenation of the one or more refined text embeddings and the feature map ([0038] A convolutional neural network then is employed to map the image feature vector and the soft topic feature vector into a common embedding space ε. [0029] The framework 300 is generally configured to create image feature vectors from visual features computed from images, create soft topic feature (weighted word) vectors from keyword tags associated with images, and to align the image feature vectors and the soft topic feature vectors in a common embedding space utilizing embedding learning).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to use a convolution layer to concatenate the text embeddings and the feature map to "[utilize] the aligned vectors, [so that] a relevancy score is computed…for each of the keyword tags as it pertains to the subject image. Once trained, the framework described herein can be utilized to automatically associate keyword tags with additional input images and to rank the relevance of images with respect to queried keywords based upon associated relevancy scores" [Lin 0005].
Regarding claim 6, Chakraborty, Kokkinos, and Lin teach the method of claim 2. Chakraborty further teaches wherein the image segmentation model comprises: a transformer encoder that generates refined feature tokens based on the feature map ([col. 4 ln. 4-16] In general, the encoder network of a transformer comprises a set of encoding layers that processes the input data one layer after another. Each encoder layer generates encodings (sometimes referred to as “tokens”). These tokens include feature representations (e.g., feature vectors and/or maps) that include information about which parts of the input data are relevant to each other. For example, for each input embedding, the encoder layers may determine which parts of the token are relevant to other tokens received as part of the input data. For example, the encoder layers may determine which portions of the target image are most relevant to the depiction of the object-of-interest in the query image);
a location decoder that generates location-aware queries based on the refined feature tokens and random queries ([col. 5 ln. 39-48] In various examples described herein, the position embedding may describe a spatial relationship of a plurality of tokens relative to other tokens. For example, an input token may represent a 16×16 (or other dimension grid) overlaid on an input frame of image data. The position embedding may describe a location of an item/token within the grid (e.g., relative to other tokens representing other portions of the frame). [col. 6 ln. 1-6] In a self-attention layer, the keys, values and queries come from the same place—in the case of the encoder, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. In “encoder-decoder attention” layers (sometimes referred to as “cross-attention”), the queries come from the previous decoder layer, and the keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. The decoder is attending to the encoder features); and
a mask decoder that generates the mask based on the location-aware queries and the refined feature tokens ([col. 12 ln. 28-33] At action 3, the query image/target image pairs (as pre-processed) may be input into the example-based object detector 114 for inference. At action 4, each target image may be annotated with bounding boxes/segmentation masks showing a detection of an object of the class shown in the relevant query images).
Chakraborty does not teach text embeddings.
Lin teaches text embeddings ([0026] Further, the image embedding system 104 includes a soft topic feature vector (or weighted word vector) generating component 114).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to generate feature tokens based on text embeddings and a feature map (Chakraborty discusses natural language text embedding but does not apply it to the model) because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
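By way of illustration only, the encoder-decoder arrangement addressed above with respect to claim 6 (a transformer encoder refining feature tokens via self-attention, a location decoder in which random queries cross-attend over the refined tokens, and a mask decoder scoring the resulting queries against the tokens) may be sketched as follows. The single-head attention, all dimensions, and the dot-product mask scoring below are hypothetical simplifications for explanatory purposes and are not drawn from the claims or the cited references.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Hypothetical sizes: N feature tokens of dimension d, Q object queries.
N, Q, d = 64, 5, 32
feature_tokens = rng.standard_normal((N, d))

# Transformer encoder (sketched as one self-attention pass): every token
# attends over all tokens, yielding "refined" feature tokens.
refined_tokens = attention(feature_tokens, feature_tokens, feature_tokens)

# Location decoder (sketched as cross-attention): randomly initialized
# queries attend over the refined encoder tokens, producing
# location-aware queries.
random_queries = rng.standard_normal((Q, d))
location_aware_queries = attention(random_queries, refined_tokens, refined_tokens)

# Mask decoder (sketched as a dot product): each location-aware query is
# scored against every refined token, giving a per-query mask over the
# N token positions.
masks = location_aware_queries @ refined_tokens.T  # (Q, N)
```

The sketch reflects the self-attention and encoder-decoder ("cross-attention") mechanics quoted from Chakraborty [col. 6 ln. 1-6], reduced to their simplest single-head form.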
Regarding claim 12, Chakraborty and Kokkinos teach the media of claim 11. Chakraborty further teaches wherein the trained machine learning model comprises: an image encoder that generates a feature map based on an image ([col. 3 ln. 48-58] In some examples, one or more backbone networks (e.g., convolutional neural networks (CNNs)) may be used to generate feature embeddings representing the depictions of the objects in the query image and the target image. These embeddings may be input into a transformer encoder (e.g., along with positional embeddings describing a spatial position of various objects in the target image and/or query image). As described in further detail below, a robust feature mapping module may determine whether the object-of-interest of the query image is represented, in whole or in part, in the target image. [col. 14 ln. 45-51] In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads. After training 312, the example-based object detector 114 is effective to detect previously-unseen objects based on the examples provided in the query images);
and an image segmentation model (object detection model) that generates a mask based on the feature map ([col. 3 ln. 61-67] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. As described in further detail below, the output of the example-based object detection system may be used in an object detection context, and/or may be used to automatically annotate images for training a high-precision object detection model (e.g., for a particular set of object classes)).
Chakraborty does not teach a text encoder that encodes text to generate one or more text embeddings.
Lin, in the same field of endeavor of embedded learning, teaches a text encoder that encodes text to generate one or more text embeddings ([0033] After obtaining word vector representations for each tag, an encoding scheme for the set of user-provided tags (w.sub.1, w.sub.2, . . . , w.sub.n) associated with a given image is calculated. [0034] This encoding scheme is referred to herein as a “soft topic.” [0038] A convolutional neural network then is employed to map the image feature vector and the soft topic feature vector into a common embedding space ε. More specifically, each image I is passed through a residual network and the penultimate layer is extracted and used as image feature vector v. An exemplary embedding network 500 is shown in FIG. 5).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Lin to generate a mask based on text embeddings and a feature map (Chakraborty discusses natural language text embedding but does not apply it to the model) because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
Regarding claim 13, Chakraborty, Kokkinos, and Lin teach the media of claim 12. Lin teaches wherein the trained machine learning model comprises: a text adaptor that adapts the one or more text embeddings to generate one or more refined text embeddings ([0017] Embodiments of the present invention relate to, among other things, a framework for associating images with topics that are indicative of the subject matter of the images utilizing embedding learning. The framework is trained utilizing multiple images, each image having associated visual characteristics and keyword tags. [0026] Further, the image embedding system 104 includes a soft topic feature vector (or weighted word vector) generating component 114. The soft topic feature vector generating component 114 is configured to generate a word vector representation for each of a plurality of keyword tags associated with an image, and calculate a weighted average of the generated word vector representations to generate a soft topic feature (or weighted word) vector).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Lin to generate refined text embeddings because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
Regarding claim 14, Chakraborty, Kokkinos, and Lin teach the media of claim 13. Lin teaches wherein the trained machine learning model comprises: a concatenation module that concatenates the one or more refined text embeddings and the feature map ([0017] Embodiments of the present invention relate to, among other things, a framework for associating images with topics that are indicative of the subject matter of the images utilizing embedding learning. The framework is trained utilizing multiple images, each image having associated visual characteristics and keyword tags. [0029] A schematic diagram illustrating an exemplary overall embedding learning framework 300 in accordance with implementations of the present disclosure is shown in FIG. 3. The framework 300 is generally configured to create image feature vectors from visual features computed from images, create soft topic feature (weighted word) vectors from keyword tags associated with images, and to align the image feature vectors and the soft topic feature vectors in a common embedding space utilizing embedding learning).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Lin to concatenate the text embeddings and the feature map to "[utilize] the aligned vectors, [so that] a relevancy score is computed…for each of the keyword tags as it pertains to the subject image. Once trained, the framework described herein can be utilized to automatically associate keyword tags with additional input images and to rank the relevance of images with respect to queried keywords based upon associated relevancy scores" [Lin 0005].
Regarding claim 15, Chakraborty, Kokkinos, and Lin teach the media of claim 13. Chakraborty further teaches wherein the trained machine learning model comprises: a second convolution layer (Fig. 7A) that projects the mask to one channel to generate a segmentation mask ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image).
Lin teaches a first convolution layer that fuses a concatenation of the one or more refined text embeddings and the feature map ([0038] A convolutional neural network then is employed to map the image feature vector and the soft topic feature vector into a common embedding space ε. [0029] The framework 300 is generally configured to create image feature vectors from visual features computed from images, create soft topic feature (weighted word) vectors from keyword tags associated with images, and to align the image feature vectors and the soft topic feature vectors in a common embedding space utilizing embedding learning).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Lin to use a convolution layer to fuse the concatenation of the text embeddings and the feature map to "[utilize] the aligned vectors, [so that] a relevancy score is computed…for each of the keyword tags as it pertains to the subject image. Once trained, the framework described herein can be utilized to automatically associate keyword tags with additional input images and to rank the relevance of images with respect to queried keywords based upon associated relevancy scores" [Lin 0005].
Regarding claim 16, Chakraborty, Kokkinos, and Lin teach the media of claim 12. Chakraborty further teaches wherein the image segmentation model comprises: a transformer encoder that generates refined feature tokens based on the feature tokens ([col. 4 ln. 4-16] In general, the encoder network of a transformer comprises a set of encoding layers that processes the input data one layer after another. Each encoder layer generates encodings (sometimes referred to as “tokens”). These tokens include feature representations (e.g., feature vectors and/or maps) that include information about which parts of the input data are relevant to each other. For example, for each input embedding, the encoder layers may determine which parts of the token are relevant to other tokens received as part of the input data. For example, the encoder layers may determine which portions of the target image are most relevant to the depiction of the object-of-interest in the query image);
a transformer decoder that generates the mask based on the refined feature tokens ([col. 12 ln. 28-33] At action 3, the query image/target image pairs (as pre-processed) may be input into the example-based object detector 114 for inference. At action 4, each target image may be annotated with bounding boxes/segmentation masks showing a detection of an object of the class shown in the relevant query images. [col. 5 ln. 39-48] In various examples described herein, the position embedding may describe a spatial relationship of a plurality of tokens relative to other tokens. For example, an input token may represent a 16×16 (or other dimension grid) overlaid on an input frame of image data. The position embedding may describe a location of an item/token within the grid (e.g., relative to other tokens representing other portions of the frame). [col. 6 ln. 1-6] In a self-attention layer, the keys, values and queries come from the same place—in the case of the encoder, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. In “encoder-decoder attention” layers (sometimes referred to as “cross-attention”), the queries come from the previous decoder layer, and the keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. The decoder is attending to the encoder features).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jacqueline R Zak whose telephone number is (571) 272-4077. The examiner can normally be reached M-F 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emily Terrell, can be reached at (571) 270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JACQUELINE R ZAK/Examiner, Art Unit 2666
/EMILY C TERRELL/Supervisory Patent Examiner, Art Unit 2666