DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Status
Claims 1-20 are pending for examination in the amendment filed 10/28/2025. Claims 1, 10-15, and 18-20 have been amended.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/19/2025 has been entered.
Priority
Acknowledgment is made of Applicant’s claim to the benefit of provisional application 63/388,091, filed 07/11/2022.
Response to Arguments
Applicant’s arguments, filed 10/28/2025, with respect to claim 1 have been considered but are moot because the new ground of rejection, necessitated by the newly added amendments, does not rely on the combination of references applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 7-11, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chakraborty (US11971955B1) in view of Kokkinos (EP3171297A1).
Regarding claim 1, Chakraborty teaches a computer-implemented method for training a machine learning model ([col. 10 ln. 33-38] In this section, the neural network architecture, RFM module and training methods of SPOT are described. A one-shot object detection network is created by extending DETR, an end-to-end detection model composed of a backbone (typically a convolutional residual network), followed by a Transformer Encoder-Decoder), the method comprising:
receiving a training data set that includes at least one image, text referring to at least one object included in the at least one image, and at least one bounding box annotation associated with the at least one object ([col. 2 ln. 30-35] For example, human annotators are typically provided annotation interfaces where the annotator draws a bounding box around specific objects of interest and provides a label for those objects. These annotated images may be used to train an object detection model to detect the objects that were labeled in the training data); and
performing, based on the training data set, one or more operations to generate a trained machine learning model to segment images based on text ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. [col. 14 ln. 27-49] The original training image 306 may have multiple classes of objects. For example, a first class of object may have first bounding box coordinates and a corresponding object class label 304a. Similarly, a second class of object may have second bounding box coordinates and a second label, and a third class of object may have third bounding box coordinates and a third label... In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads),
wherein the one or more operations to generate the trained machine learning model include minimizing a loss function ([col. 2-3 ln. 64-3] Generally, in machine learning models, such as neural networks, after initialization, annotated training data may be used to generate a cost or “loss” function that describes the difference between expected output of the machine learning model and actual output. The parameters (e.g., weights and/or biases) of the machine learning model may be updated to minimize (or maximize) the cost).
Chakraborty does not teach a loss function that comprises both a multiple instance learning loss term and an energy loss term.
Kokkinos, in the same field of endeavor of image segmentation, teaches a loss function that comprises both a multiple instance learning loss term and an energy loss term ([0018] According to an aspect of the invention, we introduce a training objective that explicitly encodes the uncertainty of human boundary annotations in a Multiple Instance Learning framework (MIL) and incorporate it as a loss for the training of a HED-style DCNN for boundary detection. [0082] As illustrated in Fig. 3 we now have additional processing streams, for semantic segmentation, for region proposals and object mapping. [0085] The module CRF implements fully connected conditional random field modelling and outputs semantic segmentation results, by integrating the classifier scores with some image-based measure of affinity. [0089] For these additional streams we use loss functions adapted to the task at hand, guiding the learning of the fully-connected layers so as to optimize a task-specific performance measure. We use the softmax loss function, as is common in all recent works on semantic segmentation).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Kokkinos to use a loss function that comprises multiple instance learning loss and energy loss because "Unlike most recent multi-task architectures, in our work the flow of information across these different tasks is explicitly controlled and exploited through a recurrent architecture. For this we introduce a reference layer that conglomerates at every stage the results of the different tasks and then updates the solutions to the individual tasks in the light of the previous solutions. This recurrent architecture allows the synergy between the tasks to be exploited and leads to increasingly accurate solutions" [Kokkinos 0010] and "guarantees an improvement of the (training) loss at every stage of processing." [Kokkinos 0105].
Regarding claim 7, Chakraborty and Kokkinos teach the method of claim 1. Chakraborty further teaches wherein the machine learning model comprises a transformer model ([col. 7 ln. 32-35] Although, in various examples described herein, the architecture of the example-based object detector 114 may include a robust feature mapping module and/or a transformer model).
Regarding claim 8, Chakraborty and Kokkinos teach the method of claim 1. Chakraborty further teaches wherein the text referring to the at least one object comprises one or more natural language expressions ([col. 12 ln. 44-46] Additionally, the multiple query objects detected may be labeled with their respective class labels (e.g., “vegetarian logo,” “warning logo,” etc.) and confidence scores).
Regarding claim 9, Chakraborty and Kokkinos teach the method of claim 1. Chakraborty further teaches processing a first image and a first text using the machine learning model ([col. 11 ln. 49-65] FIG. 2A is a block diagram illustrating an example-based annotation system, in accordance with various aspects of the present disclosure. As shown, a user may upload configurations (action 1) using a user interface (UI). The configurations may include target images (e.g., those images in which objects represented in query images are to be detected) and query images (e.g., examples images including classes of objects to be detected in the target images)… The user-uploaded query images may include class labels labeling the classes of the respective objects-of-interest depicted in the user-uploaded query images. [col. 12 ln. 18-19] At action 2, query-target pairs may be generated for input into the example-based object detector 114. [col. 7 ln. 32-37] Although, in various examples described herein, the architecture of the example-based object detector 114 may include a robust feature mapping module and/or a transformer model, in some other examples, other machine learning algorithms may be used to locate representations of an object represented in a query image within a target image) to generate a segmentation mask indicating one or more objects in the first image that are referred to by the first text ([col. 13 ln. 11-16] The example-based object detection system 106 may generate results 216. The results 216 may localize the object-of-interest in images in which depictions of the object-of-interest appear (in whole or in part). As previously described, the results 216 may include bounding box 218 localizations and/or segmentation masks).
Regarding claim 10, Chakraborty and Kokkinos teach the method of claim 1. Kokkinos teaches wherein the energy loss term comprises a conditional random field loss term ([0082] As illustrated in Fig. 3 we now have additional processing streams, for semantic segmentation, for region proposals and object mapping. [0085] The module CRF implements fully connected conditional random field modelling and outputs semantic segmentation results, by integrating the classifier scores with some image-based measure of affinity. [0089] For these additional streams we use loss functions adapted to the task at hand, guiding the learning of the fully-connected layers so as to optimize a task-specific performance measure. We use the softmax loss function, as is common in all recent works on semantic segmentation).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Kokkinos to use conditional random field loss to "output semantic segmentation results" [Kokkinos 0085] because "the fusion of multiple tasks and use recurrent processing to exploit the flow of information across them…guarantees an improvement of the (training) loss at every stage of processing." [Kokkinos 0105].
Regarding claim 11, Chakraborty teaches one or more non-transitory computer-readable media storing instructions that, when executed by at least one processor ([col. 25 ln. 7-11] Also, any logic or application described herein that comprises software or code can be embodied in any non-transitory computer-readable medium or memory for use by or in connection with an instruction execution system such as a processing component in a computer system), cause the at least one processor to perform the steps of:
receiving a training data set that includes at least one image, text referring to at least one object included in the at least one image, and at least one bounding box annotation associated with the at least one object ([col. 2 ln. 30-35] For example, human annotators are typically provided annotation interfaces where the annotator draws a bounding box around specific objects of interest and provides a label for those objects. These annotated images may be used to train an object detection model to detect the objects that were labeled in the training data); and
performing, based on the training data set, one or more operations to generate a trained machine learning model to segment images based on text ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. [col. 14 ln. 27-49] The original training image 306 may have multiple classes of objects. For example, a first class of object may have first bounding box coordinates and a corresponding object class label 304a. Similarly, a second class of object may have second bounding box coordinates and a second label, and a third class of object may have third bounding box coordinates and a third label... In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads),
wherein the one or more operations to generate the trained machine learning model include minimizing a loss function ([col. 2-3 ln. 64-3] Generally, in machine learning models, such as neural networks, after initialization, annotated training data may be used to generate a cost or “loss” function that describes the difference between expected output of the machine learning model and actual output. The parameters (e.g., weights and/or biases) of the machine learning model may be updated to minimize (or maximize) the cost).
Chakraborty does not teach a loss function that comprises both a multiple instance learning loss term and an energy loss term.
Kokkinos, in the same field of endeavor of image segmentation, teaches a loss function that comprises both a multiple instance learning loss term and an energy loss term ([0018] According to an aspect of the invention, we introduce a training objective that explicitly encodes the uncertainty of human boundary annotations in a Multiple Instance Learning framework (MIL) and incorporate it as a loss for the training of a HED-style DCNN for boundary detection. [0082] As illustrated in Fig. 3 we now have additional processing streams, for semantic segmentation, for region proposals and object mapping. [0085] The module CRF implements fully connected conditional random field modelling and outputs semantic segmentation results, by integrating the classifier scores with some image-based measure of affinity. [0089] For these additional streams we use loss functions adapted to the task at hand, guiding the learning of the fully-connected layers so as to optimize a task-specific performance measure. We use the softmax loss function, as is common in all recent works on semantic segmentation).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Kokkinos to use a loss function that comprises multiple instance learning loss and energy loss because "Unlike most recent multi-task architectures, in our work the flow of information across these different tasks is explicitly controlled and exploited through a recurrent architecture. For this we introduce a reference layer that conglomerates at every stage the results of the different tasks and then updates the solutions to the individual tasks in the light of the previous solutions. This recurrent architecture allows the synergy between the tasks to be exploited and leads to increasingly accurate solutions" [Kokkinos 0010].
Regarding claim 17, Chakraborty and Kokkinos teach the media of claim 11. Chakraborty further teaches wherein the instructions, when executed by the at least one processor, further cause the at least one processor ([col. 25 ln. 7-11] Also, any logic or application described herein that comprises software or code can be embodied in any non-transitory computer-readable medium or memory for use by or in connection with an instruction execution system such as a processing component in a computer system) to perform the step of generating the at least one bounding box annotation by performing one or more object detection operations based on the at least one image ([col. 3 ln. 58-61] Additionally, an object detection head of the example-based object detection system may output a bounding box surrounding the depiction of the object-of-interest present in the target image).
Regarding claim 18, Chakraborty and Kokkinos teach the media of claim 11. Chakraborty further teaches wherein the instructions, when executed by the at least one processor, further cause the at least one processor ([col. 25 ln. 7-11] Also, any logic or application described herein that comprises software or code can be embodied in any non-transitory computer-readable medium or memory for use by or in connection with an instruction execution system such as a processing component in a computer system) to perform the step of processing a first image and a first text using the trained machine learning model ([col. 11 ln. 49-65] FIG. 2A is a block diagram illustrating an example-based annotation system, in accordance with various aspects of the present disclosure. As shown, a user may upload configurations (action 1) using a user interface (UI). The configurations may include target images (e.g., those images in which objects represented in query images are to be detected) and query images (e.g., examples images including classes of objects to be detected in the target images)… The user-uploaded query images may include class labels labeling the classes of the respective objects-of-interest depicted in the user-uploaded query images. [col. 12 ln. 18-19] At action 2, query-target pairs may be generated for input into the example-based object detector 114. [col. 7 ln. 32-37] Although, in various examples described herein, the architecture of the example-based object detector 114 may include a robust feature mapping module and/or a transformer model, in some other examples, other machine learning algorithms may be used to locate representations of an object represented in a query image within a target image. [col. 14 ln. 45-51] In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads) to generate a segmentation mask indicating one or more objects in the first image that are referred to by the first text ([col. 13 ln. 11-16] The example-based object detection system 106 may generate results 216. The results 216 may localize the object-of-interest in images in which depictions of the object-of-interest appear (in whole or in part). As previously described, the results 216 may include bounding box 218 localizations and/or segmentation masks).
Regarding claim 19, Chakraborty and Kokkinos teach the media of claim 18. Chakraborty further teaches wherein the segmentation mask indicates one or more pixels in the first image that are associated with the one or more objects ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image).
Regarding claim 20, Chakraborty teaches a system (Fig. 1), comprising: one or more memories storing instructions; and one or more processors that are coupled to the one or more memories and, when executing the instructions ([col. 25 ln. 7-11] Also, any logic or application described herein that comprises software or code can be embodied in any non-transitory computer-readable medium or memory for use by or in connection with an instruction execution system such as a processing component in a computer system), are configured to:
receive a training data set that includes at least one image, text referring to at least one object included in the at least one image, and at least one bounding box annotation associated with the at least one object ([col. 2 ln. 30-35] For example, human annotators are typically provided annotation interfaces where the annotator draws a bounding box around specific objects of interest and provides a label for those objects. These annotated images may be used to train an object detection model to detect the objects that were labeled in the training data); and
perform, based on the training data set, one or more operations to generate a trained machine learning model to segment images based on text ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. [col. 14 ln. 27-49] The original training image 306 may have multiple classes of objects. For example, a first class of object may have first bounding box coordinates and a corresponding object class label 304a. Similarly, a second class of object may have second bounding box coordinates and a second label, and a third class of object may have third bounding box coordinates and a third label... In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads),
wherein the one or more operations to generate the trained machine learning model include minimizing a loss function ([col. 2-3 ln. 64-3] Generally, in machine learning models, such as neural networks, after initialization, annotated training data may be used to generate a cost or “loss” function that describes the difference between expected output of the machine learning model and actual output. The parameters (e.g., weights and/or biases) of the machine learning model may be updated to minimize (or maximize) the cost).
Chakraborty does not teach a loss function that comprises both a multiple instance learning loss term and an energy loss term.
Kokkinos, in the same field of endeavor of image segmentation, teaches a loss function that comprises both a multiple instance learning loss term and an energy loss term ([0018] According to an aspect of the invention, we introduce a training objective that explicitly encodes the uncertainty of human boundary annotations in a Multiple Instance Learning framework (MIL) and incorporate it as a loss for the training of a HED-style DCNN for boundary detection. [0082] As illustrated in Fig. 3 we now have additional processing streams, for semantic segmentation, for region proposals and object mapping. [0085] The module CRF implements fully connected conditional random field modelling and outputs semantic segmentation results, by integrating the classifier scores with some image-based measure of affinity. [0089] For these additional streams we use loss functions adapted to the task at hand, guiding the learning of the fully-connected layers so as to optimize a task-specific performance measure. We use the softmax loss function, as is common in all recent works on semantic segmentation).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Chakraborty with the teachings of Kokkinos to use a loss function that comprises multiple instance learning loss and energy loss because "Unlike most recent multi-task architectures, in our work the flow of information across these different tasks is explicitly controlled and exploited through a recurrent architecture. For this we introduce a reference layer that conglomerates at every stage the results of the different tasks and then updates the solutions to the individual tasks in the light of the previous solutions. This recurrent architecture allows the synergy between the tasks to be exploited and leads to increasingly accurate solutions" [Kokkinos 0010].
Claims 2-6 and 12-16 are rejected under 35 U.S.C. 103 as being unpatentable over Chakraborty in view of Kokkinos, and further in view of Lin (US20180267997A1).
Regarding claim 2, Chakraborty and Kokkinos teach the method of claim 1. Chakraborty further teaches wherein the machine learning model comprises: an image encoder that generates a feature map based on an image ([col. 3 ln. 48-58] In some examples, one or more backbone networks (e.g., convolutional neural networks (CNNs)) may be used to generate feature embeddings representing the depictions of the objects in the query image and the target image. These embeddings may be input into a transformer encoder (e.g., along with positional embeddings describing a spatial position of various objects in the target image and/or query image). As described in further detail below, a robust feature mapping module may determine whether the object-of-interest of the query image is represented, in whole or in part, in the target image);
and an image segmentation model (object detection model) that generates a mask based on the feature map ([col. 3 ln. 61-67] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. As described in further detail below, the output of the example-based object detection system may be used in an object detection context, and/or may be used to automatically annotate images for training a high-precision object detection model (e.g., for a particular set of object classes)).
Chakraborty does not teach a text encoder that encodes text to generate one or more text embeddings.
Lin, in the same field of endeavor of embedded learning, teaches a text encoder that encodes text to generate one or more text embeddings ([0033] After obtaining word vector representations for each tag, an encoding scheme for the set of user-provided tags (w.sub.1, w.sub.2, . . . , w.sub.n) associated with a given image is calculated. [0034] This encoding scheme is referred to herein as a “soft topic.” [0038] A convolutional neural network then is employed to map the image feature vector and the soft topic feature vector into a common embedding space ε. More specifically, each image I is passed through a residual network and the penultimate layer is extracted and used as image feature vector v. An exemplary embedding network 500 is shown in FIG. 5).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to generate a mask based on text embeddings and a feature map (Chakraborty discusses natural language text embedding but does not apply it to the model) because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
Regarding claim 3, Chakraborty, Kokkinos, and Lin teach the method of claim 2. Lin teaches wherein the machine learning model further comprises a text adaptor that adapts the one or more text embeddings to generate one or more refined text embeddings ([0026] Further, the image embedding system 104 includes a soft topic feature vector (or weighted word vector) generating component 114. The soft topic feature vector generating component 114 is configured to generate a word vector representation for each of a plurality of keyword tags associated with an image, and calculate a weighted average of the generated word vector representations to generate a soft topic feature (or weighted word) vector).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to generate refined text embeddings because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
Regarding claim 4, Chakraborty, Kokkinos, and Lin teach the method of claim 3. Lin teaches wherein the machine learning model further comprises a concatenation module that concatenates the one or more refined text embeddings and the feature map ([0029] A schematic diagram illustrating an exemplary overall embedding learning framework 300 in accordance with implementations of the present disclosure is shown in FIG. 3. The framework 300 is generally configured to create image feature vectors from visual features computed from images, create soft topic feature (weighted word) vectors from keyword tags associated with images, and to align the image feature vectors and the soft topic feature vectors in a common embedding space utilizing embedding learning).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to concatenate the text embeddings and the feature map to "[utilize] the aligned vectors, [so that] a relevancy score is computed…for each of the keyword tags as it pertains to the subject image. Once trained, the framework described herein can be utilized to automatically associate keyword tags with additional input images and to rank the relevance of images with respect to queried keywords based upon associated relevancy scores" [Lin 0005].
Regarding claim 5, Chakraborty, Kokkinos, and Lin teach the method of claim 3. Chakraborty further teaches wherein the machine learning model further comprises: a second convolution layer (Fig. 7A) that projects the mask to one channel to generate a segmentation mask ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image).
Lin teaches a first convolution layer that fuses a concatenation of the one or more refined text embeddings and the feature map ([0038] A convolutional neural network then is employed to map the image feature vector and the soft topic feature vector into a common embedding space ε. [0029] The framework 300 is generally configured to create image feature vectors from visual features computed from images, create soft topic feature (weighted word) vectors from keyword tags associated with images, and to align the image feature vectors and the soft topic feature vectors in a common embedding space utilizing embedding learning).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to use a convolution layer to concatenate the text embeddings and the feature map to "[utilize] the aligned vectors, [so that] a relevancy score is computed…for each of the keyword tags as it pertains to the subject image. Once trained, the framework described herein can be utilized to automatically associate keyword tags with additional input images and to rank the relevance of images with respect to queried keywords based upon associated relevancy scores" [Lin 0005].
Regarding claim 6, Chakraborty, Kokkinos, and Lin teach the method of claim 2. Chakraborty further teaches wherein the image segmentation model comprises: a transformer encoder that generates refined feature tokens based on the feature map ([col. 4 ln. 4-16] In general, the encoder network of a transformer comprises a set of encoding layers that processes the input data one layer after another. Each encoder layer generates encodings (sometimes referred to as “tokens”). These tokens include feature representations (e.g., feature vectors and/or maps) that include information about which parts of the input data are relevant to each other. For example, for each input embedding, the encoder layers may determine which parts of the token are relevant to other tokens received as part of the input data. For example, the encoder layers may determine which portions of the target image are most relevant to the depiction of the object-of-interest in the query image);
a location decoder that generates location-aware queries based on the refined feature tokens and random queries ([col. 5 ln. 39-48] In various examples described herein, the position embedding may describe a spatial relationship of a plurality of tokens relative to other tokens. For example, an input token may represent a 16×16 (or other dimension grid) overlaid on an input frame of image data. The position embedding may describe a location of an item/token within the grid (e.g., relative to other tokens representing other portions of the frame). [col. 6 ln. 1-6] In a self-attention layer, the keys, values and queries come from the same place—in the case of the encoder, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. In “encoder-decoder attention” layers (sometimes referred to as “cross-attention”), the queries come from the previous decoder layer, and the keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. The decoder is attending to the encoder features); and
a mask decoder that generates the mask based on the location-aware queries and the refined feature tokens ([col. 12 ln. 28-33] At action 3, the query image/target image pairs (as pre-processed) may be input into the example-based object detector 114 for inference. At action 4, each target image may be annotated with bounding boxes/segmentation masks showing a detection of an object of the class shown in the relevant query images).
Chakraborty does not teach text embeddings.
Lin teaches text embeddings ([0026] Further, the image embedding system 104 includes a soft topic feature vector (or weighted word vector) generating component 114).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Chakraborty with the teachings of Lin to generate feature tokens based on text embeddings and a feature map (Chakraborty discusses natural language text embedding but does not apply it to the model) because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
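By way of illustration only, the encoder-decoder arrangement addressed above with respect to claim 6 (a transformer encoder refining feature tokens via self-attention, a location decoder in which random queries cross-attend over the refined tokens, and a mask decoder scoring the resulting queries against the tokens) may be sketched as follows. The single-head attention, all dimensions, and the dot-product mask scoring below are hypothetical simplifications for explanatory purposes and are not drawn from the claims or the cited references.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Hypothetical sizes: N feature tokens of dimension d, Q object queries.
N, Q, d = 64, 5, 32
feature_tokens = rng.standard_normal((N, d))

# Transformer encoder (sketched as one self-attention pass): every token
# attends over all tokens, yielding "refined" feature tokens.
refined_tokens = attention(feature_tokens, feature_tokens, feature_tokens)

# Location decoder (sketched as cross-attention): randomly initialized
# queries attend over the refined encoder tokens, producing
# location-aware queries.
random_queries = rng.standard_normal((Q, d))
location_aware_queries = attention(random_queries, refined_tokens, refined_tokens)

# Mask decoder (sketched as a dot product): each location-aware query is
# scored against every refined token, giving a per-query mask over the
# N token positions.
masks = location_aware_queries @ refined_tokens.T  # (Q, N)
```

The sketch reflects the self-attention and encoder-decoder ("cross-attention") mechanics quoted from Chakraborty [col. 6 ln. 1-6], reduced to their simplest single-head form.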
Regarding claim 12, Chakraborty and Kokkinos teach the media of claim 11. Chakraborty further teaches wherein the trained machine learning model comprises: an image encoder that generates a feature map based on an image ([col. 3 ln. 48-58] In some examples, one or more backbone networks (e.g., convolutional neural networks (CNNs)) may be used to generate feature embeddings representing the depictions of the objects in the query image and the target image. These embeddings may be input into a transformer encoder (e.g., along with positional embeddings describing a spatial position of various objects in the target image and/or query image). As described in further detail below, a robust feature mapping module may determine whether the object-of-interest of the query image is represented, in whole or in part, in the target image. [col. 14 ln. 45-51] In at least some examples, the example-based object detector 114 may be trained end-to-end to update parameters of the feature-extraction backbone networks as well as the transformer model and/or object detection/image segmentation heads. After training 312, the example-based object detector 114 is effective to detect previously-unseen objects based on the examples provided in the query images);
and an image segmentation model (object detection model) that generates a mask based on the feature map ([col. 3 ln. 61-67] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image. As described in further detail below, the output of the example-based object detection system may be used in an object detection context, and/or may be used to automatically annotate images for training a high-precision object detection model (e.g., for a particular set of object classes)).
Chakraborty does not teach a text encoder that encodes text to generate one or more text embeddings.
Lin, in the same field of endeavor of embedded learning, teaches a text encoder that encodes text to generate one or more text embeddings ([0033] After obtaining word vector representations for each tag, an encoding scheme for the set of user-provided tags (w.sub.1, w.sub.2, . . . , w.sub.n) associated with a given image is calculated. [0034] This encoding scheme is referred to herein as a “soft topic.” [0038] A convolutional neural network then is employed to map the image feature vector and the soft topic feature vector into a common embedding space ε. More specifically, each image I is passed through a residual network and the penultimate layer is extracted and used as image feature vector v. An exemplary embedding network 500 is shown in FIG. 5).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Lin to generate a mask based on text embeddings and a feature map (Chakraborty discusses natural language text embedding but does not apply it to the model) because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
Regarding claim 13, Chakraborty, Kokkinos, and Lin teach the media of claim 12. Lin teaches wherein the trained machine learning model comprises: a text adaptor that adapts the one or more text embeddings to generate one or more refined text embeddings ([0017] Embodiments of the present invention relate to, among other things, a framework for associating images with topics that are indicative of the subject matter of the images utilizing embedding learning. The framework is trained utilizing multiple images, each image having associated visual characteristics and keyword tags. [0026] Further, the image embedding system 104 includes a soft topic feature vector (or weighted word vector) generating component 114. The soft topic feature vector generating component 114 is configured to generate a word vector representation for each of a plurality of keyword tags associated with an image, and calculate a weighted average of the generated word vector representations to generate a soft topic feature (or weighted word) vector).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Lin to generate refined text embeddings because "Traditional approaches to online image search are constrained in their ability to adequately identify and present the most relevant images available in response to an input query" [Lin 0018].
Regarding claim 14, Chakraborty, Kokkinos, and Lin teach the media of claim 13. Lin teaches wherein the trained machine learning model comprises: a concatenation module that concatenates the one or more refined text embeddings and the feature map ([0017] Embodiments of the present invention relate to, among other things, a framework for associating images with topics that are indicative of the subject matter of the images utilizing embedding learning. The framework is trained utilizing multiple images, each image having associated visual characteristics and keyword tags. [0029] A schematic diagram illustrating an exemplary overall embedding learning framework 300 in accordance with implementations of the present disclosure is shown in FIG. 3. The framework 300 is generally configured to create image feature vectors from visual features computed from images, create soft topic feature (weighted word) vectors from keyword tags associated with images, and to align the image feature vectors and the soft topic feature vectors in a common embedding space utilizing embedding learning).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Lin to concatenate the text embeddings and the feature map to "[utilize] the aligned vectors, [so that] a relevancy score is computed…for each of the keyword tags as it pertains to the subject image. Once trained, the framework described herein can be utilized to automatically associate keyword tags with additional input images and to rank the relevance of images with respect to queried keywords based upon associated relevancy scores" [Lin 0005].
Regarding claim 15, Chakraborty, Kokkinos, and Lin teach the media of claim 13. Chakraborty further teaches wherein the trained machine learning model comprises: a second convolution layer (Fig. 7A) that projects the mask to one channel to generate a segmentation mask ([col. 3 ln. 61-64] In some examples, a segmentation head may be used to output a segmentation mask that identifies those pixels corresponding to the object-of-interest in the target image).
Lin teaches a first convolution layer that fuses a concatenation of the one or more refined text embeddings and the feature map ([0038] A convolutional neural network then is employed to map the image feature vector and the soft topic feature vector into a common embedding space ε. [0029] The framework 300 is generally configured to create image feature vectors from visual features computed from images, create soft topic feature (weighted word) vectors from keyword tags associated with images, and to align the image feature vectors and the soft topic feature vectors in a common embedding space utilizing embedding learning).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the media of Chakraborty with the teachings of Lin to use a convolution layer to fuse the concatenation of the text embeddings and the feature map to "[utilize] the aligned vectors, [so that] a relevancy score is computed…for each of the keyword tags as it pertains to the subject image. Once trained, the framework described herein can be utilized to automatically associate keyword tags with additional input images and to rank the relevance of images with respect to queried keywords based upon associated relevancy scores" [Lin 0005].
Regarding claim 16, Chakraborty, Kokkinos, and Lin teach the media of claim 12. Chakraborty further teaches wherein the image segmentation model comprises: a transformer encoder that generates refined feature tokens based on the feature tokens ([col. 4 ln. 4-16] In general, the encoder network of a transformer comprises a set of encoding layers that processes the input data one layer after another. Each encoder layer generates encodings (sometimes referred to as “tokens”). These tokens include feature representations (e.g., feature vectors and/or maps) that include information about which parts of the input data are relevant to each other. For example, for each input embedding, the encoder layers may determine which parts of the token are relevant to other tokens received as part of the input data. For example, the encoder layers may determine which portions of the target image are most relevant to the depiction of the object-of-interest in the query image);
a transformer decoder that generates the mask based on the refined feature tokens ([col. 12 ln. 28-33] At action 3, the query image/target image pairs (as pre-processed) may be input into the example-based object detector 114 for inference. At action 4, each target image may be annotated with bounding boxes/segmentation masks showing a detection of an object of the class shown in the relevant query images. [col. 5 ln. 39-48] In various examples described herein, the position embedding may describe a spatial relationship of a plurality of tokens relative to other tokens. For example, an input token may represent a 16×16 (or other dimension grid) overlaid on an input frame of image data. The position embedding may describe a location of an item/token within the grid (e.g., relative to other tokens representing other portions of the frame). [col. 6 ln. 1-6] In a self-attention layer, the keys, values and queries come from the same place—in the case of the encoder, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. In “encoder-decoder attention” layers (sometimes referred to as “cross-attention”), the queries come from the previous decoder layer, and the keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. The decoder is attending to the encoder features).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jacqueline R Zak whose telephone number is (571) 272-4077. The examiner can normally be reached M-F 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emily Terrell, can be reached at (571) 270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JACQUELINE R ZAK/Examiner, Art Unit 2666
/EMILY C TERRELL/Supervisory Patent Examiner, Art Unit 2666