Prosecution Insights
Last updated: April 19, 2026
Application No. 18/421,239

IMAGE RETRIEVAL METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM

Non-Final OA: §102, §103, §112
Filed: Jan 24, 2024
Examiner: ANSARI, TAHMINA N
Art Unit: 2674
Tech Center: 2600 (Communications)
Assignee: Tencent Technology (Shenzhen) Company Limited
OA Round: 1 (Non-Final)

Grant Probability: 86% (Favorable)
OA Rounds: 1-2
To Grant: 2y 8m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 86% (743 granted / 868 resolved), +23.6% vs TC avg; above average.
Interview Lift: +17.9% (strong), comparing resolved cases with an interview against those without.
Typical Timeline: 2y 8m average prosecution; 33 applications currently pending.
Career History: 901 total applications across all art units.

Statute-Specific Performance

§101: 12.2% (-27.8% vs TC avg)
§102: 22.6% (-17.4% vs TC avg)
§103: 40.4% (+0.4% vs TC avg)
§112: 10.5% (-29.5% vs TC avg)

Tech Center averages are estimates. Based on career data from 868 resolved cases.

Office Action

§102 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

Claims 1-20 are pending in this application. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Specification

The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.

Claim Rejections - 35 USC § 112

The following is a quotation of the second paragraph of 35 U.S.C. 112:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 9 and 10 are rejected under 35 U.S.C. 112, second paragraph, as being indefinite for the following reason: it is unclear whether the claimed features can be met by any iterative optimization process, or whether the recited operations are an innovative feature of the invention that would require only a particular number of iterations for optimization.

Claims 9 and 10 recite the following features:

9. The image retrieval method according to claim 8, wherein: the loss value is a first loss value; and adjusting the parameter of the target model includes:
obtaining, by the electronic device, a category tag of the sample image;
performing, by the electronic device, feature extraction on the sample image based on the target model, to obtain a fifth feature that is of the sample image and that corresponds to the image modality;
classifying, by the electronic device, the sample image according to the fifth feature to obtain a sample category, and determining, by the electronic device, a second loss value according to the sample category and the category tag; and
adjusting, by the electronic device, the parameter of the target model according to the first loss value and the second loss value.

10. The image retrieval method according to claim 8, wherein: the loss value is a first loss value; and adjusting the parameter of the target model includes:
obtaining, by the electronic device, a first reference image that is of a same category as the sample image and a second reference image that is of a different category than the sample image;
performing, by the electronic device, feature extraction on the sample image, the first reference image, and the second reference image based on the target model, to obtain a fifth feature that is of the sample image and that corresponds to the image modality, a sixth feature of the first reference image, and a seventh feature of the second reference image;
determining, by the electronic device, a third similarity between the fifth feature and the sixth feature and a fourth similarity between the fifth feature and the seventh feature, and determining, by the electronic device, a second loss value according to the third similarity and the fourth similarity; and
adjusting, by the electronic device, the parameter of the target model according to the first loss value and the second loss value.

Additionally, Applicant has cited the reference Neculai et al. (Andrei Neculai, Yanbei Chen, Zeynep Akata, "Probabilistic Compositional Embeddings for Multimodal Image Retrieval," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 4547-4557), hereinafter "Neculai," in the IDS submitted by the applicant on May 8, 2025; the reference was also cited by the European Patent Office as a primary reference for the claimed invention. Section 3.3 of Neculai (Model Optimization) uses a multi-value loss metric that leverages a probabilistic similarity function between two probability distributions to quantify an overall loss value Lct computed across all positive pairs, with a regularization term added. It is unclear whether this iterative benchmark algorithm is sufficient to read on these claim terms, and further clarification is needed from the Applicant in order to fully understand the manner in which these features are to be interpreted. For purposes of examination on the merits and prior art, this section of Neculai will be cited as disclosing these claimed features.
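To make the two-loss structure recited in claims 9 and 10 concrete, the following is a minimal PyTorch sketch of one parameter-update step. The model interface (`extract`, `classify`), the cosine similarity, the triplet margin, and the loss weighting are illustrative assumptions, not details taken from the application or from Neculai.

```python
import torch
import torch.nn.functional as F

def adjust_parameters(model, optimizer, first_loss, sample_image,
                      category_tag=None, first_ref=None, second_ref=None,
                      weight=1.0, margin=0.2):
    """One update combining a first loss value with a second loss value,
    mirroring claim 9 (classification) and claim 10 (reference images).
    `model.extract` and `model.classify` are hypothetical methods."""
    # "Fifth feature": image-modality feature of the sample image.
    fifth = model.extract(sample_image)

    if category_tag is not None:
        # Claim 9 path: classify the sample via the fifth feature, then
        # compare the predicted sample category against the category tag.
        second_loss = F.cross_entropy(model.classify(fifth), category_tag)
    else:
        # Claim 10 path: "sixth"/"seventh" features of same-/different-
        # category reference images; the third and fourth similarities
        # feed a margin-based (triplet-style) loss.
        sixth, seventh = model.extract(first_ref), model.extract(second_ref)
        third_sim = F.cosine_similarity(fifth, sixth)
        fourth_sim = F.cosine_similarity(fifth, seventh)
        second_loss = F.relu(fourth_sim - third_sim + margin).mean()

    total = first_loss + weight * second_loss  # combine first and second loss values
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```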
Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 12 and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Bursztyn et al. (US 2023/0161808 A1, filed November 19, 2021), hereinafter "Bursztyn".

Consider Claims 1, 12 and 20. Bursztyn teaches:

1. An image retrieval method comprising: / 12. An electronic device comprising: one or more processors; and one or more memories storing at least one computer program that, when executed by the one or more processors, causes the one or more processors to: / 20. A non-transitory computer-readable storage medium storing at least one computer program that, when executed by one or more processors, causes the one or more processors to: (Bursztyn: Abstract; [0028]-[0040]; Figs. 1-3, Image Search System. "[0029] FIG. 1 shows an example of an image search system according to aspects of the present disclosure. The example shown includes user 100, user device 105, image search apparatus 110, cloud 115, and database 120. [0030] In the example of FIG. 1, one or more users 100 can provide an initial query to image search apparatus 110 via user device 105 and cloud 115. Image search apparatus 110 can retrieve results from database 120 via cloud 115 based on the query. A user can select a reference image from the results and input a critique of the reference image. Image search apparatus 110 can generate a preference statement based on the critique, retrieve one or more second images from database 120 based on a combination of the reference image, critique, and preference statement, and return the one or more second images to the one or more users 100." See also [0052]; Fig. 4.)

1. obtaining, by an electronic device, a candidate image set and query data in a plurality of modalities, the candidate image set including a plurality of candidate images; / 12. obtain a candidate image set and query data in a plurality of modalities, the candidate image set including a plurality of candidate images; / 20. obtain a candidate image set and query data in a plurality of modalities, the candidate image set including a plurality of candidate images; (Bursztyn: [0030], as quoted above; "[0031] One or more users 100 communicates with the image search apparatus 110 via one or more user devices 105 and the cloud 115.")

1. performing, by the electronic device, feature extraction on the query data based on a target model to obtain a plurality of first features of the query data; / 12. perform feature extraction on the query data based on a target model to obtain a plurality of first features of the query data; / 20. perform feature extraction on the query data based on a target model to obtain a plurality of first features of the query data; (Bursztyn: "[0035] In some cases, image search apparatus 110 allows one or more users 100 to incrementally refine search results using multi-modal queries that combine a reference image and one or more natural language statements. For example, a reference image that depicts a puppy can be combined with a user input (e.g., 'I prefer more cheerful') to generate a preference statement such as 'I prefer the puppies jumping, running, or playing'. In some cases, the user input can include a critique of the reference image, such as 'It still looks a bit boring' or 'That's too clean for a day in the park'. In some cases, image search apparatus 110 does not predefine the feature space of the user input or make assumptions about the type of language that constitutes a critique. [0036] In some cases, image search apparatus 110 includes an architecture that is based on text-based image retrieval and multi-modal image retrieval processes with cross-modal embeddings and critique understanding process with natural language generation. [0069] According to some aspects, the one or more neural networks included in text generator 515 includes a transformer. A transformer is a deep learning network that is useful in natural language processing applications and that operates according to an attention mechanism. An attention mechanism is a method of placing differing levels of importance on different elements of an input. Calculating attention can be a three-step process of computing the similarity between a query and key vectors obtained from an input to generate attention weights, using a softmax function to normalize the attention weights, and weighing the attention weights together with the corresponding values. A softmax function is used as the activation function of a neural network to normalize the output of the network to a probability distribution over predicted output classes. After applying the softmax function, each component of the feature map is in the interval (0, 1) and the components add up to one. These values are interpreted as probabilities.")
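The three-step attention computation quoted from Bursztyn [0069] can be written compactly. Below is a minimal sketch for orientation; the tensor shapes and the 1/sqrt(d_k) scaling are standard transformer conventions, not details asserted by the reference.

```python
import torch
import torch.nn.functional as F

def attention(query, key, value):
    """Three steps per the quoted passage: (1) query-key similarities form
    attention weights, (2) softmax normalizes them to a distribution whose
    components lie in (0, 1) and sum to one, (3) the normalized weights
    are combined with the corresponding values."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5  # step 1: similarities
    weights = F.softmax(scores, dim=-1)                  # step 2: normalization
    return weights @ value                               # step 3: weighted values
```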
1. performing, by the electronic device, feature extraction on the candidate images based on the target model, to obtain a plurality of second features of the candidate images, each of the second features being obtained by feature extraction after the candidate images are aligned with the query data in one of the modalities; / 12. perform feature extraction on the candidate images based on the target model, to obtain a plurality of second features of the candidate images, each of the second features being obtained by feature extraction after the candidate images are aligned with the query data in one of the modalities; / 20. perform feature extraction on the candidate images based on the target model, to obtain a plurality of second features of the candidate images, each of the second features being obtained by feature extraction after the candidate images are aligned with the query data in one of the modalities; (Bursztyn: [0059]-[0060]; Fig. 5. "[0068] In one aspect, machine learning model 510 includes text generator 515, multi-modal encoder 520, caption generator 525, and intent classifier 530. In some cases, each of text generator 515, multi-modal encoder 520, caption generator 525, and intent classifier 530 include one or more artificial neural networks." [0069], as quoted above. "[0073] According to some aspects, multi-modal encoder 520 encodes the preference statement in an embedding space to obtain an encoded preference statement. The term 'embedding space' in a machine learning context refers to a vector space that is used in a word embedding. In some embodiments, the vector space is multi-modal. For example, it can represent both words and images simultaneously. Embodiments of the disclosure use the multi-modal characteristic of the embedding space to match multi-modal queries (e.g., a reference image combined with text input) to images. Thus, the term 'embedding space' includes vector spaces in which concepts from either one modality (e.g., words) or multiple modalities (e.g., images and words) can be represented. In some embodiments, concepts represented in either text or image are positioned in the vector space in a manner such that similar concepts are located nearby.")

1. determining, by the electronic device, a plurality of similarities each between the candidate images and the query data in one of the modalities according to the first features and the second features; / 12. determine a plurality of similarities each between the candidate images and the query data in one of the modalities according to the first features and the second features; / 20. determine a plurality of similarities each between the candidate images and the query data in one of the modalities according to the first features and the second features; (Bursztyn: "[0059] FIG. 5 shows an example of an image search apparatus according to aspects of the present disclosure. The example shown includes training component 500, search component 505, and machine learning model 510. In some embodiments, the image search apparatus 500 is an example of, or includes aspects of, the computing system 400. For example, in some cases, training component 500, search component 505, and machine learning model 510 can be implemented as hardware circuits that interact with components similar to the ones illustrated in FIG. 4 via a channel. For example, in some cases, training component 500, search component 505, and machine learning model 510 can be implemented as software stored in a memory device. [0060] According to some aspects, training component 500 receives training data including a set of input statements and a set of ground truth preference statements corresponding to the input statements. In some examples, training component 500 computes a loss function for the machine learning model 510 by comparing a preference statement to a corresponding preference statement from the set of ground truth preference statements. In some examples, training component 500 trains the machine learning model 510 using the training data to generate a trained machine learning model 510, where the trained machine learning model 510 is configured to perform a search operation to retrieve an image that matches a query preference statement corresponding to a user input.")

1. determining, by the electronic device, result image sets corresponding to a plurality of query data combinations from the candidate image set according to the similarities, the query data combinations including the query data in at least one of the modalities; / 12. determine result image sets corresponding to a plurality of query data combinations from the candidate image set according to the similarities, the query data combinations including the query data in at least one of the modalities; / 20. determine result image sets corresponding to a plurality of query data combinations from the candidate image set according to the similarities, the query data combinations including the query data in at least one of the modalities; (Bursztyn: "[0063] According to some aspects, search component 505 receives a search query including user input for a reference image. In some examples, search component 505 performs a search operation using a multi-modal search encoding to retrieve a second image, where the second image differs from the reference image based on the user input for the reference image. For example, the second image can have characteristics that are similar to the user input but not to the reference image. [0064] In some examples, search component 505 receives an additional search query including a user input for the second image. In some examples, search component 505 retrieves an additional second image based on the additional preference statement. In some examples, search component 505 retrieves an additional second image based on the additional search query. [0065] In some examples, search component 505 compares each of a set of encoded images to the multi-modal search encoding to obtain a similarity score for each of the set of encoded images. In some examples, search component 505 selects the second image from among the set of encoded images based on the similarity score corresponding to the second image.")

1. and merging, by the electronic device, the result image sets to obtain an image retrieval result. / 12. and merge the result image sets to obtain an image retrieval result. / 20. and merge the result image sets to obtain an image retrieval result. (Bursztyn: "[0066] In some examples, search component 505 retrieves a set of images based on the multi-modal search encoding. In some examples, search component 505 receives a user selection identifying one of the set of images. In some examples, search component 505 receives a subsequent search query including a subsequent user input for the set of images. In some examples, search component 505 retrieves a set of additional images based on the subsequent critique. [0067] According to some aspects, search component 505 performs a search operation to retrieve an image that matches one or more query preference statements. In some examples, search component 505 retrieves a set of images based on the search operation. In some examples, search component 505 receives a user selection identifying one of the set of images.")
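The scoring loop in Bursztyn [0065] (compare each encoded candidate image to the multi-modal search encoding, then select by similarity score) reduces to a few lines. The sketch below assumes cosine similarity and a top-k selection; the reference only says "similarity score", so both choices are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_encoding, candidate_encodings, k=5):
    """Compare each encoded candidate (N, D) to a multi-modal search
    encoding (D,) and keep the k best matches, per Bursztyn [0065]."""
    scores = F.cosine_similarity(
        query_encoding.unsqueeze(0), candidate_encodings, dim=-1)  # (N,)
    top = torch.topk(scores, k)
    return top.indices, top.values  # retrieved image indices and their scores
```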
Claims 1, 12 and 20 are also rejected under 35 U.S.C. 102(a)(1) as being anticipated by Neculai (cited in full above).

Consider Claims 1, 12 and 20. Neculai teaches:

1. An image retrieval method comprising: / 12. An electronic device comprising: one or more processors; and one or more memories storing at least one computer program that, when executed by the one or more processors, causes the one or more processors to: / 20. A non-transitory computer-readable storage medium storing at least one computer program that, when executed by one or more processors, causes the one or more processors to: (Neculai: page 4547, Abstract. "Existing works in image retrieval often consider retrieving images with one or two query inputs, which do not generalize to multiple queries. In this work, we investigate a more challenging scenario for composing multiple multimodal queries in image retrieval. Given an arbitrary number of query images and (or) texts, our goal is to retrieve target images containing the semantic concepts specified in multiple multimodal queries. To learn an informative embedding that can flexibly encode the semantics of various queries, we propose a novel multimodal probabilistic composer (MPC). Specifically, we model input images and texts as probabilistic embeddings, which can be further composed by a probabilistic composition rule to facilitate image retrieval with multiple multimodal queries. We propose a new benchmark based on the MS-COCO dataset and evaluate our model on various setups that compose multiple images and (or) text queries for multimodal image retrieval. Without bells and whistles, we show that our probabilistic model formulation significantly outperforms existing related methods on multimodal image retrieval while generalizing well to query with different amounts of inputs given in arbitrary visual and (or) textual modalities.")

1. obtaining, by an electronic device, a candidate image set and query data in a plurality of modalities, the candidate image set including a plurality of candidate images; / 12. obtain a candidate image set and query data in a plurality of modalities, the candidate image set including a plurality of candidate images; / 20. obtain a candidate image set and query data in a plurality of modalities, the candidate image set including a plurality of candidate images; (Neculai: page 4547, Sec. 1, Introduction. "As Figure 1 shows, given an arbitrary number of image and (or) text queries, our goal is to retrieve the images that contain all the semantic concepts specified in the queries. Inspired by the recent advances in compositional learning for visual recognition [37, 41, 52], we tackle this problem by learning a compositional embedding to flexibly encapsulate the multiple semantic concepts specified in the multimodal queries, and to be used for retrieving the more relevant images." Page 4548: "Our model formulation offers two unique properties to learn a compositional embedding for multimodal image retrieval. First, our probabilistic composer allows to compose embeddings of a flexible amount of queries in arbitrary modalities. Second, its probabilistic nature allows to encode semantics as well as ambiguities of a given input, thus well capturing the polysemantic information in text queries, e.g. a text query 'dog' may refer to a variety of dog breeds that differ visually. These properties facilitate better performance in multimodal image retrieval.")

1. performing, by the electronic device, feature extraction on the query data based on a target model to obtain a plurality of first features of the query data; / 12. perform feature extraction on the query data based on a target model to obtain a plurality of first features of the query data; / 20. perform feature extraction on the query data based on a target model to obtain a plurality of first features of the query data; (Neculai: page 4548, Sec. 1. "Our contribution is three-fold:
• We establish a new multimodal image retrieval benchmark using the MS-COCO dataset to investigate image retrieval using an arbitrary number of queries in arbitrary modalities. We evaluate a variety of settings including (1) using different combinations of input modalities, and (2) using various numbers of queries.
• We propose a Multimodal Probabilistic Composer (MPC), which features a new probabilistic rule to compose probabilistic embeddings and a new probabilistic similarity metric to compare probabilistic embeddings, which together lead to its superior model performance in composing multimodal queries for image retrieval.
• We show that our model outperforms existing multi-modal fusion methods significantly for multi-modal image retrieval. To further analyze our model design rationale, we also conduct an in-depth experimental analysis.")

1. performing, by the electronic device, feature extraction on the candidate images based on the target model, to obtain a plurality of second features of the candidate images, each of the second features being obtained by feature extraction after the candidate images are aligned with the query data in one of the modalities; / 12. perform feature extraction on the candidate images based on the target model, to obtain a plurality of second features of the candidate images, each of the second features being obtained by feature extraction after the candidate images are aligned with the query data in one of the modalities; / 20. perform feature extraction on the candidate images based on the target model, to obtain a plurality of second features of the candidate images, each of the second features being obtained by feature extraction after the candidate images are aligned with the query data in one of the modalities; (Neculai: pages 4548-4549, Sec. 3, Multimodal Probabilistic Composer. "Given a composite set of k input samples – where each input specifies a semantic concept (e.g. 'dog', 'sport ball'), our goal is to learn a compositional embedding for retrieving the corresponding target images that contain the set of specified semantic concepts. Each input can be given in visual or textual modality, and passed through modality specific encoder to obtain its embedding. Accordingly, the composition of different embeddings should be modality agnostic, which means our model by default is trained to combine an arbitrary set of samples in arbitrary modalities." [Figure image omitted.])

1. determining, by the electronic device, a plurality of similarities each between the candidate images and the query data in one of the modalities according to the first features and the second features; / 12. determine a plurality of similarities each between the candidate images and the query data in one of the modalities according to the first features and the second features; / 20. determine a plurality of similarities each between the candidate images and the query data in one of the modalities according to the first features and the second features; (Neculai: page 4550, Sec. 3.3, Model Optimization. "Similar to standard objectives in non-probabilistic metric learning such as triplet loss and contrastive loss, our training objective is imposed to pull the distribution of the compositional embedding and the target image distribution closer, while pushing away the distributions of negative pairs. To achieve this aim, we first need to define a probabilistic similarity function between two probability distributions. Probabilistic similarity. To quantify the similarity between two probability distributions, Monte-Carlo estimation can be adopted which draws a number of J data points …" [Equation image omitted.])
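The equations behind the passages just quoted appear only as images in the original Office Action. As a minimal illustration of the two ideas, the sketch below composes diagonal-Gaussian embeddings (Neculai, Secs. 3.1-3.2) and estimates a probabilistic similarity by Monte-Carlo sampling (Sec. 3.3). Diagonal covariances, the sample count J, and the negative-squared-distance kernel are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def compose_gaussians(mus, sigmas):
    """Product of k diagonal Gaussian embeddings (mus, sigmas: (k, D)):
    precision-weighted combination yields the compositional embedding."""
    precisions = 1.0 / sigmas.pow(2)            # (k, D)
    var = 1.0 / precisions.sum(dim=0)           # composed variance, (D,)
    mu = var * (precisions * mus).sum(dim=0)    # composed mean, (D,)
    return mu, var.sqrt()

def monte_carlo_similarity(mu_q, sig_q, mu_t, sig_t, J=8):
    """Probabilistic similarity via Monte-Carlo estimation: draw J samples
    from each distribution and average a pairwise similarity kernel."""
    zq = mu_q + sig_q * torch.randn(J, mu_q.numel())  # samples from query dist.
    zt = mu_t + sig_t * torch.randn(J, mu_t.numel())  # samples from target dist.
    d2 = (zq.unsqueeze(1) - zt.unsqueeze(0)).pow(2).sum(-1)  # (J, J) distances
    return (-d2).mean()  # higher means the distributions overlap more
```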
1. determining, by the electronic device, result image sets corresponding to a plurality of query data combinations from the candidate image set according to the similarities, the query data combinations including the query data in at least one of the modalities; / 12. determine result image sets corresponding to a plurality of query data combinations from the candidate image set according to the similarities, the query data combinations including the query data in at least one of the modalities; / 20. determine result image sets corresponding to a plurality of query data combinations from the candidate image set according to the similarities, the query data combinations including the query data in at least one of the modalities; (Neculai: page 4549, Sec. 3.1, Modality-Specific Probabilistic Embeddings; figure image omitted. "In this section, we first describe how different modalities (i.e. image and text) are modeled and then detail how probabilistic embeddings are learned to represent each input. Image encoder. We use a ResNet [23] backbone fResNet, with an additional linear projection layer fimg, as our image encoder to learn the image embeddings. Given an input image (referred as s1), we pass it through fResNet to obtain the feature map ϕimg and compute its feature encoding as: zimg = fimg(ϕimg), where zimg ∈ R^D. Text encoder. To encode text information, we use GloVe word embeddings [48] to encode each word (denoted as fGloVe) and train a bidirectional GRU [11] (denoted as ftxt) to learn the text embeddings. Given a text snippet (referred as s2), we obtained the word embeddings: ϕtxt = fGloVe(s2). We pass ϕtxt through the GRU to obtain its feature encoding: ztxt = ftxt(ϕtxt), where ztxt ∈ R^D. Probabilistic embeddings. Our model is motivated with the flexibility to take a composite set of k queries in arbitrary modalities. With this design rationale in mind, we propose to model each embedding as a multivariate Gaussian probability density function (PDF), such that the compositions of different embeddings can be achieved by composing different Gaussian PDFs through a parametric probabilistic rule, i.e. the product of k Gaussian PDFs [5]. Below, we detail how each embedding is modeled as a multivariate Gaussian, similar to recent probabilistic embeddings works [12, 45], and present how we derive compositional embeddings through our probabilistic composer in Section 3.2.")

1. and merging, by the electronic device, the result image sets to obtain an image retrieval result. / 12. and merge the result image sets to obtain an image retrieval result. / 20. and merge the result image sets to obtain an image retrieval result. (Neculai: pages 4552-4553, Sec. 4.2, Comparing to the State-of-the-Art, Qualitative results. "Figure 3 shows the qualitative image retrieval results using a composite set of queries in different modalities. When given two inputs in arbitrary modalities (see (a), (b), (c)), our model can retrieve the images that contain the set of semantic concepts specified in the input, e.g. in the second example in (b), 'broccoli' image and 'carrot' text are composed to retrieve a dish with both concepts. When given three inputs (see (d)), our model can discover the images that cover the multiple semantic concepts specified in the input. Another interesting observation …" [Figure image omitted.])
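The modality-specific encoders quoted from Neculai Sec. 3.1 are straightforward to sketch. Below is a hedged PyTorch sketch of the text branch (GloVe word embeddings into a bidirectional GRU, with heads producing the parameters of a diagonal Gaussian embedding); the vocabulary size, dimensions, mean pooling, and log-variance parameterization are assumptions, and the image branch (ResNet feature map plus a linear projection) would be analogous.

```python
import torch
import torch.nn as nn

class ProbabilisticTextEncoder(nn.Module):
    """Sketch of Neculai's text encoder (Sec. 3.1): GloVe-style word
    embeddings fed to a bidirectional GRU, with linear heads producing
    the mean and log-variance of a diagonal Gaussian embedding."""
    def __init__(self, vocab_size=20000, word_dim=300, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)  # stand-in for pretrained GloVe
        self.gru = nn.GRU(word_dim, embed_dim // 2,
                          bidirectional=True, batch_first=True)
        self.mu_head = nn.Linear(embed_dim, embed_dim)
        self.logvar_head = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):
        words = self.embed(token_ids)   # (B, T, word_dim)
        feats, _ = self.gru(words)      # (B, T, embed_dim), both directions
        z = feats.mean(dim=1)           # pool over tokens (assumed)
        return self.mu_head(z), self.logvar_head(z)  # Gaussian parameters
```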
Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Bursztyn (US 2023/0161808 A1, filed November 19, 2021) in view of Neculai (CVPR Workshops 2022, pp. 4547-4557), both as cited above.

Consider Claims 1, 12 and 20. Bursztyn teaches each limitation of these claims as mapped in the §102 rejection above (Bursztyn: Abstract; [0028]-[0040]; Figs. 1-5; [0030]-[0031]; [0035]-[0036]; [0052]; [0059]-[0060]; [0063]-[0069]; [0073]).

Even if Bursztyn does not specifically teach "perform feature extraction on the candidate images based on the target model, to obtain a plurality of second features of the candidate images, each of the second features being obtained by feature extraction after the candidate images are aligned with the query data in one of the modalities," Neculai teaches each limitation of claims 1, 12 and 20 as mapped in the §102 rejection above (Neculai: Abstract; Sec. 1, pp. 4547-4548; Sec. 3, pp. 4548-4549; Sec. 3.1, p. 4549; Sec. 3.3, p. 4550; Sec. 4.2, pp. 4552-4553).

It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to modify Bursztyn's neural-network-based multi-modal image search system with the probabilistic compositional embeddings for multimodal image retrieval described by Neculai. The determination of obviousness is predicated upon the following findings: one skilled in the art would have been motivated to modify Bursztyn in order to leverage an improved probabilistic model formulation that, as presented by Neculai, significantly outperforms existing methods on multimodal image retrieval. Furthermore, the prior art collectively includes each claimed element (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface, and/or programming techniques, without changing a "fundamental" operating principle of Bursztyn, while the teaching of Neculai continues to perform the same function as originally taught prior to being combined, in order to produce the repeatable and predictable result of an improved probabilistic model formulation for multimodal image retrieval. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claims in question.
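Before turning to the dependent claims, the final limitation of independent claims 1, 12 and 20 (determine a result image set per query-data combination, then merge the sets) can be sketched as follows. The enumeration of combinations, the score averaging within a combination, and deduplication by best score on merge are assumptions for illustration; the claims do not specify a merge rule.

```python
import torch

def merge_result_sets(similarities, k=10):
    """Build a result image set for each query-data combination (single
    modality or both), then merge the sets into one retrieval result.
    `similarities` maps a modality name to a (num_candidates,) score
    tensor, one score per candidate image in the candidate image set."""
    combinations = [("text",), ("image",), ("text", "image")]
    merged = {}
    for combo in combinations:
        scores = torch.stack([similarities[m] for m in combo]).mean(dim=0)
        top = torch.topk(scores, k)  # result image set for this combination
        for idx, score in zip(top.indices.tolist(), top.values.tolist()):
            merged[idx] = max(score, merged.get(idx, float("-inf")))
    # Final image retrieval result: candidates ranked by best merged score.
    return sorted(merged, key=merged.get, reverse=True)
```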
It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to modify Bursztyn's neural-network-based multi-modal image search system with the probabilistic compositional embeddings for multimodal image retrieval described by Neculai. The determination of obviousness is predicated upon the following findings: one skilled in the art would have been motivated to modify Bursztyn in order to leverage an improved probabilistic model formulation that significantly outperforms on multimodal image retrieval, as presented by Neculai. Furthermore, the prior art collectively includes each claimed element (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface, and/or programming techniques, without changing a “fundamental” operating principle of Bursztyn, while the teaching of Neculai continues to perform the same function as originally taught prior to being combined, in order to produce the repeatable and predictable result of an improved probabilistic model formulation for multimodal image retrieval. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claim in question.

Consider Claims 2 and 13. The combination of Bursztyn and Neculai teaches:

2. The image retrieval method according to claim 1, wherein: the query data in the plurality of modalities includes query text and a query image, and the target model includes a text modality alignment unit and an image feature extraction unit; / 13. The electronic device according to claim 12, wherein: the query data in the plurality of modalities includes query text and a query image, and the target model includes a text modality alignment unit and an image feature extraction unit;

(Bursztyn: [0059] FIG. 5 shows an example of an image search apparatus according to aspects of the present disclosure. The example shown includes training component 500, search component 505, and machine learning model 510. In some embodiments, the image search apparatus 500 is an example of, or includes aspects of, the computing system 400. For example, in some cases, training component 500, search component 505, and machine learning model 510 can be implemented as hardware circuits that interact with components similar to the ones illustrated in FIG. 4 via a channel. For example, in some cases, training component 500, search component 505, and machine learning model 510 can be implemented as software stored in a memory device. [0060] According to some aspects, training component 500 receives training data including a set of input statements and a set of ground truth preference statements corresponding to the input statements. In some examples, training component 500 computes a loss function for the machine learning model 510 by comparing a preference statement to a corresponding preference statement from the set of ground truth preference statements. In some examples, training component 500 trains the machine learning model 510 using the training data to generate a trained machine learning model 510, where the trained machine learning model 510 is configured to perform a search operation to retrieve an image that matches a query preference statement corresponding to a user input.

Neculai: page 4547, section 1, Introduction: As Figure 1 shows, given an arbitrary number of image and (or) text queries, our goal is to retrieve the images that contain all the semantic concepts specified in the queries. Inspired by the recent advances in compositional learning for visual recognition [37, 41, 52], we tackle this problem by learning a compositional embedding to flexibly encapsulate the multiple semantic concepts specified in the multimodal queries, and to be used for retrieving the more relevant images. Page 4548: Our model formulation offers two unique properties to learn a compositional embedding for multimodal image retrieval. First, our probabilistic composer allows to compose embeddings of a flexible amount of queries in arbitrary modalities. Second, its probabilistic nature allows to encode semantics as well as ambiguities of a given input, thus well capturing the polysemantic information in text queries, e.g. a text query “dog” may refer to a variety of dog breeds that differ visually. These properties well facilitate better performance in multimodal image retrieval.)

2. and performing feature extraction on the candidate images based on the target model, to obtain the second features includes: performing, by the electronic device, feature extraction on the candidate images based on the text modality alignment unit, to obtain the second feature of the candidate images obtained after the candidate images are aligned with the query text; / 13. and the at least one computer program further causes the one or more processors to: perform feature extraction on the candidate images based on the text modality alignment unit, to obtain the second feature of the candidate images obtained after the candidate images are aligned with the query text;
(Bursztyn: [0063] According to some aspects, search component 505 receives a search query including user input for a reference image. In some examples, search component 505 performs a search operation using a multi-modal search encoding to retrieve a second image, where the second image differs from the reference image based on the user input for the reference image. For example, the second image can have characteristics that are similar to the user input but not to the reference image. [0064] In some examples, search component 505 receives an additional search query including a user input for the second image. In some examples, search component 505 retrieves an additional second image based on the additional preference statement. In some examples, search component 505 retrieves an additional second image based on the additional search query. [0065] In some examples, search component 505 compares each of a set of encoded images to the multi-modal search encoding to obtain a similarity score for each of the set of encoded images. In some examples, search component 505 selects the second image from among the set of encoded images based on the similarity score corresponding to the second image.

Neculai: pages 4548-4549, section 3, Multimodal Probabilistic Composer: Given a composite set of k input samples, where each input specifies a semantic concept (e.g. “dog”, “sport ball”), our goal is to learn a compositional embedding for retrieving the corresponding target images that contain the set of specified semantic concepts. Each input can be given in visual or textual modality, and passed through a modality-specific encoder to obtain its embedding. Accordingly, the composition of different embeddings should be modality agnostic, which means our model by default is trained to combine an arbitrary set of samples in arbitrary modalities. [figure image omitted])

2. and performing, by the electronic device, feature extraction on the candidate images based on the image feature extraction unit to obtain an image feature of the candidate images as the second feature of the candidate images obtained after the candidate images are aligned with the query image. / 13. and perform feature extraction on the candidate images based on the image feature extraction unit to obtain an image feature of the candidate images as the second feature of the candidate images obtained after the candidate images are aligned with the query image.

(Bursztyn: [0066] In some examples, search component 505 retrieves a set of images based on the multi-modal search encoding. In some examples, search component 505 receives a user selection identifying one of the set of images. In some examples, search component 505 receives a subsequent search query including a subsequent user input for the set of images. In some examples, search component 505 retrieves a set of additional images based on the subsequent critique. [0067] According to some aspects, search component 505 performs a search operation to retrieve an image that matches one or more query preference statements. In some examples, search component 505 retrieves a set of images based on the search operation. In some examples, search component 505 receives a user selection identifying one of the set of images.

Neculai: page 4549, section 3.1, Modality-Specific Probabilistic Embeddings, quoted above.)
Consider Claims 3 and 14. The combination of Bursztyn and Neculai teaches:

3. The image retrieval method according to claim 1, wherein: the plurality of query data combinations include a first data combination and a second data combination, the first data combination includes the query data in one modality of the plurality of modalities, and the second data combination includes the query data in two or more modalities of the plurality of modalities; and determining the result image sets includes: determining, by the electronic device, a result image set corresponding to the first data combination from the candidate image set according to one of the similarities corresponding to the query data in the one modality; and fusing, by the electronic device, two or more of the similarities corresponding to the query data in the two or more modalities to obtain a target similarity, and determining a result image set corresponding to the second data combination from the candidate image set according to the target similarity. / 14. The electronic device according to claim 12, wherein: the plurality of query data combinations include a first data combination and a second data combination, the first data combination includes the query data in one modality of the plurality of modalities, and the second data combination includes the query data in two or more modalities of the plurality of modalities; and the at least one computer program further causes the one or more processors to: determine a result image set corresponding to the first data combination from the candidate image set according to one of the similarities corresponding to the query data in the one modality; and fuse two or more of the similarities corresponding to the query data in the two or more modalities to obtain a target similarity, and determining a result image set corresponding to the second data combination from the candidate image set according to the target similarity.

(Bursztyn: [0063]-[0065], quoted above; Neculai: page 4550, section 3.3, Model Optimization, quoted above.)
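The fusion step of claims 3/14 leaves the fusion operator open. A weighted mean over the per-modality similarity vectors is one plausible reading, sketched here for illustration only:

```python
import numpy as np

def fuse_similarities(similarities, weights=None):
    """Fuse two or more per-modality similarity vectors (each scoring every
    candidate image) into the claimed target similarity. A weighted mean is
    one plausible fusion operator; the claim does not fix the operator."""
    sims = np.stack([np.asarray(s, float) for s in similarities])  # (M, N)
    w = np.ones(len(sims)) if weights is None else np.asarray(weights, float)
    return (w / w.sum()) @ sims                                    # (N,) fused scores
```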
Consider Claims 4 and 15. The combination of Bursztyn and Neculai teaches:

4. The image retrieval method according to claim 1, wherein performing feature extraction on the query data based on the target model to obtain the first features includes: converting, by the electronic device, the query data in the plurality of modalities into retrieval embedding vectors in a same vector format; and inputting, by the electronic device, the retrieval embedding vectors into the target model, and performing feature mapping on the query data based on the target model, to obtain the first features. / 15. The electronic device according to claim 12, wherein the at least one computer program further causes the one or more processors to: convert the query data in the plurality of modalities into retrieval embedding vectors in a same vector format; and input the retrieval embedding vectors into the target model, and performing feature mapping on the query data based on the target model, to obtain the first features.

(Bursztyn: [0036] In some cases, image search apparatus 110 includes an architecture that is based on text-based image retrieval and multi-modal image retrieval processes with cross-modal embeddings and critique understanding process with natural language generation. For example, in some cases, image search apparatus 110 can perform text-based image retrieval with cross-modal embeddings. In some cases, image search apparatus 110 can perform multi-modal image retrieval with cross-modal embeddings. In some cases, image search apparatus 110 can perform critique understanding with controllable natural language generation. In some cases, the image retrieval is based on maximizing the similarity of images in a dataset with respect to an expanded search query. In some cases, image search apparatus 110 retrieves images after shapes of concatenated vectors match.
[0082] According to some aspects, the image search apparatus of FIG. 5 can perform multi-modal image retrieval with cross-modal embeddings. For example, multi-modal encoder 520 can encode reference image r and a search query q in a same embedding space using a multi-modal encoding model such that r_enc = CLIP(r) and q_enc = CLIP(q). Image search apparatus 110 can concatenate these embeddings to produce a multi-modal query or expanded query q_m: q_m = r_enc ⊕ q_enc. Multi-modal encoder 520 can produce a self-concatenated encoded embedding of each image i in an image dataset D until vector shapes are matched with q_m: i_enc = CLIP(i) ⊕ CLIP(i). Search component 505 can compute a similarity score for the pairwise similarity between q and a single i via the formula Similarity(q_m, i) = cos(q_m, i_enc) and can therefore retrieve images from the image dataset based on the multi-modal query q_m according to the formula Retriever(q_m, D) = argmax_i Similarity(q_m, i) ∀ i ∈ D.

Neculai: page 4550, section 3.3, Model Optimization, quoted above.)
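Paragraph [0082] is concrete enough to restate as code. The sketch below assumes the CLIP encodings are already computed; everything else follows the quoted formulas.

```python
import numpy as np

def retrieve(r_enc, q_enc, image_encs):
    """Bursztyn [0082] restated: build the expanded query q_m = r_enc ⊕ q_enc,
    self-concatenate each candidate encoding so the shapes match, and return
    the index of the candidate maximizing the cosine similarity."""
    q_m = np.concatenate([r_enc, q_enc])                              # expanded query
    cands = np.stack([np.concatenate([i_enc, i_enc]) for i_enc in image_encs])
    sims = cands @ q_m / (np.linalg.norm(cands, axis=1) * np.linalg.norm(q_m))
    return int(np.argmax(sims))                                       # Retriever(q_m, D)
```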
Consider Claims 5 and 16. The combination of Bursztyn and Neculai teaches:

5. The image retrieval method according to claim 4, wherein converting the query data into the retrieval embedding vectors includes: segmenting, by the electronic device, the query data to obtain a plurality of query data blocks; performing, by the electronic device, feature mapping on the plurality of query data blocks to obtain a plurality of first embedding vectors; determining, by the electronic device, a plurality of pieces of location information of the query data blocks in the query data, and performing feature mapping on the plurality of pieces of location information to obtain a plurality of second embedding vectors; performing, by the electronic device, feature mapping on the plurality of modalities corresponding to the query data, to obtain a plurality of third embedding vectors; and concatenating, by the electronic device, the first embedding vectors, the second embedding vectors, and the third embedding vectors, to obtain the retrieval embedding vectors. / 16. The electronic device according to claim 15, wherein the at least one computer program further causes the one or more processors to: segment the query data to obtain a plurality of query data blocks; perform feature mapping on the plurality of query data blocks to obtain a plurality of first embedding vectors; determine a plurality of pieces of location information of the query data blocks in the query data, and performing feature mapping on the plurality of pieces of location information to obtain a plurality of second embedding vectors; perform feature mapping on the plurality of modalities corresponding to the query data, to obtain a plurality of third embedding vectors; and concatenate the first embedding vectors, the second embedding vectors, and the third embedding vectors, to obtain the retrieval embedding vectors.

(Bursztyn: [0036] and [0082], quoted above; Neculai: page 4550, section 3.3, Model Optimization, quoted above.)
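Claims 5/16 describe a tokenizer-style embedding: a content embedding per block, a position embedding, and a modality embedding, concatenated. A minimal sketch, with table sizes, widths, and the modality indexing chosen arbitrarily here:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                          # per-part width (illustrative)
content_table = rng.standard_normal((64, D))    # stands in for learned tables
position_table = rng.standard_normal((32, D))   # supports up to 32 blocks here
modality_table = rng.standard_normal((2, D))    # 0 = text, 1 = image (assumed)

def retrieval_embeddings(block_ids, modality_id):
    """Per claims 5/16: map each query-data block to a first (content),
    second (position), and third (modality) embedding vector, then
    concatenate the three into the retrieval embedding vector."""
    first = content_table[np.asarray(block_ids)]           # (n, D)
    second = position_table[np.arange(len(block_ids))]     # (n, D)
    third = np.broadcast_to(modality_table[modality_id], first.shape)
    return np.concatenate([first, second, third], axis=1)  # (n, 3*D)
```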
Consider Claims 6 and 17. The combination of Bursztyn and Neculai teaches:

6. The image retrieval method according to claim 4, wherein performing feature mapping on the query data based on the target model, to obtain the first features includes: normalizing, by the electronic device, the retrieval embedding vectors, to obtain normalized vectors; performing, by the electronic device, attention feature extraction on the normalized vectors, to obtain attention vectors; and performing, by the electronic device, feature mapping on the attention vectors based on the target model, to obtain the first features. / 17. The electronic device according to claim 15, wherein the at least one computer program further causes the one or more processors to: normalize the retrieval embedding vectors, to obtain normalized vectors; perform attention feature extraction on the normalized vectors, to obtain attention vectors; and perform feature mapping on the attention vectors based on the target model, to obtain the first features.

(Bursztyn: [0036] and [0082], quoted above; Neculai: page 4550, section 3.3, Model Optimization, quoted above.)
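Claims 6/17, together with the feed-forward steps of claims 7/18 quoted next, read like a pre-norm transformer block. A sketch of the claimed order, reading "concatenating" as a residual connection (an assumption; the claims' literal wording says concatenate) and leaving the attention and feed-forward sub-modules caller-supplied:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def feature_mapping_block(x, attn, ffn):
    """Claims 6/17 and 7/18 as one pipeline: normalize the retrieval
    embedding vectors, extract attention features, combine with the input,
    normalize again, apply feed-forward mapping, and combine once more to
    produce the first features. attn and ffn are caller-supplied callables."""
    attention_vectors = attn(layer_norm(x))   # first normalized vectors -> attention
    combined = x + attention_vectors          # "concatenated vectors" (residual read)
    mapped = ffn(layer_norm(combined))        # second normalized vectors -> mapping
    return combined + mapped                  # first features
```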
Consider Claims 7 and 18. The combination of Bursztyn and Neculai teaches:

7. The image retrieval method according to claim 6, wherein: the normalized vectors are first normalized vectors; and performing feature mapping on the attention vector based on the target model, to obtain the first features includes: concatenating, by the electronic device, the attention vectors and the retrieval embedding vectors, to obtain concatenated vectors; normalizing, by the electronic device, the concatenated vectors, to obtain second normalized vectors; performing, by the electronic device, feed forward feature mapping on the second normalized vectors based on the target model, to obtain mapping vectors; and concatenating, by the electronic device, the mapping vectors and the concatenated vectors, to obtain the first features. / 18. The electronic device according to claim 17, wherein: the normalized vectors are first normalized vectors; and the at least one computer program further causes the one or more processors to: concatenate the attention vectors and the retrieval embedding vectors, to obtain concatenated vectors; normalize the concatenated vectors, to obtain second normalized vectors; perform feed forward feature mapping on the second normalized vectors based on the target model, to obtain mapping vectors; and concatenate the mapping vectors and the concatenated vectors, to obtain the first features.

(Bursztyn: [0069] According to some aspects, the one or more neural networks included in text generator 515 includes a transformer. A transformer is a deep learning network that is useful in natural language processing applications and that operates according to an attention mechanism. An attention mechanism is a method of placing differing levels of importance on different elements of an input. Calculating attention can be a three-step process of computing the similarity between a query and key vectors obtained from an input to generate attention weights, using a softmax function to normalize the attention weights, and weighing the attention weights together with the corresponding values. A softmax function is used as the activation function of a neural network to normalize the output of the network to a probability distribution over predicted output classes. After applying the softmax function, each component of the feature map is in the interval (0, 1) and the components add up to one. These values are interpreted as probabilities.

Neculai: [figure image omitted])
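The three-step attention computation described in [0069] maps directly to the standard scaled dot-product form; the 1/sqrt(d) scaling is the usual convention rather than something [0069] recites.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """The three steps in [0069]: score the query against the key vectors,
    normalize the attention weights with a softmax, and weight the values."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))   # steps 1 and 2
    return weights @ values                            # step 3
```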
Consider Claims 8 and 19. The combination of Bursztyn and Neculai teaches:

8. The image retrieval method according to claim 1, wherein the similarities are first similarities; the image retrieval method further comprising, before obtaining the candidate image set and the query data: obtaining, by the electronic device, a sample image and sample retrieval data in a modality other than an image modality, and obtaining a similarity tag between the sample image and the sample retrieval data; performing, by the electronic device, feature extraction on the sample retrieval data based on the target model to obtain a third feature of the sample retrieval data, and performing feature extraction on the sample image based on the target model, to obtain a fourth feature of the sample image obtained after the sample image is aligned with the sample retrieval data; determining, by the electronic device, a second similarity between the sample image and the query data according to the third feature and the fourth feature, and determining a loss value according to the second similarity and the similarity tag; and adjusting, by the electronic device, a parameter of the target model according to the loss value. / 19. The electronic device according to claim 12, wherein: the similarities are first similarities; and the at least one computer program further causes the one or more processors to: obtain a sample image and sample retrieval data in a modality other than an image modality, and obtaining a similarity tag between the sample image and the sample retrieval data; perform feature extraction on the sample retrieval data based on the target model to obtain a third feature of the sample retrieval data, and performing feature extraction on the sample image based on the target model, to obtain a fourth feature of the sample image obtained after the sample image is aligned with the sample retrieval data; determine a second similarity between the sample image and the query data according to the third feature and the fourth feature, and determining a loss value according to the second similarity and the similarity tag; and adjust a parameter of the target model according to the loss value.

(Bursztyn: [0036] and [0082], quoted above; Neculai: page 4550, section 3.3, Model Optimization, quoted above.)

Consider Claim 9. The combination of Bursztyn and Neculai teaches:

9. The image retrieval method according to claim 8, wherein: the loss value is a first loss value; and adjusting the parameter of the target model includes: obtaining, by the electronic device, a category tag of the sample image; performing, by the electronic device, feature extraction on the sample image based on the target model, to obtain a fifth feature that is of the sample image and that corresponds to the image modality; classifying, by the electronic device, the sample image according to the fifth feature to obtain a sample category, and determining, by the electronic device, a second loss value according to the sample category and the category tag; and adjusting, by the electronic device, the parameter of the target model according to the first loss value and the second loss value.
(Neculai: page 4550, section 3.3, Model Optimization, quoted above, and the Learning objective: In similar spirit as the cross entropy loss and contrastive loss [6], we define our loss function as: [equation image omitted] where B denotes the batch size. sim(p(z|S_i), p(z|t_i)) is the similarity between probabilistic distributions of two positive pairs. The loss L_ct is computed across all positive pairs. To ensure the training stability and prevent the learned variance σ²_m (Eq. 1) from collapsing to zero or exploding to very high values, we add an ℓ2 regularization term on the logarithm of the variance, as defined below: [equation image omitted] where |S| is the number of queries being composed. σ²_{i,j} is the variance of the input j in the i-th pair of queries in the batch. The final loss is L = L_ct + λ_ℓ2 L_ℓ2.)
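The quoted objective is fully specified only up to the regularizer's normalization, since both equation images are omitted. One plausible reading of L = L_ct + λ_ℓ2 L_ℓ2, dividing the squared log-variances by the number of composed queries |S|, is sketched below; L_ct itself is assumed to be computed elsewhere.

```python
import numpy as np

def final_loss(l_ct, log_variances, num_queries, lam_l2=1e-4):
    """L = L_ct + lambda_l2 * L_l2, with L_l2 an l2 penalty on the logarithm
    of the learned variances. The exact normalization inside L_l2 is not
    recoverable from the quote; dividing by |S| is one plausible reading."""
    log_vars = np.stack([np.asarray(v, float) for v in log_variances])
    l_l2 = float(np.sum(log_vars ** 2)) / num_queries
    return l_ct + lam_l2 * l_l2
```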
Consider Claim 10. The combination of Bursztyn and Neculai teaches:

10. The image retrieval method according to claim 8, wherein: the loss value is a first loss value; and adjusting the parameter of the target model includes: obtaining, by the electronic device, a first reference image that is of a same category as the sample image and a second reference image that is of a different category than the sample image; performing, by the electronic device, feature extraction on the sample image, the first reference image, and the second reference image based on the target model, to obtain a fifth feature that is of the sample image and that corresponds to the image modality, a sixth feature of the first reference image, and a seventh feature of the second reference image; determining, by the electronic device, a third similarity between the fifth feature and the sixth feature and a fourth similarity between the fifth feature and the seventh feature, and determining, by the electronic device, a second loss value according to the third similarity and the fourth similarity; and adjusting, by the electronic device, the parameter of the target model according to the first loss value and the second loss value.

(Neculai: page 4550, section 3.3, Model Optimization and Learning objective, quoted above.)

Consider Claim 11. The combination of Bursztyn and Neculai teaches:

11. The image retrieval method according to claim 8, wherein: the sample retrieval data comprises sample text; and obtaining the sample image and the sample retrieval data includes: obtaining, by the electronic device, an initial image and initial text; performing, by the electronic device, enhancement processing on the initial image, to obtain an enhanced image; deleting, by the electronic device, a text component of any length in the initial text, or adjusting a text component in the initial text by using a text component in reference text, to obtain enhanced text, the reference text being of a same category as the initial text; and using, by the electronic device, the initial image and the enhanced image as sample images, and using the initial text and the enhanced text as sample text.

(Bursztyn: [0036] and [0082], quoted above; Neculai: page 4550, section 3.3, Model Optimization, quoted above.)
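Claim 11's text enhancement can be sketched in a few lines; whitespace tokenization and the 50/50 branch choice are assumptions, since the claim does not say how text components are delimited or selected.

```python
import random

def enhance_text(initial_text, reference_text, rng=None):
    """Claim 11, sketched: either delete a text component of arbitrary
    length, or swap in a component from same-category reference text."""
    rng = rng or random.Random(0)
    words = initial_text.split()
    if rng.random() < 0.5 and len(words) > 1:
        start = rng.randrange(len(words))
        end = rng.randrange(start + 1, len(words) + 1)
        del words[start:end]                 # delete a component of any length
    else:
        ref_words = reference_text.split()   # adjust using a reference component
        words[rng.randrange(len(words))] = ref_words[rng.randrange(len(ref_words))]
    return " ".join(words)
```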
Conclusion

The prior art made of record in form PTO-892 and not relied upon is considered pertinent to applicant's disclosure. [image omitted]

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAHMINA ANSARI, whose telephone number is 571-270-3379. The examiner can normally be reached on IFP Flex, Monday through Friday, 9 to 5. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, O'NEAL MISTRY, can be reached at 313-446-4912. The fax phone numbers for the organization where this application or proceeding is assigned are 571-273-8300 for regular communications and 571-273-8300 for After Final communications. TC 2600's customer service number is 571-272-2600. Any inquiry of a general nature or relating to the status of this application or proceeding should be directed to the receptionist, whose telephone number is 571-272-2600.

/Tahmina Ansari/
January 8, 2026
/TAHMINA N ANSARI/
Primary Examiner, Art Unit 2674

Prosecution Timeline

Jan 24, 2024
Application Filed
Jan 09, 2026
Non-Final Rejection — §102, §103, §112
Jan 22, 2026
Interview Requested
Mar 18, 2026
Applicant Interview (Telephonic)
Mar 18, 2026
Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586249
PROCESSING APPARATUS, PROCESSING METHOD, AND STORAGE MEDIUM FOR CALIBRATING AN IMAGE CAPTURE APPARATUS
Granted Mar 24, 2026 · 2y 5m to grant

Patent 12586354
TRAINING METHOD, APPARATUS AND NON-TRANSITORY COMPUTER READABLE MEDIUM FOR A MACHINE LEARNING MODEL
Granted Mar 24, 2026 · 2y 5m to grant

Patent 12573083
COMPUTER-READABLE RECORDING MEDIUM STORING OBJECT DETECTION PROGRAM, DEVICE, AND MACHINE LEARNING MODEL GENERATION METHOD OF TRAINING OBJECT DETECTION MODEL TO DETECT CATEGORY AND POSITION OF OBJECT
Granted Mar 10, 2026 · 2y 5m to grant

Patent 12548297
IMAGE PROCESSING METHOD AND APPARATUS, COMPUTER DEVICE, STORAGE MEDIUM, AND PROGRAM PRODUCT BASED ON FEATURE AND DISTRIBUTION CORRELATION
Granted Feb 10, 2026 · 2y 5m to grant

Patent 12524504
METHOD AND DATA PROCESSING SYSTEM FOR PROVIDING EXPLANATORY RADIOMICS-RELATED INFORMATION
Granted Jan 13, 2026 · 2y 5m to grant
Based on this examiner's 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
86%
Grant Probability
99%
With Interview (+17.9%)
2y 8m
Median Time to Grant
Low
PTA Risk
Based on 868 resolved cases by this examiner. Grant probability derived from career allow rate.
