Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-21 are presented for examination.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on May 22, 2023 was filed in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-4, 6-13, and 15-21 are rejected under 35 U.S.C. 103 as being unpatentable over Sadeh (“Joint Visual-Textual Embedding for Multimodal Style Search”, 2019) in view of Badjatiya (US 11874902 B2).
Regarding claim 1,
Sadeh teaches [a] computing system for machine-learned multimodal searching of images (Page 1 Introduction, “This paper refers to the specific, fine-grained, task of visual-textual multimodal search in the fashion domain.”
Sadeh discloses a machine-learned system performing multi-modal image search using visual and textual inputs.),
a machine-learned query refinement model trained to refine an image query with a textual query refinement (Page 1 Introduction, “This paper refers to the specific, fine-grained, task of visual-textual multimodal search in the fashion domain… by enabling intuitive and interactive search refinements… We propose a training objective function which we refer to as Mini-Batch Match Retrieval (MBMR). Each mini-batch consists of matching and non matching image-text pairs. We compute the cosine similarity of each pair, and maximize matching samples similarities…”, Page 5 Section 5.1 Query Arithmetic Approach, “This enables searching for visually similar products with some different properties, defined textually, by simply adding (subtracting) desired (undesired) textual features to (from) the product visual feature vector.”
Sadeh discloses a machine-learned joint visual-textual embedding model that is explicitly trained on paired-image text data using the MBMR loss to align visual and textual representations. The trained model refines an image query by incorporating textual refinements directly into the image query representation which enables the modification of search results based on the refined text queries.);
obtaining an image embedding for a query image… (Page 3 Figure 2 Caption, “A ResNet-18 CNN extracts visual features from the image with an additional fully connected (FC) layer which projects these features to the joint space.”
Sadeh explicitly obtains an image embedding for a query image by processing the image through a trained ResNet-18 convolutional neural network followed by a projection layer, which produces the vector representation of the image in a shared embedding space that is used for multimodal refinement and retrieval.);
obtaining… a textual query refinement for the query image, wherein the textual query refinement is responsive to provision of one or more initial result images for the query image… (Page 1 Introduction, “We believe this type of application can greatly impact the customer shopping experience, by enabling intuitive and interactive search refinements…”, Page 5 Section 5.1 Query Arithmetic Approach, “That is, for a given query image, I, and a desired and undesired attribute set, w = {w+,w−}”, See Figure 1,
[media_image1.png: Sadeh Figure 1, showing example query images, textual refinements, and their retrieved results]
Sadeh discloses interactive refinement where textual attributes are applied after initial retrieval. Figure 1 shows that the textual query refinement of adding a V-neck or adding sleeves is provided in direct response to the results returned for the query image.);
processing the image embedding and the textual query refinement for the query image with the machine-learned query refinement model to obtain a refined image embedding that incorporates the textual query refinement (Page 5 Section 5.1 Query Arithmetic Approach, “That is, for a given query image, I, and a desired and undesired attribute set, w = {w+,w−}, the new mutlimodal query q can be defined by
q = fI + fT
…where fI is the image embedding, and fT is the linear combination of desired and undesired word embeddings.”
Sadeh explicitly discloses a machine-learned joint embedding model that combines an image embedding with a textual refinement embedding to produce a new, refined image embedding. The refined embedding q incorporates the textual query refinement via the additive inclusion of fT, which is derived from the learned word embeddings.);
and determining one or more refined result images based at least in part on the refined image embedding that incorporates the textual query refinement (Page 5 Section 5.1 Query Arithmetic Approach, “That is, for a given query image, I, and a desired and undesired attribute set, w = {w+,w−}, the new mutlimodal query q can be defined by
q = fI + fT
S = cos(q, fIr)
, where fI is the image embedding, and fT is the linear combination of desired and undesired word embeddings. The similarity score, S, between the query and reference catalog items, is defined as the cosine similarity between q and the reference visual features fIr.”
Sadeh determines refined result images by comparing the refined image embedding q against catalog image embeddings using cosine similarity and selecting images based on that similarity score. The refined result images are determined after the model combines the original image embedding with the textual query refinement to form q.).
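For illustration, the query-arithmetic refinement and cosine-similarity retrieval that Sadeh describes can be sketched in a few lines of Python; the embedding values, catalog, and dimensions below are illustrative placeholders rather than data from Sadeh.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def refine_and_rank(f_image, f_desired, f_undesired, catalog, top_k=3):
    """Form the multimodal query q = fI + fT, where fT is the linear combination
    of desired (+) and undesired (-) word embeddings, then rank catalog items by
    cosine similarity to q."""
    f_text = f_desired - f_undesired          # linear combination of word embeddings (fT)
    q = f_image + f_text                      # refined query embedding (query arithmetic)
    scores = [(item_id, cosine(q, f_ref)) for item_id, f_ref in catalog.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]                     # top-K refined result images

# Toy 4-dimensional embeddings; real embeddings would come from the trained encoders.
rng = np.random.default_rng(0)
f_image = rng.normal(size=4)                  # query image embedding fI
f_vneck = rng.normal(size=4)                  # hypothetical embedding of "v-neck" (desired)
f_crew = rng.normal(size=4)                   # hypothetical embedding of "crew neck" (undesired)
catalog = {f"item_{i}": rng.normal(size=4) for i in range(10)}
print(refine_and_rank(f_image, f_vneck, f_crew, catalog))
```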
Sadeh does not teach comprising: one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising… provided by a user of a visual search application…from the user of the visual search application…to the user of the visual search application.
Badjatiya, in the same field of endeavor, teaches comprising: one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising… (Paragraph 74 of Badjatiya, “…one or more storage devices 1090 and/or non-transitory computer-readable media 1030 having encoded thereon one or more computer-executable instructions…”, Paragraph 14, “A computer program product including one or more non-transitory machine-readable mediums encoded with instructions that when executed by one or more processors cause a process to be carried out for image searching, the process comprising: receiving a source image and a text query defining a target image attribute”)
…provided by a user of a visual search application…from the user of the visual search application… to the user of the visual search application (Paragraph 37, “The user interface 140 is presented to the user to allow the user to interact with the image search system 130 through a series of queries 110 and images 120. For example, an initial image 150 of a dress is presented to the user along with a question: “How is this compared to the one you want?” The user then replies with a text query/response: “What I am looking for is more colorful and shorter.” The image search system 130, then provides a target image 160, which more closely matches the user's requirements”, Paragraph 39, “In some embodiments, the image search system 130 may be part of a larger e-commerce system, or other application. The user interface 140 is shown to accept an initial source or reference image 210, which may be provided by the user, the e-commerce system, or from any other source. The source image 210 is presented to the image search system 130, along with a user text query 230. The text query provides additional details about the user's requirements… The image search system 130 is configured to process the image 210 and the text query 230 to extract content and style feature vectors from the image and the text…The image search system 130 generates a target image 240 as a user feedback condition result which is presented to the user, through the user interface 140, as the new/updated source image 220.”
Badjatiya teaches a visual search application where the user provides an image and a refining text query, the system generates a new image result based on the request, and the result is returned to the user, with the process repeating iteratively.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s similarity-based query refinement and multimodal image retrieval framework with Badjatiya’s use of processors and non-transitory computer media to implement an interactive visual search application that receives user-provided images and textual refinements and returns refined image results, in order to deploy Sadeh’s multimodal refinement technique within a complete computing system to enhance usability and real-world functionality (Paragraphs 39 and 74 of Badjatiya).
Regarding claim 2,
Sadeh teaches determining the one or more refined result images comprises: determining one or more image embeddings within a… distance of the refined image embedding of the query image within an image embedding space (Page 5 Section 5.1 Arithmetic Approach, “the new mutlimodal query q can be defined by,
q = fI + fT
S = cos(q, fIr)
where fI is the image embedding, and fT is the linear combination of desired and undesired word embeddings. The similarity score, S, between the query and reference catalog items, is defined as the cosine similarity between q and the reference visual features fIr.”
Sadeh determines the distances between a refined query embedding and catalog image embeddings within a shared, joint embedding space using cosine similarity. The image embeddings are identified based on their proximity to the refined query embedding.);
and selecting the one or more refined result images that respectively correspond to the one or more image embeddings (Page 2 Figure 1 and Caption,
[media_image6.png: Sadeh Figure 1, example queries and their retrieved results]
, Page 6 Evaluation, “The DCG metric measures ranking quality, which cumulates the relevance of the top-K retrieved items per query, while penalizing them differently based on their rank.”,
Sadeh teaches selecting the top-K refined image results for the user, with ranking quality reflected by the DCG metric. Sadeh does this by comparing the refined query embedding to the catalog image embeddings to produce the refined result images.).
Sadeh does not teach using a threshold distance.
Badjatiya, in the same field of endeavor, teaches threshold distance (Paragraph 52 of Badjatiya, “The selection module 390 is configured to select one or more of the potential target images 377 as an identified target image 240 based on the distances 385. For example, in some embodiments, if the distance 385 is less than a threshold value, the potential target image 377 is considered to be close enough to the user's request…”
Badjatiya discloses determining image embeddings that fall within a threshold distance of a query-derived embedding.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s similarity-based multimodal image retrieval framework with Badjatiya’s threshold-based selection of image embeddings in order to improve retrieval precision and ensure that only similar images are presented to the user (Paragraph 14 of Badjatiya).
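A short sketch of the threshold-based selection that Badjatiya describes in paragraph 52 might look as follows; the distance threshold and embeddings are assumed values for illustration only.

```python
import numpy as np

def select_within_threshold(q, catalog, max_distance=0.35):
    """Keep only catalog embeddings whose cosine distance to the refined query
    embedding q is below a threshold value (cf. Badjatiya, paragraph 52)."""
    selected = []
    for item_id, f_ref in catalog.items():
        cos_sim = np.dot(q, f_ref) / (np.linalg.norm(q) * np.linalg.norm(f_ref))
        distance = 1.0 - cos_sim              # cosine distance
        if distance < max_distance:           # within the threshold distance
            selected.append((item_id, distance))
    return sorted(selected, key=lambda pair: pair[1])

catalog = {"item_a": np.array([1.0, 0.0]), "item_b": np.array([0.2, 0.9])}
print(select_within_threshold(np.array([1.0, 0.1]), catalog))   # only item_a is close enough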
Regarding claim 3,
Sadeh teaches obtaining the image embedding for the query image comprises: obtaining the query image…; and determining the image embedding based at least in part on the query image, wherein the image embedding is representative of the query image (Page 3 Section 4 Training, “Image encoding is based on a ResNet-18 [7] deep convolutional neural network (CNN), followed by an additional fully connected layer which projects the visual feature vector to the same space as the textual encoding.”, Page 5 Section 5.1, “That is, for a given query image, I, and a desired and undesired attribute set, w = {w+,w−}, the new mutlimodal query q can be defined by,
q = fI + fT
S = cos(q, fIr)
where fI is the image embedding, and fT is the linear combination of desired and undesired word embeddings”
Sadeh teaches obtaining the image embedding from the query image using ResNet. The query image embedding is combined with the refinement text w to generate a refined query embedding used to retrieve images that meet the user’s request.).
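For illustration, the image-encoding step Sadeh describes (ResNet-18 features projected by a fully connected layer into the joint space) can be sketched in PyTorch as follows; the embedding dimension and untrained weights are placeholders, not Sadeh’s trained model.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """ResNet-18 backbone followed by a fully connected projection into the
    joint visual-textual embedding space (cf. Sadeh, Section 4)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)   # placeholder weights, not a trained model
        backbone.fc = nn.Identity()                # keep the 512-d pooled visual features
        self.backbone = backbone
        self.project = nn.Linear(512, embed_dim)   # projection to the joint space

    def forward(self, image):
        # image: (batch, 3, H, W) tensor; returns (batch, embed_dim) embeddings fI.
        return self.project(self.backbone(image))

encoder = ImageEncoder(embed_dim=512)
f_image = encoder(torch.randn(1, 3, 224, 224))     # query image embedding fI
```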
Sadeh does not teach the visual search application.
Badjatiya, in the same field of endeavor, teaches the user of the visual search application (Paragraph 39, “In some embodiments, the image search system 130 may be part of a larger e-commerce system, or other application… The image search system 130 is configured to process the image 210 and the text query 230 to extract content and style feature vectors from the image and the text…The image search system 130 generates a target image 240 as a user feedback condition result which is presented to the user, through the user interface 140, as the new/updated source image 220.”
Badjatiya teaches using a visual search application that allows the user to submit an image with a refined query. The user then receives an image result that is based on the user’s request, and the process repeats until the user is satisfied.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s method of obtaining an image embedding representative of a query image using a convolutional neural network with Badjatiya’s use of a visual search application that receives user-submitted images in order to enable Sadeh’s method to be practically applied within an interactive visual search environment to enhance usability (Paragraphs 39 and 74 of Badjatiya).
Regarding claim 4,
Sadeh teaches obtaining the textual query refinement for the query image further comprises determining one or more token embeddings representative of the textual query refinement (Page 3 Section 4 Training, “Text encoding is done by summing the word embeddings of all input words.”, Page 5 Section 5.1, “…the text and image encoders can yield image and textual query feature vectors which lay in a common embedding space.”
Sadeh teaches determining token embeddings representative of a textual query refinement through the process of tokenizing the textual input and encoding the text by summing the learned word embeddings. The tokenized words correspond to the “one or more tokens,” and the learned word embeddings of those tokens correspond to the “one or more token embeddings” that represent the textual query refinement.);
and wherein processing the image embedding and the textual query refinement comprises processing the image embedding and the one or more token embeddings with the machine- learned query refinement model to obtain the refined image embedding for the query image that incorporates the textual query refinement (Page 4 Section 4.1, “In order to clean and normalize the textual metadata we use several preprocessing steps when building our vocabulary. (1) Tokenization– divide the raw description text into a set of tokens.”, Page 3 Section 4 Training, “Image encoding is based on a ResNet-18 [7] deep convolutional neural network (CNN), followed by an additional fully connected layer which projects the visual feature vector to the same space as the textual encoding.”, Page 5 Section 5.1 Query Arithmetic Approach, “That is, for a given query image, I, and a desired and undesired attribute set, w = {w+,w−}, the new mutlimodal query q can be defined by
q = fI + fT
…where fI is the image embedding, and fT is the linear combination of desired and undesired word embeddings.”
Sadeh explicitly discloses a machine-learned joint embedding model that combines an image embedding with a textual refinement embedding to produce a new, refined image embedding. The refined embedding q incorporates the textual query refinement via the additive inclusion of fT, which is derived from the learned word embeddings. The refined image embedding q is computed by adding the image embedding fI to fT, the linear combination of token word embeddings, which produces the refined image embedding that incorporates the textual query refinement.).
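A minimal sketch of the text-encoding step cited above (tokenize the refinement, look up word embeddings, and sum them) is shown below; the vocabulary and vectors are illustrative assumptions, not Sadeh’s trained word2vec embeddings.

```python
import numpy as np

# Illustrative word-embedding table; Sadeh trains word2vec embeddings on product titles.
rng = np.random.default_rng(1)
vocab = {"red": rng.normal(size=4), "v-neck": rng.normal(size=4), "sleeves": rng.normal(size=4)}

def encode_text(refinement):
    """Tokenize the textual query refinement and sum the word embeddings of all
    input tokens to obtain the text embedding fT."""
    tokens = refinement.lower().split()                           # "one or more tokens"
    token_embeddings = [vocab[t] for t in tokens if t in vocab]   # "one or more token embeddings"
    return np.sum(token_embeddings, axis=0)

f_text = encode_text("red v-neck")   # fT, later added to the image embedding fI
```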
Regarding claim 6,
Sadeh does not teach the operations further comprise: providing the one or more refined result images to a user device for display within an interface of the visual search application.
Badjatiya, in the same field of endeavor, teaches the operations further comprise: providing the one or more refined result images to a user device for display within an interface of the visual search application (Paragraph 37, “ The user interface 140 is presented to the user to allow the user to interact with the image search system 130 through a series of queries 110 and images… The image search system 130, then provides a target image 160, which more closely matches the user's requirements”, Paragraph 39, “The image search system 130 generates a target image 240 as a user feedback condition result which is presented to the user, through the user interface 140, as the new/updated source image 220.”
Badjatiya teaches providing one or more refined result images to a user device for display within an interface of a visual search application.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s similarity-based multimodal image retrieval framework with Badjatiya’s visual search application that presents refined result images to a user device through an interactive user interface in order to enable user interaction to enhance the practical usability for visual search applications (Paragraphs 37 and 39 of Badjatiya).
Regarding claim 7,
Sadeh teaches the operations further comprise: obtaining, responsive to provision of the one or more refined result images, a second textual query refinement for the query image (Page 1 Introduction, “We believe this type of application can greatly impact the customer shopping experience, by enabling intuitive and interactive search refinements.”, Page 5 Section 5.1, “This enables searching for visually similar products with some different properties, defined textually, by simply adding (subtracting) desired (undesired) textual features to (from) the product visual feature vector.”, See Figure 1 where the image illustrates an initial query image producing retrieved items followed by textual refinements such as ‘V-neck’ or ‘add sleeves’ applied after viewing the retrieved refined results.).
Regarding claim 8,
Sadeh teaches processing the second textual query refinement and the image embedding of the query image with the machine-learned query refinement model to obtain a second refined image embedding that incorporates the second textual query refinement (Page 5 Section 5.1, “That is, for a given query image, I, and a desired and undesired attribute set, w = {w+,w−}, the new mutlimodal query q can be defined by, q =fI +fT… where fI is the image embedding, and fT is the linear combination of desired and undesired word embeddings.”, Page 1 Introduction, “…by enabling intuitive and interactive search refinements.”, See Figure 1 where the illustration shows multiple rounds of textual refinement queries applied to a query image after prior results are retrieved.
Sadeh teaches processing a second textual query refinement with the image embedding using the machine-learned query refinement model because the method discloses combining the image embedding with the newly provided textual attribute embedding to generate a new image representation after each interactive refinement step.).
Regarding claim 9,
Sadeh does not teach the operations further comprise processing the refined image embedding and the second textual query refinement with the machine-learned query refinement model to obtain a second refined image embedding that incorporates the textual query refinement and the second textual query refinement.
Badjatiya, in the same field of endeavor, teaches the operations further comprise processing the refined image embedding and the second textual query refinement with the machine-learned query refinement model to obtain a second refined image embedding that incorporates the textual query refinement and the second textual query refinement (Paragraph 52 of Badjatiya, “The user may then accept the proffered target image 240, or continue the search using the target image 240 as a new/updated source image 220 in combination with a new text query 230 to refine the search.”, Paragraph 39, “The image search system 130 is configured to process the image 210 and the text query 230 to extract content and style feature vectors… The image search system 130 generates a target image 240 as a user feedback condition result which is presented to the user, through the user interface 140, as the new/updated source image 220.”, Paragraph 37, “The user interface 140 is presented to the user to allow the user to interact with the image search system 130 through a series of queries 110 and images 120.”, Paragraph 35, “The method also includes using a first neural network to decompose the source image into an image content feature vector and an image style feature vector that are disentangled from each other. The method further includes using a second neural network to decompose the text query into a text content feature vector and a text style feature vector ”
Badjatiya teaches processing a previously refined image representation (the “new/updated source image”) together with a second textual query refinement to generate a further refined image result.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s similarity-based multimodal image retrieval framework with Badjatiya’s iterative processing of an updated image representation together with a subsequent textual query refinement in order to enable multi-step query refinement and user interaction, enhancing the practical usability of visual search applications (Paragraphs 37 and 39 of Badjatiya).
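For illustration, the iterative refinement discussed for claims 7-9 can be sketched as a loop in which the refined embedding from one round becomes the query for the next; the toy embeddings below are assumptions for illustration only.

```python
import numpy as np

def iterative_refine(f_image, refinement_embeddings):
    """Fold successive textual refinement embeddings into the query embedding:
    the refined embedding from round t is the query for round t+1."""
    q = np.asarray(f_image, dtype=float)
    for f_text in refinement_embeddings:
        q = q + np.asarray(f_text, dtype=float)   # incorporate the next refinement
    return q

# Two rounds of refinement on a toy 4-dimensional query embedding.
q2 = iterative_refine([0.2, 0.1, -0.3, 0.5],
                      [[0.0, 0.4, 0.1, 0.0], [-0.1, 0.0, 0.2, 0.3]])
print(q2)
```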
Regarding claim 10,
Sadeh teaches [a] computer-implemented method, comprising… a query image embedding for a query image and a textual query refinement associated with the query image (Page 6 Section 5.3, “We attempt to combine both previously described methods into a single robust one. We do so by using the soft attribute filtering along with the query arithmetic based search.”, Page 3 Section 4, “Image encoding is based on a ResNet-18 [7] deep convolutional neural network (CNN)… which projects the visual feature vector to the same space as the textual encoding.”, Page 5 Section 5.1, “…for a given query image, I, and a desired and undesired attribute set, w = {w+,w−}…”
Sadeh teaches obtaining an image embedding for a query image and associates it with a textual refinement expressed as desired/undesired attributes.);
processing… the query image embedding and the textual query refinement with a machine-learned query refinement model to obtain a refined query image embedding that incorporates the textual query refinement (Page 5 Section 5.1, “…the new mutlimodal query q can be defined by, q =fI +fT…where fI is the image embedding, and fT is the linear combination of desired and undesired word embeddings.”
Sadeh teaches producing a refined query image embedding that combines the refined text query with the retrieved image result.);
evaluating… a loss function that evaluates a distance between the refined query image embedding and an embedding for a ground truth image within an image embedding space (Page 1 Introduction, “We propose a training objective function which we refer to as Mini-Batch Match Retrieval (MBMR). Each mini-batch consists of matching and non matching image-text pairs. We compute the cosine similarity of each pair, and maximize matching samples similarities with cross-entropy loss”, Section 4.3 Page 4, “The ground-truth labels are determined by the existence of words in the product textual metadata. An additional loss term is added for this multi-label classification task.”, See Equations 1-4,
[media_image9.png: Sadeh Equations 1-4, cosine similarity and the MBMR cross-entropy loss]
Sadeh evaluates a loss function based on cosine similarity distances between a refined query image embedding and a ground-truth embedding within a shared embedding space, which is seen in Equation 1 where the cosine similarity is computed between the image and text embeddings and Equations 2-4 where the cross-entropy loss is computed over matching and non-matching pairs.);
and modifying… one or more values of one or more parameters of the machine-learned query refinement model based at least in part on the loss function (Page 4 Section 4, “A Mini-Batch Match Retrieval (MBMR) loss, LMBMR, for the task of learning a joint embedding space, and a multi-label cross-entropy loss, La, for attribute extraction. The final objective is a weighted sum of both loss terms… We use the Adam [12] optimizer, with an exponentially decaying learning rate schedule. We have found that all of these settings are helpful in order to improve convergence and reduce overfitting.”, Page 4 Section 4.2, “The objective of the joint-embedding training procedure should encourage matching (non-matching) image-text pairs to be as close (distant) as possible to (from) each other, in the common embedding space. To achieve this, we propose the following Mini-Batch Match Retrieval (MBMR) objective.”
Sadeh trains the visual-textual joint embedding model using the MBMR loss, meaning the cross-entropy loss computed from cosine similarity distances is backpropagated to update the parameters of the embedding networks.).
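For illustration, a mini-batch match-retrieval style objective of the kind Sadeh describes can be sketched in PyTorch as follows; the batch size, embedding dimension, and temperature are assumptions and not Sadeh’s exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def mbmr_loss(image_embeds, text_embeds, temperature=0.1):
    """Mini-batch match-retrieval style objective: for a batch of N matching
    image-text pairs, compute the N x N cosine similarities and apply
    cross-entropy so each image is most similar to its own matching text."""
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    sims = img @ txt.t() / temperature        # N x N cosine similarity matrix
    targets = torch.arange(sims.size(0))      # the matching pair lies on the diagonal
    return F.cross_entropy(sims, targets)

# One optimization step over a toy batch; in a full training loop the gradients
# would flow back into the image and text encoders to modify their parameters.
image_embeds = torch.randn(8, 512, requires_grad=True)
text_embeds = torch.randn(8, 512, requires_grad=True)
loss = mbmr_loss(image_embeds, text_embeds)
loss.backward()
```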
Sadeh does not teach …obtaining, by a computing system comprising one or more computing devices… by the computing system…
Badjatiya, in the same field of endeavor, teaches …obtaining, by a computing system comprising one or more computing devices… by the computing system… (Paragraph 78 of Badjatiya, “In some embodiments, the computing platform 1000 runs… any proprietary operating system, any operating systems for mobile computing devices, or any other operating system capable of running on the computing platform”, Paragraph 89, “Example 8 is a system for image searching, the system comprising: a first neural network (NN) trained to generate an image content feature vector associated with content of a source image and an image style feature vector associated with style of the source image…”)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s similarity-based query refinement and multimodal image retrieval framework with Badjatiya’s use of processors and non-transitory computer media to implement an interactive visual search application that receives user-provided images and textual refinements and returns refined image results, in order to deploy Sadeh’s multimodal refinement technique within a complete computing system to enhance usability and real-world functionality (Paragraphs 39 and 74 of Badjatiya).
Regarding claim 11,
Sadeh teaches the query image depicts an entity with a first characteristic; the textual query refinement is descriptive of a second characteristic for the entity different than the first characteristic; and the ground truth image depicts the entity with the second characteristic (See Page 2 Figure 1, Page 4 Section 4.2, “The objective… encourage matching (non-matching) image-text pairs to be as close (distant) as possible to (from) each other, in the common embedding space…. each mini-batch consists of N product items, {Ii,Ti}N i=1, where Ii is an image, and Ti is its corresponding textual metadata.”, Page 6 Section 6, “The pool consisted of 110 fashion attributes from 5 major categories: color, pattern, neckline, style and garment type. Textual requirements can specify either adding, removing or replacing specific properties to or from the query image.”, Page 1 Introduction, “For instance, given fI, a representing vector of an image of a blue car, fI- “blue” + “red” yields a representing vector of a red car image.”, Page 5 Section 5.1, “This enables searching for visually similar products with some different properties, defined textually, by simply adding (subtracting) desired (undesired) textual features…” Page 6 Section 6, “Textual requirements can specify either adding, removing or replacing specific properties to or from the query image.”
Sadeh teaches a query image depicting an entity with an initial characteristic (first characteristic). The query image is processed together with a user-provided textual refinement that explicitly specifies a different desired attribute for the same entity, such as a change in color, style, pattern, or neckline, corresponding to the second characteristic, which is illustrated in Figure 1. Sadeh further teaches retrieving and training against catalog images, such that the ground truth image depicts the entity with the second characteristic rather than the initial characteristic depicted in the query image.)
Regarding claim 12,
Sadeh teaches obtaining the query image embedding and the textual query refinement further comprises: determining, by the computing system, a textual embedding for the textual query refinement (Page 1 Introduction, “We consider training a visual-textual joint embedding model in an end-to-end manner, based on images and textual metadata of catalog products.”, Page 4, Section 4.2, “In our training setting, each mini-batch consists of N product items, {Ii,Ti}N i=1, where Ii is an image, and Ti is its corresponding textual metadata.”
Sadeh encodes textual metadata T_i into a learned representation which corresponds to the textual query refinement and encodes the query image into an embedding. The framework then combines the textual refinement embedding and the query image embedding into a refined query embedding used to retrieve images for the user.);
and wherein processing the query image embedding and the textual query refinement with the machine-learned query refinement model comprises processing, by the computing system, the query image embedding and the textual embedding for the textual query refinement with the machine-learned query refinement model to obtain a refined query image embedding that incorporates the textual query refinement (Page 4 Section 4.2, “The objective of the joint-embedding training procedure should encourage matching (non-matching) image-text pairs to be as close (distant) as possible to (from) each other, in the common embedding space… For each image embedding in the batch, fI, and text embedding in the batch, fT, we compute their cosine similarity”
Sadeh processes image embeddings and textual embeddings within the machine-learned joint embedding model, producing refined query image embeddings whose position in the embedding space reflects the semantics of the refined textual input.).
Regarding claim 13,
Sadeh teaches textual embedding for the textual query refinement comprises a plurality of token embeddings (Page 3 Section 4, “Text encoding is done by summing the word embeddings of all input words… The word embeddings are based on word2vec, and are trained on product titles.”, Page 4 Section 4.1, “Tokenization– divide the raw description text into a set of tokens… These preprocessing steps determine the vocabulary, V , of our model.”
Sadeh teaches that the textual metadata is tokenized into a set of tokens where each token is represented by a word embedding.).
Regarding claim 16,
Sadeh teaches obtaining, by the computing system from a user, a user query image and a textual query refinement for the user query image, wherein the textual query refinement is responsive to provision of one or more initial result images to the user responsive to the user query image (Page 1 Introduction, “We believe this type of application can greatly impact the customer shopping experience, by enabling intuitive and interactive search refinements…”, Page 5 Section 5.1 Query Arithmetic Approach, “That is, for a given query image, I, and a desired and undesired attribute set, w = {w+,w−}”, See Figure 1,
[media_image1.png: Sadeh Figure 1, showing example query images, textual refinements, and their retrieved results]
Sadeh discloses interactive refinement where textual attributes are applied after initial retrieval. Figure 1 shows that the textual query refinement of adding a V-neck or adding sleeves is provided in direct response to the results returned for the query image.);
and processing, by the computing system, the user query image and the textual query refinement for the user query image with the machine-learned query refinement model to obtain a refined image embedding of the user query image that incorporates the textual query refinement (Page 5 Section 5.1, “That is, for a given query image, I, and a desired and undesired attribute set, w = {w+,w−}, the new mutlimodal query q can be defined by, q =fI +fT… where fI is the image embedding, and fT is the linear combination of desired and undesired word embeddings.”, Page 1 Introduction, “…by enabling intuitive and interactive search refinements.”, See Figure 1 where the illustration shows multiple rounds of textual refinement queries applied to a query image after prior results are retrieved.
Sadeh teaches processing a textual query refinement with the image embedding using the machine-learned query refinement model because the method discloses combining the image embedding with the newly provided textual attribute embedding to generate a new image representation after each interactive refinement step.).
Regarding claim 17,
Sadeh teaches obtaining, by the computing system, one or more refined result images responsive to the refined image embedding of the user query image; and providing, by the computing system, the one or more refined result images (Page 1 Introduction, “This paper refers to the specific, fine-grained, task of visual-textual multimodal search in the fashion domain. Example queries and their retrieved results can be seen in Figure 1. We believe this type of application can greatly impact the customer shopping experience, by enabling intuitive and interactive search refinements.… We compute the cosine similarity of each pair,”, Page 7 Section 7, “The top-K accuracy metric measures the rate of images and text descriptions for which the actual matching pair was ranked, based on cosine similarity, within the top K references”,
Sadeh obtains retrieved images by computing cosine similarity within a shared embedding space and ranking images based on their similarity to the query embedding.).
Regarding claim 18,
Sadeh teaches providing the one or more refined result images comprises providing, by the computing system, the one or more refined result…(Page 1 Introduction, “Example queries and their retrieved results can be seen in Figure 1. We believe this type of application can greatly impact the customer shopping experience, by enabling intuitive and interactive search refinements.”).
Sadeh does not teach images for display within an interface of a search application of a user device of the user.
Badjatiya, in the same field of endeavor, teaches images for display within an interface of a search application of a user device of the user (Paragraph 37, “The user interface 140 is presented to the user to allow the user to interact with the image search system 130 through a series of queries 110 and images 120.”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s similarity-based query refinement and multimodal image retrieval framework with Badjatiya’s use of processors and non-transitory computer media to implement an interactive visual search application that receives user-provided images and textual refinements and returns refined image results, in order to deploy Sadeh’s multimodal refinement technique within a complete computing system to enhance usability and real-world functionality (Paragraphs 39 and 74 of Badjatiya).
Regarding claim 19,
Sadeh teaches obtaining the one or more refined result images comprises selecting, by the computing system, one or more image embeddings within a… distance of the refined image embedding of the user query image within the image embedding space, wherein the one or more image embeddings are respectively associated with the one or more refined result images (Page 1 Introduction, “We compute the cosine similarity of each pair…”, Page 4 Section 4.2, “The objective of the joint-embedding training procedure should encourage matching (non-matching) image-text pairs to be as close (distant) as possible to (from) each other, in the common embedding space.”, “Page 7 Section 7, “The top-K accuracy metric measures the rate of images and text descriptions for which the actual matching pair was ranked, based on cosine similarity…”).
Sadeh does not teach a threshold distance.
Badjatiya, in the same field of endeavor, teaches threshold distance (Paragraph 52 of Badjatiya, “The selection module 390 is configured to select one or more of the potential target images 377 as an identified target image 240 based on the distances 385. For example, in some embodiments, if the distance 385 is less than a threshold value, the potential target image 377 is considered to be close enough to the user's request…”
Badjatiya discloses determining image embeddings that fall within a threshold distance of a query-derived embedding.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s similarity-based multimodal image retrieval framework with Badjatiya’s threshold-based selection of image embeddings in order to improve retrieval precision and ensure that only similar images are presented to the user (Paragraph 14 of Badjatiya).
Regarding claim 20,
Sadeh teaches receiving, by the computing system, data indicative of a selection of at least one refined result image of the one or more refined result images by the user (Page 1 Introduction, “Example queries and their retrieved results can be seen in Figure 1. We believe this type of application can greatly impact the customer shopping experience, by enabling intuitive and interactive search refinements.”, Page 7 Section 7, “The top-K accuracy metric measures the rate of images and text descriptions for which the actual matching pair was ranked, based on cosine similarity…”
Sadeh teaches an interactive visual textual search system in which users issue refined queries and obtain ranked retrieval results.);
and modifying, by the computing system, one or more values of the one or more parameters of the machine-learned query refinement model based at least in part on the at least one refined result image (Page 4 Section 4, “A Mini-Batch Match Retrieval (MBMR) loss, LMBMR, for the task of learning a joint embedding space, and a multi-label cross-entropy loss, La, for attribute extraction. The final objective is a weighted sum of both loss terms… We use the Adam [12] optimizer, with an exponentially decaying learning rate schedule. We have found that all of these settings are helpful in order to improve convergence and reduce overfitting.”, Page 4 Section 4.2, “The objective of the joint-embedding training procedure should encourage matching (non-matching) image-text pairs to be as close (distant) as possible to (from) each other, in the common embedding space. To achieve this, we propose the following Mini-Batch Match Retrieval (MBMR) objective.”
Sadeh trains the visual-textual joint embedding model using the MBMR loss, meaning the cross-entropy loss computed from cosine similarity distances is backpropagated to update the parameters of the embedding networks. The model parameters of the embedding model are modified based on the refined result images.).
Regarding claim 21,
Sadeh does not teach One or more non-transitory computer-readable media that store instructions that, when executed by one or more processors of a computing system, cause the computing system to perform operations, the operations comprising.
Badjatiya, in the same field of endeavor, teaches [o]ne or more non-transitory computer-readable media that store instructions that, when executed by one or more processors of a computing system, cause the computing system to perform operations, the operations comprising (Paragraph 95 of Badjatiya, “ Example 14 is a computer program product including one or more non-transitory machine-readable mediums encoded with instructions that when executed by one or more processors cause a process to be carried out for image searching”):
The remainder of claim 21 is similar in scope to limitations recited in claim 1, and thus is rejected using the same rationale.
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Sadeh (“Joint Visual-Textual Embedding for Multimodal Style Search”, 2019) in view of Badjatiya (US 11874902 B2) and Ding (“Vision-Language Transformer and Query Generation for Referring Segmentation”, 2021).
Regarding claim 5,
Sadeh does not teach the machine-learned query refinement model comprises a transformer model.
Ding, in the same field of endeavor, teaches the machine-learned query refinement model comprises a transformer model (Page 3 Figure 2,
[media_image10.png: Ding Figure 2, overview of the vision-language transformer architecture]
Page 5 Section 3.3, “We use a complete but shallow transformer to apply the attention operations on input features.”
Ding discloses a machine-learned query refinement model that employs a transformer encoder-decoder architecture to generate language-derived query vectors that refine the image query based on the refinement text.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s similarity-based multimodal image retrieval framework with Ding’s transformer based encoding of input in order to yield more effective query refinement while operating in the same multimodal image-text field (Introduction of Ding).
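For illustration, one way a transformer could serve as the query refinement model is sketched below, attending jointly over the image embedding and the refinement token embeddings; this is a hypothetical architecture for illustration, not Ding’s or Sadeh’s exact network.

```python
import torch
import torch.nn as nn

class TransformerRefiner(nn.Module):
    """Illustrative transformer-based query refinement model: a shallow
    transformer encoder attends over the image embedding and the textual
    refinement token embeddings, and the image position is read out as the
    refined query embedding."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, image_embed, token_embeds):
        # image_embed: (batch, dim); token_embeds: (batch, num_tokens, dim)
        seq = torch.cat([image_embed.unsqueeze(1), token_embeds], dim=1)
        out = self.encoder(seq)
        return out[:, 0, :]                    # refined image embedding

refiner = TransformerRefiner()
refined = refiner(torch.randn(2, 256), torch.randn(2, 5, 256))
```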
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Sadeh (“Joint Visual-Textual Embedding for Multimodal Style Search”, 2019) in view of Badjatiya (US 11874902 B2) and Fey (US 20150370833 A1).
Regarding claim 14,
Sadeh teaches prior to evaluating the loss function (Page 4 Section 4, “A Mini-Batch Match Retrieval (MBMR) loss, LMBMR, for the task of learning a joint embedding space, and a multi-label cross-entropy loss, La, for attribute extraction.”
Sadeh teaches using a loss function (MBMR) that is evaluated after the refined image results are obtained based on the refined query.)
Sadeh does not teach the method comprises: obtaining, by the computing system, a corpus of image search data comprising search result images provided to users responsive to a query, and refined search result images provided to the users responsive to selection of query refinement elements provided to the users with the search result images; and selecting, by the computing system, the query image, the textual query refinement, and the ground truth image from the search result images, the query refinement elements, and the refined search result images.
Fey, in the same field of endeavor, teaches obtaining, by the computing system, a corpus of image search data comprising search result images provided to users responsive to a query (Paragraph 21 of Fey, “Image search results that are responsive to an initial search query (“initial query”) are presented in a results portion of an image results page.”, Paragraph 71, “Initial query search results data is provided (510) in order to cause the user device to display the image results.”
Fey discloses image search results that are responsive to an initial query submitted by the user on the device.),
and refined search result images provided to the users responsive to selection of query refinement elements provided to the users with the search result images (Paragraph 22, “…interaction with the image query suggestion can cause presentation of a preview window in which a set of image search results… are also responsive to the refined query.”, Paragraph 51 of Fey, “image results 214 responsive to the refined query 210a are displayed in a preview window 216 in response to the user interaction.”, Paragraph 72, “Image query suggestion data is also provided… which includes both the refined query and an image from the images selected as responsive to the refined query.”
Fey teaches that when a user selects an image query suggestion, the system presents refined image results responsive to the refined query.);
and selecting, by the computing system, the query image, the textual query refinement, and the ground truth image from the search result images, the query refinement elements, and the refined search result images (Paragraph 68, “A refined query is selected (506).”, Paragraph 69, “Images responsive to the refined query are selected (508).”, Paragraph 48, “For each refined query, the subset of the image results for the initial query that are also responsive to the refined query is identified. One of these image results can be selected to be the representative image 212.”
The representative image selected as responsive to the refined query corresponds to the ground truth image since the image was selected based on the refined query submitted by the user.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date to combine Sadeh’s teaching with Fey’s method of query refinement search data and refined image selection techniques in order to improve the effectiveness of Sadeh’s embedding model (Paragraph 66 of Fey).
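For illustration, assembling training triples from logged search interactions of the kind Fey describes might be sketched as follows; the log schema and field names are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    query_image: str         # a search result image provided for the initial query
    refinement_text: str     # the query refinement element the user selected
    ground_truth_image: str  # a refined search result image provided for that refinement

def build_examples(search_log):
    """Assemble (query image, textual query refinement, ground truth image) triples
    from a corpus of logged image search sessions (hypothetical log schema)."""
    examples = []
    for session in search_log:
        for result_image in session["initial_results"]:
            for refinement, refined_images in session["selected_refinements"].items():
                for refined_image in refined_images:
                    examples.append(TrainingExample(result_image, refinement, refined_image))
    return examples
```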
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAJD MAHER HADDAD whose telephone number is (571)272-2265. The examiner can normally be reached Monday-Friday, 8 am-5 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.M.H./Examiner, Art Unit 2125
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125