Last updated: May 29, 2026
Application No. 17/527,917
MULTI-HEAD DEEP METRIC MACHINE-LEARNING ARCHITECTURE

Non-Final OA: §101, §103
Filed: Nov 16, 2021
Priority: Nov 16, 2020 (provisional 63/114,172)
Examiner: GORMLEY, AARON PATRICK
Art Unit: 2148
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Objectvideo Labs LLC
OA Round: 3 (Non-Final)
Interview: Optional

Interview lift (-60.0%) is below the 15.0% threshold. A written response is recommended.
Based on 6 resolved cases, 2023–2026
Examiner Intelligence

GORMLEY, AARON PATRICK
Career Allowance Rate: 50% of resolved cases granted (3 granted / 6 resolved), -5.0% vs TC avg
Interview Lift: -60.0% (minimal)
Avg Prosecution (typical timeline): 3y 11m
Currently pending: 19
Statute-Specific Performance

§101: 24.1% (-15.9% vs TC avg)
§103: 57.8% (+17.8% vs TC avg)
§102: 9.6% (-30.4% vs TC avg)
§112: 7.2% (-32.8% vs TC avg)
Based on career data from 6 resolved cases
Office Action

§101 §103
DETAILED ACTION
	This action is in response to the application filed 11/16/2021. Claims 1-17 and 19-21 are pending and have been examined.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 11/18/2025 has been entered.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claim(s) 1-4, 9, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Wonsik Kim et al. (Attention-based Ensemble for Deep Metric Learning, 2018, arXiv:1804.00382v2), hereafter referred to as W. Kim, in view of Cao et al. (Unifying Deep Local and Global Features for Image Search, Sep. 2020, arXiv:2001.05027v4), hereafter referred to as Cao, and further in view of Sungyeon Kim et al. (Proxy Anchor Loss for Deep Metric Learning, Mar. 2020, arXiv:2003.13911v1), hereafter referred to as S. Kim.

Regarding claim 1, W. Kim teaches [a] method implemented using a machine-learning architecture, the method comprising:
determining, for the input image from the input dataset, i) a first set of vectors that represent one or more global features that indicate at least data for the input image as a whole and ii) a second set of vectors that represent one or more local features that indicate at least geometry and spatial location data for a region from a plurality of regions for the input image, at least some regions from the plurality of regions having different local features:
“In deep metric learning, feature embedding function is modeled as a deep neural network. This feature embedding function embeds input images into feature embedding space with a certain desired condition” (W. Kim, page 1, paragraph 1).
“We call S(⋅) a spatial feature extractor and G(⋅) a global feature embedding function” (W. Kim, page 5, paragraph 3). S extracts features from images.
“A'_m(⋅) consists of a convolution layer of 480 kernels of size 1×1 to match the output of S(⋅) for the element-wise product” (W. Kim, page 6, paragraph 4). The output of S is a vector. In this case, S outputs a 480-dimensional vector.
“Note that, same feature extraction module is shared across different learners while individual learners have their own attention module A_m(⋅). The attention function A_m(S(x)) outputs an attention mask with same size as output of S(x) … product. Attended feature output of S(x) ∘ A_m(S(x)) (set of vectors that represent one or more features) is then fed into global feature embedding function G(⋅) to generate an embedding feature vector.” (W. Kim, page 5, paragraph 4). Each learner m generates a unique set of vectors by running the extracted features S through its own attention mechanism.

[Image: media_image1.png]
 “In attention-based ensemble, single feature embedding function (G) is trained while each learner learns different attention modules (A1,A2,A3)” (W. Kim, page 2, Fig. 1). In this example, there are three learners (m = 3) producing three sets of vectors. If S is producing global and local features (see mapping of Cao regarding claim 1 below), this results in three unique sets of vectors, each representing global and local features.
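The structure described in the quotations above (one shared extractor output S(x), one attention mask per learner, and a single shared embedding function G) can be illustrated with a minimal sketch. The vector sizes, mask values, weights, and helper names below are hypothetical and chosen only for readability; they are not taken from W. Kim, whose masks act on convolutional feature maps rather than on flat lists.

```python
# Toy illustration of B_m(x) = G(S(x) ∘ A_m(S(x))) for three learners.
# All numbers and helper names are illustrative assumptions.

def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]

def embed(features, weights):
    """Stand-in for the shared embedding G: a plain linear map."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def learner_embedding(s_x, mask, g_weights):
    """Attend to S(x) with one learner's mask, then apply the shared G."""
    return embed(hadamard(s_x, mask), g_weights)

s_x = [1.0, 2.0, 3.0, 4.0]            # shared extractor output S(x)
masks = [                             # one attention mask A_m(S(x)) per learner
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
g_weights = [[1.0, 1.0, 0.0, 0.0],    # the single shared G (2 x 4)
             [0.0, 0.0, 1.0, 1.0]]

embeddings = [learner_embedding(s_x, m, g_weights) for m in masks]
print(embeddings)
```

Because every learner starts from the same S(x) and ends in the same G, any difference between the three output vectors in this sketch comes only from the masks.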
computing, using the input image, a proxy-based loss function and pairwise-based loss function: 
“The loss for training aforementioned attention model is defined as: [equation image: media_image2.png] where {(x_i, c_i)} is a set of all training samples and labels, L_metric,(m)(⋅) is the loss for the isometric embedding for the m-th learner (loss) … divergence loss L_div is defined as the following: [equation image: media_image3.png]” (W. Kim, page 6, paragraph 2).
“The divergence loss encourages each learner to attend to the different part of the input image by increasing the distance between the points embedded by the input image” (W. Kim, page 6, paragraph 3).
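The divergence loss formula itself is reproduced in this action only as an image, so the following is a hedged sketch of one hinge-style form that matches the quoted description: embeddings of the same image produced by different learners are penalised when they lie closer together than a margin. The margin value, the use of squared Euclidean distance, and the function names are assumptions, not the reference's definition.

```python
from itertools import combinations

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def divergence_loss(learner_embeddings, margin=1.0):
    """Hinge penalty summed over every pair of learners for ONE input image.

    A pair contributes nothing once its embeddings are at least `margin`
    apart (in squared distance), so minimising the loss pushes the
    learners' embeddings of the same image away from each other.
    """
    return sum(
        max(0.0, margin - squared_distance(p, q))
        for p, q in combinations(learner_embeddings, 2)
    )

# Three learners embedding the same image: two nearly agree, one differs.
print(divergence_loss([[0.0, 0.0], [0.5, 0.0], [3.0, 4.0]]))
```

Only the first two learners in the example contribute to the penalty, since the third is already far from both.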
“We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [equation image: media_image4.png]” (W. Kim, page 7, paragraph 1). As stated in paragraph [0070] of the instant Specification, contrastive loss is a pairwise-based loss.
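To make concrete why contrastive loss is characterised as pairwise-based, the sketch below uses the widely taught per-pair form: a same-label pair is penalised by its squared distance, and a different-label pair is penalised only while it sits inside a margin. The exact expression in the cited image is not reproduced here; the margin default and function names are assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, same_label, margin=1.0):
    """Loss for ONE pair of embeddings; the full loss averages over pairs."""
    d = euclidean(u, v)
    if same_label:
        return d ** 2                      # pull matching pairs together
    return max(0.0, margin - d) ** 2       # push non-matching pairs past the margin

print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=True))    # far positive pair
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=False))   # far negative pair
```

The function takes two embeddings at a time, which is the sense in which the loss operates on pairs rather than on a sample and a class proxy.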
computing, using the proxy-based loss function and the pairwise-based loss function, a combined feature set using data from both a) the first set of vectors that represent one or more global features that indicate at least the geometry and spatial location data for the input image as a whole and b) the second set of vectors from that represent one or more local features that indicate at least the geometry and spatial location data for a region from a plurality of regions for the input image:
“The aim of the deep metric learning is to find an embedding function f : X → Y which maps samples x from a data space X to a feature embedding space Y so that f(xi) and f(xj) are closer in some metric when xi and xj are semantically similar” (W. Kim, page 2, paragraph 2).
“Let f : X → Y be an isometric embedding function between metric spaces X and Y” (W. Kim, page 3, paragraph 4); “Our goal is to approximate f with a deep neural network” (W. Kim, page 3, paragraph 5)
“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... If we combine them into one function b(x) = g(s(x)), x ∈ X, the combined function is also an isometric embedding b : X → Y between metric spaces X and Y” (W. Kim, page 4, paragraph 4); “We are interested in another case where there are multiple embedding functions b_m : X → Y with multiple s_m (sets of vectors) and a single g as the following: [equation image: media_image5.png]” (W. Kim, page 4, paragraph 7).
“With the attention-based ensemble, union of metric spaces by multiple s_m is mapped by a single embedding function g” (W. Kim, page 5, paragraph 1).
“[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing s_m with [equation image: media_image7.png]. As discussed above, each [equation image: media_image7.png] produces a unique set of vectors for the m-th attention pipeline. The embedding function g maps the vectors from each attention channel, each having a unique feature embedding, to the same embedding space.
“The loss for training aforementioned attention model is defined as: [equation image: media_image2.png] where {(x_i, c_i)} is a set of all training samples and labels, L_metric,(m)(⋅) is the loss for the isometric embedding for the m-th learner (loss)” (W. Kim, page 6, paragraph 2); “We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [equation image: media_image4.png]” (W. Kim, page 7, paragraph 1). The aforementioned model is a neural network serving as the embedding function. As stated in paragraph [0070] of the instant Specification, contrastive loss is a pairwise-based loss. Note that this function takes outputs of B_m, the embedding function, as inputs. This means that, to minimize this function and train the network, the outputs of the embedding function must be optimized.

[Image: media_image8.png]
“Fig. 3. The implementation of attention-based ensemble (ABE-M) using GoogLeNet” (W. Kim, page 7, Fig. 3). This illustrates the system of the paper, including a visualization of the embedding function being modified through the loss function.
…wherein the combined feature set corresponds to a final feature vector that is a single vector formed by merging at least one vector from the first set of vectors with at least one vector from the second set of vectors
“Attended feature output of S(x)◦Am(S(x)) is then fed into global feature embedding function G(・) to generate an embedding feature vector.” (W. Kim, page 5, paragraph 1)
“[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing s_m with [equation image: media_image7.png]. As discussed above, the embedding function g maps the vectors from each attention channel, merging multiple sets of vectors together as they’re mapped to the same embedding space.

[Image: media_image9.png] (W. Kim, page 7, Fig. 3). As seen in this diagram of the system, the outputs of the global feature embedding function across all attention channels can be considered a final feature vector that is a single vector of dimension 512 / M.
generating, using the combined feature set, a feature representation that is based on a final embedding output that is representative of content information, geometry information, and spatial information of the input image: 
“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... We are interested in another case where there are multiple embedding functions b_m : X → Y with multiple s_m and a single g (combined feature function) as the following: [equation image: media_image10.png]” (W. Kim, page 4, paragraphs 4-7). The outputs of g (feature representation) constitute a set of features from different embeddings (one per learner) mapped to the same space (final embedding output).
“[W]e design an architecture which has multiple attention modules for multiple learners. By attending to different locations for different learners, diverse feature embedding functions are trained” (W. Kim, page 2, paragraph 1); “For deep metric learning, ensemble concatenates the feature embeddings learned by multiple learners” (W. Kim, page 1, paragraph 2). Each learner represents different kinds of features from the inputs. The concatenated embedding is representative of this information as a whole.
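As a hedged illustration of the concatenation described in the quotation above, the sketch below joins M per-learner embeddings, each of dimension 512 / M as shown in Fig. 3, into one vector. The value M = 4, the placeholder embedding values, and the function name are assumptions made only for the example.

```python
def concatenate(learner_embeddings):
    """Join per-learner embedding vectors end to end into one vector."""
    final = []
    for embedding in learner_embeddings:
        final.extend(embedding)
    return final

M = 4                                   # assumed number of learners
per_learner_dim = 512 // M              # each learner contributes 512 / M values
# Placeholder embeddings: learner m outputs a constant vector of value m.
learner_embeddings = [[float(m)] * per_learner_dim for m in range(M)]

final_vector = concatenate(learner_embeddings)
print(len(final_vector))                # total dimension of the joined vector
```

Each learner's contribution occupies its own contiguous slice of the joined vector, so the per-learner information remains separable after concatenation.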
training, using the feature representation, a machine-learning model to output a prediction about an image based on inferences derived using the feature representation:
“The loss for training aforementioned attention model (machine-learning model) is defined as: [equation image: media_image2.png] where {(x_i, c_i)} is a set of all training samples and labels, L_metric,(m)(⋅) is the loss for the isometric embedding for the m-th learner” (W. Kim, page 6, paragraph 2); “We use contrastive loss as our distance metric loss function which is defined as the following: [equation image: media_image4.png] … A pair (B_p(x_i), B_q(x_i)) represents feature embeddings (feature representation[s]) of a single image embedded by two different learners. We call it self pair from now on while positive and negative pairs refer to pairs of feature embeddings with same labels and different labels, respectively” (W. Kim, page 7, paragraph 1). The model is trained using feature representations.
“During testing, we compute the feature embeddings (feature representations) for all the test images from our network. For every test image, we then retrieve top K similar images from the test set excluding test image itself ... We evaluate the model (machine-learning model) after every 1000 iteration and report the results for the iteration with highest Recall@1.” (W. Kim, page 8, paragraph 2). The trained model is used to make predictions about similar images.
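The evaluation procedure quoted above (retrieve the top K most similar test images for each test image, excluding the image itself) corresponds to the usual Recall@K metric. The following is a minimal sketch under the assumption that similarity is measured by squared Euclidean distance between embeddings; the toy embeddings, labels, and function name are hypothetical.

```python
def recall_at_k(embeddings, labels, k):
    """Fraction of queries whose top-k neighbours include a same-label image."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank every OTHER test image by squared distance to the query.
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(query, embeddings[j]))
        )
        if any(labels[j] == labels[i] for j in others[:k]):
            hits += 1
    return hits / len(embeddings)

# Toy 1-D embeddings; the last image is labelled "b" but embedded near the "a" group.
embeddings = [[0.0], [0.1], [5.0], [5.2], [0.3]]
labels = ["a", "a", "b", "b", "b"]

print(recall_at_k(embeddings, labels, k=1))
print(recall_at_k(embeddings, labels, k=3))
```

In this toy set, the mis-embedded last image is the only query that fails at K = 1, and it succeeds once K is large enough to reach the distant "b" group.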
W. Kim relates to deep metric learning for images and is analogous to the claimed invention.
While W. Kim fails to disclose the further limitations of the claim, Cao teaches a method, comprising:
maintaining an input dataset of a plurality of images for a training process: “Training details. We use the training set of the Google Landmarks dataset (GLD) [39], containing 1.2M images from 15k landmarks, and divide it into two subsets ‘train’/‘val’ with 80%/20% split. The ‘train’ split is used for the actual learning, and the ‘val’ split is used for validating the learned classifier as training progresses” (Cao, page 8, paragraph 2)
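The 80%/20% ‘train’/‘val’ division quoted above can be sketched as follows. Cao does not state in the quoted passage how images are assigned to each subset, so the seeded random shuffle, the function name, and the toy item list are assumptions.

```python
import random

def split_train_val(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and cut it into train and val subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, val = split_train_val(range(10))
print(len(train), len(val))
```

The two subsets are disjoint and together cover the whole input, so the validation images are never seen during learning.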
maintaining, for the input dataset, a plurality of features derived from data values of the input dataset: “We use the training set (input dataset) of the Google Landmarks dataset (GLD) [39], containing 1.2M images” (Cao, page 8, paragraph 2); “Given an image, we apply a convolutional neural network backbone to obtain two feature maps: [equation image: media_image11.png], representing shallower and deeper activations respectively” (Cao, page 4, paragraph 6)
maintaining, for an input image of the plurality of images in the input dataset, global features from the plurality of features and local features from the plurality of features: “Our proposed DELG (DEep Local and Global features) model (left) jointly extracts deep local and global features. Global features can be used in the first stage of a retrieval system, to efficiently select the most similar images (bottom). Local features can then be employed to re-rank top results (top-right), increasing precision of the system.” (Cao, page 2, Fig. 1). For global features to be used for similar image selection, and for local features to be used for ranking results, they each must be maintain[ed] in some capacity.
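The two-stage use of stored features described in Fig. 1 of Cao (global features to shortlist similar images, local features to re-rank the shortlist) can be sketched as below. This is only an illustration: the dot-product similarity, the count of shared local feature identifiers used as a stand-in for local feature matching, and all names and data are assumptions rather than Cao's implementation.

```python
def dot(u, v):
    """Dot product, used here as a simple global-descriptor similarity."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(query, database, shortlist_size=2):
    """Shortlist by global similarity, then re-rank by local feature overlap."""
    # Stage 1: keep the database images whose global descriptor is most similar.
    shortlist = sorted(
        database,
        key=lambda item: dot(query["global"], item["global"]),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: re-rank the shortlist by how many local features are shared.
    return sorted(
        shortlist,
        key=lambda item: len(query["local"] & item["local"]),
        reverse=True,
    )

query = {"global": [1.0, 0.0], "local": {"a", "b", "c"}}
database = [
    {"name": "A", "global": [0.9, 0.1], "local": {"a"}},
    {"name": "B", "global": [0.8, 0.2], "local": {"a", "b", "c"}},
    {"name": "C", "global": [0.0, 1.0], "local": {"a", "b", "c"}},
]

ranked = [item["name"] for item in retrieve(query, database)]
print(ranked)
```

Both kinds of feature have to be kept for every database image for this to work: image C is dropped in the first stage on its global descriptor alone, and the order of A and B is then decided by their stored local features.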
determining, for the input image from the input dataset
i) a first set of vectors that represent one or more global features that indicate at least data for the content of the input image as a whole: “A global feature … summarizes the contents of an image … Global features can learn similarity across very different poses where local features would not be able to find correspondences” (Cao, page 1, paragraph 2); “These two components produce a global feature [equation image: media_image12.png] that summarizes the discriminative contents of the whole image” (Cao, page 5, paragraph 2)
ii) a second set of vectors that represent one or more local features that indicate at least geometry and spatial location data for a region from a plurality of regions for the input image, at least some regions from the plurality of regions having different local features:
“The local descriptors (local features) are obtained as L = T(S), where L ∈ R^{H_S × W_S × C_T}” (Cao, page 5, paragraph 5)
“Local features [28,7,64,39,34], on the other hand, comprise descriptors and geometry information about specific image regions (plurality of regions); they are especially useful to match images depicting rigid objects” (Cao, page 1, paragraph 2);
“The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2). 

[Image: media_image13.png]
“Fig. 5: Examples of correct local feature matches, for image pairs depicting the same object/scene” (Cao, page 18, Figure 5). Here, each line corresponds to a local feature in some part of the image. Different image regions have different local features.
generating, using the combined feature set, a feature representation that is based on a final embedding output that is representative of content information, geometry information, and spatial information of the input image: “A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (content information) ... Local features, on the other hand, comprise descriptors and geometry information about specific image regions” (Cao, page 1, paragraph 2); “The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2).
storing the machine-learning model in memory for use by a system to detect objects with at least some features from the plurality of features:
“Our proposed DELG (DEep Local and Global features) model (left) jointly extracts deep local and global features” (Cao, page 2, Figure 1)
“For optimal performance, image retrieval requires semantic understanding of the types of objects that a user may be interested in, such that the system can distinguish between relevant objects versus clutter/background” (Cao, page 4, paragraph 2)

[Image: media_image14.png]
“Table 6: Feature extraction latency and database memory requirements for different image retrieval models … (C) DELG and DELG⋆ (the machine-learning model) are compared with different configurations. As a reference, we also provide numbers for DELF in the last rows” (Cao, page 13, Table 5)
	Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified W. Kim to obtain local and global features from the input images, as disclosed by Cao. Both types are necessary for high image retrieval performance (Cao, page 1, paragraph 2).
	While W. Kim and Cao fail to disclose the further limitations of the claim, S. Kim teaches a method of computing, using the input image, a proxy-based loss function and a pairwise-based loss function:
“We propose a novel metric learning loss that takes advantages of both pair-based and proxy-based methods” (S. Kim, page 2, Left column, paragraph 3).
“Specifically, for each proxy, the loss aims to pull data of the same class close to the proxy and to push others away in the embedding space” (S. Kim, page 2, Left column, paragraph 2).
“We evaluate our method with Inception-BN backbone while varying the sizes of input images: {224 x 224; 256 x 256; 324 x 324; 448 x 448}. Table 7 also shows that the accuracy improves consistently as the sizes of the input images increase” (S. Kim, page 11, left column, paragraph 3).
	S. Kim relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim and Cao to use proxy-based and pairwise-based loss methods, as disclosed by S. Kim. Doing so would achieve state-of-the-art accuracy through quick convergence while still accounting for data-to-data relations, as well as build resistance to noisy labels and outliers. See S. Kim, page 2, paragraph 2.
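For illustration only, the general idea of training on both a proxy-based term and a pairwise-based term can be sketched as follows. This is a minimal sketch under stated assumptions, not the loss of S. Kim or of the claims: the proxy term is a generic softmax cross-entropy over cosine similarities to one learnable proxy per class, the pairwise term is a generic contrastive loss, and the names `proxy_loss`, `contrastive_loss`, `combined_loss`, `scale`, `margin`, and `weight` are hypothetical.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale each row to unit length (eps avoids division by zero).
    return v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)

def proxy_loss(emb, labels, proxies, scale=16.0):
    # Generic proxy-based term (assumption): each embedding is compared
    # against one proxy per class, not against other embeddings.
    sims = scale * l2_normalize(emb) @ l2_normalize(proxies).T  # (N, C)
    sims = sims - sims.max(axis=1, keepdims=True)               # numerical stability
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

def contrastive_loss(emb, labels, margin=1.0):
    # Generic pairwise term (assumption): same-class pairs are pulled together,
    # different-class pairs are pushed at least `margin` apart.
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    same = labels[:, None] == labels[None, :]
    iu = np.triu_indices(len(labels), k=1)                      # each pair once
    pos = (dist[iu] ** 2) * same[iu]
    neg = (np.maximum(margin - dist[iu], 0.0) ** 2) * (~same[iu])
    return (pos + neg).mean()

def combined_loss(emb, labels, proxies, weight=0.5):
    # Hypothetical weighted sum of the two terms.
    return weight * proxy_loss(emb, labels, proxies) + (1.0 - weight) * contrastive_loss(emb, labels)

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))            # six embeddings of dimension four
labels = np.array([0, 0, 1, 1, 2, 2])
proxies = rng.normal(size=(3, 4))        # one proxy per class
loss = combined_loss(emb, labels, proxies)
```

The proxy term scales with the number of classes rather than the number of pairs, while the pairwise term still sees data-to-data relations, which is the trade-off the cited motivation refers to.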
Regarding claim 2, the rejection of claim 1 in view of W. Kim, Cao, and S. Kim is incorporated. W. Kim further discloses a method of generating a first set of embeddings corresponding to the first set of vectors based on the proxy-based loss function and the pairwise-based loss function; and generating a second set of embeddings corresponding to the second set of vectors based on the proxy-based loss function and the pairwise-based loss function:
“[T]he combined embedding function $B_m(x)$ for the learner m is defined as the following: [Image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing $s_m$ with [Image: media_image7.png]. As discussed regarding claim 1, each [Image: media_image7.png] produces a unique set of vectors for the mth attention pipeline.

[Image: media_image1.png]
(W. Kim, page 2, Fig. 1(b)). This model supports at least three attention pipelines (three set[s] of embeddings corresponding to … set[s] of vectors).
“The loss for training aforementioned attention model is defined as: [Image: media_image2.png] where $\{(x_i, c_i)\}$ is a set of all training samples and labels, $L_{\mathrm{metric},(m)}(\cdot)$ is the loss for the isometric embedding for the m-th learner (loss)” (W. Kim, page 6, paragraph 2); “We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [Image: media_image4.png]” (W. Kim, page 7, paragraph 1).
While W. Kim fails to disclose the further limitations of the claim, S. Kim further teaches a method of generating a … set of embeddings corresponding to the … set of vectors based on the proxy-based loss function and the pairwise-based loss function:
“The networks are trained to project data (set of vectors) onto an embedding space in which semantically similar data (e.g., images of the same class) are closely grouped together. Such a quality of the embedding space is given mainly by loss functions used for training the networks, and most of the losses are categorized into two classes: pair-based and proxy-based.”(S. Kim, page 1, left column, paragraph 1)
“We propose a novel metric learning loss that takes advantages of both pair-based and proxy-based methods” (S. Kim, page 2, Left column, paragraph 3); “Specifically, for each proxy, the loss aims to pull data of the same class close to the proxy and to push others away in the embedding space” (S. Kim, page 2, Left column, paragraph 2).
“Let x denote the embedding vector of the input, $p^+$ be the positive proxy, and $p^-$ be a negative proxy. The loss is then given by [Image: media_image15.png] where X is a batch of embedding vectors (set of embeddings)”
	S. Kim relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, and S. Kim to use proxy-based and pairwise-based loss methods for feature embeddings, as disclosed by S. Kim. Doing so would achieve state-of-the-art accuracy through quick convergence while still accounting for data-to-data relations, as well as build resistance to noisy labels and outliers. See S. Kim, page 2, paragraph 2.

	Regarding claim 3, the rejection of claim 2 in view of W. Kim, Cao, and S. Kim is incorporated. W. Kim further teaches a method, wherein generating a feature representation comprises: generating, from the first and second sets of embeddings, a final embedding output that is representative of content information, geometry information, and spatial information of the input image:
“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... We are interested in another case where there are multiple embedding functions $b_m$ : X → Y with multiple $s_m$ (sets of embeddings) and a single g as the following: [Image: media_image10.png]” (W. Kim, page 4, paragraphs 4-7); “With the attention-based ensemble, union of metric spaces by multiple $s_m$ is mapped by a single embedding function g” (W. Kim, page 5, paragraph 1); “the combined embedding function $B_m(x)$ for the learner m is defined as the following: [Image: media_image16.png]” (W. Kim, page 5, paragraph 3). The outputs of g constitute a set of features from different embeddings (one per learner) mapped to the same space (final embedding output).
“[W]e design an architecture which has multiple attention modules for multiple learners. By attending to different locations for different learners, diverse feature embedding functions are trained” (W. Kim, page 2, paragraph 1); “For deep metric learning, ensemble concatenates the feature embeddings learned by multiple learners” (W. Kim, page 1, paragraph 2). Each learner represents different kinds of features from the inputs. The concatenated embedding is representative of this information as a whole.
	While W. Kim fails to disclose the further limitations of the claim, Cao teaches a method of deriving features that are representative of content information, geometry information, and spatial information of the input image: “A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (content information) ... Local features, on the other hand, comprise descriptors and geometry information about specific image regions” (Cao, page 1, paragraph 2); “The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2).
	Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, and S. Kim to use both local and global features, as disclosed by Cao. Both types are necessary for high image retrieval performance (Cao, page 1, paragraph 2).

Regarding claim 4, the rejection of claim 1 in view of W. Kim, Cao, and S. Kim is incorporated. Cao further discloses a method, wherein maintaining the global features and the local features comprises: encoding, using an encoder module of the architecture, the input image to an attribute range comprising a range that spans from low-level descriptors of the input image to high-level descriptors of the input image: “Our first contribution is a unified model to represent both local and global features, using a convolutional neural network (CNN) (encoder module), referred to as DELG (DEep Local and Global features). This allows for efficient inference by extracting an image’s (input image of the input dataset) global feature, detected keypoints and local descriptors within a single model” (Cao, page 2, paragraph 3); “A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (high-level descriptors) ... Local features, on the other hand, comprise descriptors and geometry information about specific image regions (low-level descriptors)” (Cao, page 1, paragraph 2). Together, low and high-level features encompass an attribute range. Low and high-level features are being interpreted according to the definition given in paragraph [0023] of the instant Specification.
Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, and S. Kim to identify global and local features using a neural network encoder, as disclosed by Cao. Using Cao’s method would allow for efficient inference by extracting local and global features at very low cost within a single model (Cao, page 15, paragraph 3).

Regarding claim 9, the rejection of claim 1 in view of W. Kim, Cao, and S. Kim is incorporated. W. Kim further teaches a method, wherein: the input dataset comprises a plurality of images; and the data values of the input dataset are image pixel values for at least one image: “We follow earlier works [29,38] for preprocessing and unless stated otherwise, we use the input image size of 224×224. All training and testing images (plurality of images) are scaled such that their longer side is 256, keeping the aspect ratio fixed, and padding the shorter side to get 256×256 images.” (W. Kim, page 7, paragraph 3). While the units aren’t outright specified, 224x224 is referring to 224 pixels by 224 pixels. 

	Regarding claim 21, the rejection of claim 1 in view of W. Kim, Cao, and S. Kim is incorporated. W. Kim further teaches a method, wherein the combined feature set comprises a concatenated feature set:
“Ensemble is a widely used technique of training multiple learners to get a combined model, which performs better than individual models. For deep metric learning, ensemble concatenates the feature embeddings learned by multiple learners which often leads to better embedding space under given constraints on the distances between image pairs” (W. Kim, page 1, paragraph 2).
“we present M-way attention-based ensemble (ABE-M) which learns feature embedding with M diverse attention masks” (W. Kim, page 2, paragraph 1).
“With the attention-based ensemble, union of metric spaces by multiple $s_m$ is mapped by a single embedding function g” (W. Kim, page 5, paragraph 1). g maps multiple sets of inputs from different embeddings to the same space, concatenating them into the same space.
“[T]he combined embedding function $B_m(x)$ for the learner m is defined as the following: [Image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing $s_m$ with [Image: media_image7.png].
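The attention-based ensemble described in the quotations above can be illustrated with a short sketch. It is offered under explicit assumptions and does not reproduce the equation in the cited image: a shared feature is multiplied element-wise by a per-learner attention mask, each masked feature is mapped by a single shared function g, and the M learner embeddings are concatenated. The weight matrices `W_s`, `W_a`, `W_g`, the sizes, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_FEAT, D_EMB, M = 8, 6, 4, 3          # illustrative sizes; M learners

W_s = rng.normal(size=(D_IN, D_FEAT))        # shared feature extractor (hypothetical)
W_a = rng.normal(size=(M, D_FEAT, D_FEAT))   # one attention module per learner (hypothetical)
W_g = rng.normal(size=(D_FEAT, D_EMB))       # single shared embedding function g (hypothetical)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def learner_embedding(x, m):
    # Shared features, masked element-wise by the m-th attention module,
    # then mapped by the shared g.
    feat = np.tanh(x @ W_s)
    mask = sigmoid(feat @ W_a[m])            # attention mask with values in (0, 1)
    return (feat * mask) @ W_g               # element-wise product, then g

def ensemble_embedding(x):
    # Concatenate the M learner embeddings into one combined feature set.
    return np.concatenate([learner_embedding(x, m) for m in range(M)], axis=-1)

x = rng.normal(size=(2, D_IN))               # batch of two inputs
emb = ensemble_embedding(x)                  # shape (2, M * D_EMB)
```

In this sketch every learner shares g, so the learners differ only through their attention masks, and the concatenated output has one D_EMB-sized slice per learner.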

Claims 5-8 are rejected under 35 U.S.C. 103 as being unpatentable over Wonsik Kim et al. (Attention-based Ensemble for Deep Metric Learning, 2018, arXiv:1804.00382v2), hereafter referred to as W. Kim, in view of Cao et al. (Unifying Deep Local and Global Features for Image Search, Sep. 2020, arXiv:2001.05027v4), and further in view of Sungyeon Kim et al. (Proxy Anchor Loss for Deep Metric Learning, Mar. 2020, arXiv:2003.13911v1), hereafter referred to as S. Kim, and Ng et al. (SOLAR: Second-Order Loss and Attention for Image Retrieval, Aug. 2020, arXiv:2001.08972v5).

Regarding claim 5, the rejection of claim 1 in view of W. Kim, Cao, and S. Kim is incorporated. Ng teaches a method, wherein determining the first set of vectors that represent one or more global features comprises: generating an enhanced set of global features in response to processing the global features by a first second-order attention block; and determining the first set of vectors from the enhanced set of global features:
“In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global (global features)” (Ng, page 1, Abstract). The following method is applicable to global features.
“From an input image [Image: media_image17.png] processed through a Fully-Convolutional Network (FCN) denoted by $\theta$, we obtain a feature map [Image: media_image18.png] (features) where h, w and d are height, width and feature dimensionality, respectively” (Ng, page 4, paragraph 2).
“Finally, $f^{so}$ map (enhanced set of ... features) is obtained from the first-order features f by the second-order attention [Image: media_image19.png] where $\psi$ is another 1 x 1 convolution to control the influence of the attention. Thus, a new feature $f^{so}_{i,j}$ in the second-order map $f^{so}$ (reshaped to h x w x d), is a function of features from all locations in f ... This is referred to as the Second-Order Attention (SOA) block (second-order attention block)” (Ng, page 5, paragraph 1). $f^{so}$ maps a 2D image to a set of 3D feature values, the set of which can be considered a set of vectors.
	Ng relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, and S. Kim to enhance features with a second-order attention block, as disclosed by Ng. Second-order attention allows the learning of optimal relative contributions for individual locations in an image (Ng, page 4, paragraph 5), and has been shown to improve patch descriptors for image matching (Ng, page 1, paragraph 1).
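As a hedged illustration of how a second-order attention block can make each output feature a function of features from all locations, consider the following sketch. It assumes a non-local-style formulation (query, key, and value projections, a softmax over all location pairs, and a residual connection); the equation in the cited image is not reproduced here, and the projection matrices `W_q`, `W_k`, `W_v`, `W_psi` stand in for 1 x 1 convolutions and are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def second_order_attention(f, W_q, W_k, W_v, W_psi):
    # f: feature map of shape (h, w, d). A 1 x 1 convolution acts per location,
    # so it is written here as a matrix product over the channel dimension.
    h, w, d = f.shape
    flat = f.reshape(h * w, d)
    q, k, v = flat @ W_q, flat @ W_k, flat @ W_v
    attn = softmax(q @ k.T, axis=-1)          # (h*w, h*w): correlation between all location pairs
    out = (attn @ v) @ W_psi                  # W_psi controls the influence of the attention
    return f + out.reshape(h, w, d)           # residual connection (assumption)

rng = np.random.default_rng(0)
h, w, d = 3, 4, 5
f = rng.normal(size=(h, w, d))
W_q, W_k, W_v, W_psi = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
f_so = second_order_attention(f, W_q, W_k, W_v, W_psi)   # same shape as f
```

Because the attention matrix couples every location with every other location, changing the feature at one location changes the output at all locations, which is the sense in which the enhanced map carries second-order spatial information.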

	Regarding claim 6, the rejection of claim 5 in view of W. Kim, Cao, S. Kim, and Ng is incorporated. Cao further teaches a method, wherein: the enhanced set of global features comprises second-order information from spatial locations in high-level descriptors of the input image: “A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (high-level descriptors)” (Cao, page 1, paragraph 2).
Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Ng to use global features associated with high-level descriptors, as disclosed by Cao. These features have high recall, can learn similarity across very different poses, and excel at high image retrieval performance with compact representations. See (Cao, page 1, paragraph 2) and (Cao, page 3, paragraph 3).
	While Cao fails to disclose the further limitations of the claim, Ng further teaches a method, wherein: the enhanced set of global features comprises second-order information from spatial locations in high-level descriptors of the input image: “In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global” (Ng, page 1, Abstract); “From an input image [Image: media_image20.png] processed through a Fully-Convolutional Network (FCN) denoted by $\theta$, we obtain a feature map [Image: media_image21.png]” (Ng, page 4, paragraph 2); “Therefore we propose to generate a map $f^{so}$ (enhanced set of ... features) with local features $f^{so}_{i,j}$ that reflect the correlations between all spatial locations from within $f^{so}$, hence the ‘second-order’” (Ng, page 4, paragraph 5). As evident from the abstract, this is applicable to global and local features f. Note that global or local features are extracted from images as feature map f. The local features encoding second-order information in $f^{so}$ differ from these extracted features f, and can be applied to either global or local features.
	Ng relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Ng to use enhanced features comprising second-order information from spatial locations, as disclosed by Ng. Features with second-order information allow the learning of optimal relative contributions for individual locations in an image (Ng, page 4, paragraph 5), and have been shown to improve patch descriptors for image matching.

	Regarding claim 7, the rejection of claim 6 in view of W. Kim, Cao, S. Kim, and Ng is incorporated. Ng further teaches a method, wherein determining the second set of vectors that represent one or more local features comprises: generating an enhanced set of local features in response to processing the local features by a second second-order attention block; and determining the second set of vectors from the enhanced set of local features: 
“In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local (local features) and global” (Ng, page 1, Abstract). The following method is applicable to local features.
“From an input image [Image: media_image17.png] processed through a Fully-Convolutional Network (FCN) denoted by $\theta$, we obtain a feature map [Image: media_image18.png] (features) where h, w and d are height, width and feature dimensionality, respectively” (Ng, page 4, paragraph 2).
“Finally, $f^{so}$ map (enhanced set of ... features) is obtained from the first-order features f by the second-order attention [Image: media_image19.png] where $\psi$ is another 1 x 1 convolution to control the influence of the attention. Thus, a new feature $f^{so}_{i,j}$ in the second-order map $f^{so}$ (reshaped to h x w x d), is a function of features from all locations in f ... This is referred to as the Second-Order Attention (SOA) block (second-order attention block)” (Ng, page 5, paragraph 1). $f^{so}$ maps a 2D image to a set of 3D feature values, the set of which can be considered a set of vectors.
Ng relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Ng to enhance features with a second-order attention block, as disclosed by Ng. Second-order attention allows the learning of optimal relative contributions for individual locations in an image (Ng, page 4, paragraph 5), and has been shown to improve patch descriptors for image matching.
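
For illustration only, the kind of operation the quoted SOA passage describes can be sketched as follows: each output feature is the first-order feature plus a weighted sum of the features at all spatial locations. This is a simplified non-local attention written for this discussion, not Ng's implementation. The function names, the use of raw dot products as similarity scores, and the scalar `psi_scale` standing in for the 1 x 1 convolution ψ are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def second_order_attention(features, psi_scale=1.0):
    """Return an enhanced map in which every location depends on all locations.

    features: the map f flattened to a list of h*w vectors, each of length d.
    psi_scale: hypothetical scalar stand-in for the 1 x 1 convolution psi.
    """
    enhanced = []
    for query in features:
        # Similarity of this location to every location (assumed: dot product).
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        weights = softmax(scores)
        # Weighted sum of the features at all locations, channel by channel.
        attended = [
            sum(w * vec[c] for w, vec in zip(weights, features))
            for c in range(len(query))
        ]
        # Residual form: first-order feature plus scaled attention output.
        enhanced.append([f + psi_scale * a for f, a in zip(query, attended)])
    return enhanced
```

Setting `psi_scale` to 0 returns the first-order features unchanged, which mirrors the quoted statement that ψ controls the influence of the attention.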

Regarding claim 8, the rejection of claim 7 in view of W. Kim, Cao, S. Kim, and Ng is incorporated. Cao further teaches a method, wherein: the enhanced set of local features comprises second-order information from spatial locations in local-level descriptors of the input image: “Local features, on the other hand, comprise descriptors and geometry information about specific image regions (local-level descriptors)” (Cao, page 1, paragraph 2)
Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Ng to use local features associated with local-level descriptors, as disclosed by Cao. These features have high precision, reliably learn image similarity, and can perform spatial matching to produce reliable and interpretable scores. See (Cao, page 1, paragraph 2) and (Cao, page 2, paragraph 3).
Ng further teaches a method, wherein: the enhanced set of local features comprises second-order information from spatial locations in low-level descriptors of the input image: “In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global” (Ng, page 1, Abstract); “From an input image [equation image: media_image20.png] processed through a Fully-Convolutional Network (FCN) denoted by θ, we obtain a feature map [equation image: media_image21.png]” (Ng, page 4, paragraph 2); “Therefore we propose to generate a map f^{so} (enhanced set of ... features) with local features f^{so}_{i,j} that reflect the correlations between all spatial locations from within f^{so}, hence the `second-order'” (Ng, page 4, paragraph 5). As evident from the abstract, this is applicable to global and local features f. Note that global or local features are extracted from images as feature map f. The local features encoding second-order information in f^{so} differ from these extracted features f, and can be applied to either global or local features.
Ng relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Ng to use enhanced features comprising second-order information from spatial locations, as disclosed by Ng. Features with second-order information allow the learning of optimal relative contributions for individual locations in an image (Ng, page 4, paragraph 5), and have been shown to improve patch descriptors for image matching.

Claim(s) 10-13 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wonsik Kim et al. (Attention-based Ensemble for Deep Metric Learning, 2018, arXiv:1804.00382v2), hereafter referred to as W. Kim, in view of Cao et al. (Unifying Deep Local and Global Features for Image Search, Sep. 2020, arXiv:2001.05027v4), hereafter referred to as Cao, and further in view of Sungyeon Kim et al. (Proxy Anchor Loss for Deep Metric Learning, Mar. 2020, arXiv:2003.13911v1), hereafter referred to as S. Kim, and Wang et al. (US 2019/0311223 A1, Image Processing Methods and Apparatus, and Electronic Devices).

Regarding claim 10, W. Kim teaches a method, comprising:
determining, for the input image from the input dataset, i) a first set of vectors that represent one or more global features that indicate at least data for the input image as a whole and ii) a second set of vectors that represent one or more local features that indicate at least geometry and spatial location data for a region from a plurality of regions for the input image, at least some regions from the plurality of regions having different local features:
“In deep metric learning, feature embedding function is modeled as a deep neural network. This feature embedding function embeds input images into feature embedding space with a certain desired condition” (W. Kim, page 1, paragraph 1).
“We call S(⋅) a spatial feature extractor and G(⋅) a global feature embedding function” (W. Kim, page 5, paragraph 3). S extracts features from images.
“A'_m(⋅) consists of a convolution layer of 480 kernels of size 1×1 to match the output of S(⋅) for the element-wise product” (W. Kim, page 6, paragraph 4). The output of S is a vector. In this case, S outputs a 480-dimensional vector.
“Note that, same feature extraction module is shared across different learners while individual learners have their own attention module A_m(⋅). The attention function A_m(S(x)) outputs an attention mask with same size as output of S(x). Attended feature output of S(x) ∘ A_m(S(x)) (set of vectors that represent one or more features) is then fed into global feature embedding function G(⋅) to generate an embedding feature vector.” (W. Kim, page 5, paragraph 4). Each learner m generates a unique set of vectors by running the extracted features S through its own attention mechanism.

[Figure image: media_image1.png] “In attention-based ensemble, single feature embedding function (G) is trained while each learner learns different attention modules (A1,A2,A3)” (W. Kim, page 2, Fig. 1). In this example, there are three learners (m = 3) producing three sets of vectors. If S is producing global and local features (see mapping of Cao regarding claim 1 below), this results in three unique sets of vectors, each representing global and local features.
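
The mapping above, in which one shared extractor output S(x) is masked differently by each learner, can be sketched as below. This is a minimal illustration under stated assumptions: the helper names, the four-element feature vector, and the three hand-written masks are hypothetical and are not taken from W. Kim.

```python
def attended_features(spatial_features, mask):
    """Element-wise product S(x) ∘ A_m(S(x)) for a single learner m."""
    return [s * a for s, a in zip(spatial_features, mask)]


def ensemble_inputs(spatial_features, masks):
    """One attended vector per learner, all computed from the same S(x)."""
    return [attended_features(spatial_features, mask) for mask in masks]


# Hypothetical flattened output of the shared extractor S(x).
s_x = [0.5, 2.0, 1.0, 4.0]

# Hypothetical attention masks for M = 3 learners, same size as S(x).
masks = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

per_learner = ensemble_inputs(s_x, masks)
```

Each entry of `per_learner` is what would be passed to the shared embedding function G in this reading, so three learners yield three distinct sets of vectors from one input.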
computing, using the input image, a proxy-based loss function and pairwise-based loss function: 
“The loss for training aforementioned attention model is defined as: [equation image: media_image2.png] where {(x_i, c_i)} is a set of all training samples and labels, L_{metric,(m)}(⋅) is the loss for the isometric embedding for the m-th learner (loss) … divergence loss L_div is defined as the following: [equation image: media_image3.png]” (W. Kim, page 6, paragraph 2).
“The divergence loss encourages each learner to attend to the different part of the input image by increasing the distance between the points embedded by the input image” (W. Kim, page 6, paragraph 3).
“We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [equation image: media_image4.png]” (W. Kim, page 7, paragraph 1). As stated in paragraph [0070] of the instant Specification, contrastive loss is a pairwise-based loss.
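
To make concrete why contrastive loss is characterized as pairwise-based, the sketch below computes a loss from a pair of embeddings and a same-label flag. It uses the commonly cited textbook form (squared distance for same-label pairs, squared hinge on a margin for different-label pairs). The exact equation in W. Kim is in the image above and may differ in scaling or notation; the function names and the default margin are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Loss for one pair of embeddings (assumed textbook form).

    Same-label pairs are pulled together; different-label pairs are
    pushed apart until they are at least `margin` away from each other.
    """
    d = euclidean(emb_a, emb_b)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

The loss is defined on pairs of embedded samples rather than on a sample and a class proxy, which is the distinction the rejection relies on.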
computing, using the proxy-based loss function and the pairwise-based loss function, a combined feature set using data from both a) the first set of vectors that represent one or more global features that indicate at least the geometry and spatial location data for the input image as a whole and b) the second set of vectors from that represent one or more local features that indicate at least the geometry and spatial location data for a region from a plurality of regions for the input image:
“The aim of the deep metric learning is to find an embedding function f : X → Y which maps samples x from a data space X to a feature embedding space Y so that f(xi) and f(xj) are closer in some metric when xi and xj are semantically similar” (W. Kim, page 2, paragraph 2).
“Let f : X → Y be an isometric embedding function between metric spaces X and Y” (W. Kim, page 3, paragraph 4); “Our goal is to approximate f with a deep neural network” (W. Kim, page 3, paragraph 5)
“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... If we combine them into one function b(x) = g(s(x)), x ∈ X, the combined function is also an isometric embedding b : X → Y between metric spaces X and Y” (W. Kim, page 4, paragraph 4); “We are interested in another case where there are multiple embedding functions b_m : X → Y with multiple s_m (sets of vectors) and a single g as the following: [equation image: media_image5.png]” (W. Kim, page 4, paragraph 7).
“With the attention-based ensemble, union of metric spaces by multiple s_m is mapped by a single embedding function g” (W. Kim, page 5, paragraph 1).
“[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing s_m with [equation image: media_image7.png]. As discussed above, each [equation image: media_image7.png] produces a unique set of vectors for the mth attention pipeline. The embedding function g maps the vectors from each attention channel, each having a unique feature embedding, to the same embedding space.
“The loss for training aforementioned attention model is defined as: [equation image: media_image2.png] where {(x_i, c_i)} is a set of all training samples and labels, L_{metric,(m)}(⋅) is the loss for the isometric embedding for the m-th learner (loss)” (W. Kim, page 6, paragraph 2); “We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [equation image: media_image4.png]” (W. Kim, page 7, paragraph 1). The aforementioned model is a neural network serving as the embedding function. As stated in paragraph [0070] of the instant Specification, contrastive loss is a pairwise-based loss. Note that this function takes outputs of B_m, the embedding function, as inputs. That means to minimize this function and train the network, embedding function outputs must be optimized.

[Figure image: media_image8.png] “Fig. 3. The implementation of attention-based ensemble (ABE-M) using GoogLeNet” (W. Kim, page 7, Fig. 3). This illustrates the system of the paper, including a visualization of the embedding function being modified through the loss function.
…wherein the combined feature set corresponds to a final feature vector that is a single vector formed by merging at least one vector from the first set of vectors with at least one vector from the second set of vectors
“Attended feature output of S(x) ∘ A_m(S(x)) is then fed into global feature embedding function G(⋅) to generate an embedding feature vector.” (W. Kim, page 5, paragraph 1)
“[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing s_m with [equation image: media_image7.png]. As discussed above, the embedding function g maps the vectors from each attention channel, merging multiple sets of vectors together as they’re mapped to the same embedding space.

[Figure image: media_image9.png] (W. Kim, page 7, Fig. 3). As seen in this diagram of the system, the outputs of the global feature embedding function across all attention channels can be considered a final feature vector that is a single vector of dimension 512 / M.
generating, using the combined feature set, a feature representation that is based on a final embedding output that is representative of content information, geometry information, and spatial information of the input image: 
“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... We are interested in another case where there are multiple embedding functions b_m : X → Y with multiple s_m and a single g (combined feature function) as the following: [equation image: media_image10.png]” (W. Kim, page 4, paragraphs 4-7). The outputs of g (feature representation) constitute a set of features from different embeddings (one per learner) mapped to the same space (final embedding output).
“[W]e design an architecture which has multiple attention modules for multiple learners. By attending to different locations for different learners, diverse feature embedding functions are trained” (W. Kim, page 2, paragraph 1); “For deep metric learning, ensemble concatenates the feature embeddings learned by multiple learners” (W. Kim, page 1, paragraph 2). Each learner represents different kinds of features from the inputs. The concatenated embedding is representative of this information as a whole.
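
The concatenation step relied on above can be sketched as follows: M per-learner embeddings, each of dimension 512 / M, are joined end to end into one vector. The value M = 8, the placeholder embedding values, and the function name are assumptions chosen only to make the example concrete.

```python
def concatenate_embeddings(learner_embeddings):
    """Join per-learner embedding vectors end to end into a single vector."""
    combined = []
    for embedding in learner_embeddings:
        combined.extend(embedding)
    return combined


M = 8                      # assumed number of learners
per_learner_dim = 512 // M  # 64 dimensions per learner when M = 8

# Placeholder embeddings: learner m outputs a constant vector of value m.
learner_embeddings = [[float(m)] * per_learner_dim for m in range(M)]

final_vector = concatenate_embeddings(learner_embeddings)
```

In this sketch the concatenated vector has 512 entries, and each contiguous block of 64 entries comes from a different learner.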
training, using the feature representation, a machine-learning model to output a prediction about an image based on inferences derived using the feature representation:
“The loss for training aforementioned attention model (machine-learning model) is defined as: [equation image: media_image2.png] where {(x_i, c_i)} is a set of all training samples and labels, L_{metric,(m)}(⋅) is the loss for the isometric embedding for the m-th learner” (W. Kim, page 6, paragraph 2); “We use contrastive loss as our distance metric loss function which is defined as the following: [equation image: media_image4.png] … A pair (Bp(xi), Bq(xi)) represents feature embeddings (feature representation[s]) of a single image embedded by two different learners. We call it self pair from now on while positive and negative pairs refer to pairs of feature embeddings with same labels and different labels, respectively” (W. Kim, page 7, paragraph 1). The model is trained using feature representations.
“During testing, we compute the feature embeddings (feature representations) for all the test images from our network. For every test image, we then retrieve top K similar images from the test set excluding test image itself ... We evaluate the model (machine-learning model) after every 1000 iteration and report the results for the iteration with highest Recall@1.” (W. Kim, page 8, paragraph 2). The trained model is used to make predictions about similar images.
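
The retrieval evaluation quoted above (retrieve the top K most similar test images, excluding the query itself, and report Recall@K) can be sketched as below. This is a generic Recall@K computation under the assumption that similarity is measured by Euclidean distance between embeddings; the function name and inputs are hypothetical and this is not W. Kim's evaluation code.

```python
import math


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-label item in their top k.

    Each item is used as a query against all other items; the query
    itself is excluded from its own retrieval list.
    """
    hits = 0
    for i, query in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        # Rank the remaining items by distance to the query, nearest first.
        others.sort(key=lambda j: math.dist(query, embeddings[j]))
        if any(labels[j] == labels[i] for j in others[:k]):
            hits += 1
    return hits / len(embeddings)
```

With K = 1 this reduces to checking whether each query's single nearest neighbor shares its label, which corresponds to the Recall@1 figure mentioned in the quotation.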
W. Kim relates to deep metric learning for images and is analogous to the claimed invention.
While W. Kim fails to disclose the further limitations of the claim, Cao teaches a method, comprising:
maintaining an input dataset of a plurality of images for a training process: “Training details. We use the training set of the Google Landmarks dataset (GLD) [39], containing 1.2M images from 15k landmarks, and divide it into two subsets ‘train’/‘val’ with 80%/20% split. The ‘train’ split is used for the actual learning, and the ‘val’ split is used for validating the learned classifier as training progresses” (Cao, page 8, paragraph 2)
maintaining, for the input dataset, a plurality of features derived from data values of the input dataset: “We use the training set (input dataset) of the Google Landmarks dataset (GLD) [39], containing 1.2M images” (Cao, page 8, paragraph 2); “Given an image, we apply a convolutional neural network backbone to obtain two feature maps: [equation image: media_image11.png], representing shallower and deeper activations respectively” (Cao, page 4, paragraph 6)
maintaining, for an input image of the plurality of images in the input dataset, global features from the plurality of features and local features from the plurality of features: “Our proposed DELG (DEep Local and Global features) model (left) jointly extracts deep local and global features. Global features can be used in the first stage of a retrieval system, to efficiently select the most similar images (bottom). Local features can then be employed to re-rank top results (top-right), increasing precision of the system.” (Cao, page 2, Fig. 1). For global features to be used for similar image selection, and for local features to be used for ranking results, they each must be maintain[ed] in some capacity
determining, for the input image from the input dataset
i) a first set of vectors that represent one or more global features that indicate at least data for the content of the input image as a whole: “A global feature … summarizes the contents of an image … Global features can learn similarity across very different poses where local features would not be able to find correspondences” (Cao, page 1, paragraph 2); “These two components produce a global feature [image: media_image12.png] that summarizes the discriminative contents of the whole image” (Cao, page 5, paragraph 2)
ii) a second set of vectors that represent one or more local features that indicate at least geometry and spatial location data for a region from a plurality of regions for the input image, at least some regions from the plurality of regions having different local features:
“The local descriptors (local features) are obtained as L = T(S), where $L \in \mathbb{R}^{H_S \times W_S \times C_T}$” (Cao, page 5, paragraph 5)
“Local features [28,7,64,39,34], on the other hand, comprise descriptors and geometry information about specific image regions (plurality of regions); they are especially useful to match images depicting rigid objects” (Cao, page 1, paragraph 2);
“The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2). 

[image: media_image13.png]
“Fig. 5: Examples of correct local feature matches, for image pairs depicting the same object/scene” (Cao, page 18, Figure 5). Here, each line corresponds to a local feature in some part of the image. Different image regions have different local features.
storing the machine-learning model in memory for use by a system to detect objects with at least some features from the plurality of features
“Our proposed DELG (DEep Local and Global features) model (left) jointly extracts deep local and global features” (Cao, page 2, Figure 1)
“For optimal performance, image retrieval requires semantic understanding of the types of objects that a user may be interested in, such that the system can distinguish between relevant objects versus clutter/background” (Cao, page 4, paragraph 2)

[image: media_image14.png]
“Table 6: Feature extraction latency and database memory requirements for different image retrieval models … (C) DELG and DELG⋆ (the machine-learning model) are compared with different configurations. As a reference, we also provide numbers for DELF in the last rows” (Cao, page 13, Table 6)
	Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified W. Kim to obtain local and global features from the input images, as disclosed by Cao. Both types are necessary for high image retrieval performance (Cao, page 1, paragraph 2).
	While W. Kim and Cao fail to disclose the further limitations of the claim, S. Kim teaches a method of computing, using the input image, a proxy-based loss function and a pairwise-based loss function:
“We propose a novel metric learning loss that takes advantages of both pair-based and proxy-based methods” (S. Kim, page 2, Left column, paragraph 3).
“Specifically, for each proxy, the loss aims to pull data of the same class close to the proxy and to push others away in the embedding space” (S. Kim, page 2, Left column, paragraph 2).
“We evaluate our method with Inception-BN backbone while varying the sizes of input images: {224 x 224; 256 x 256; 324 x 324; 448 x 448}. Table 7 also shows that the accuracy improves consistently as the sizes of the input images increase” (S. Kim, page 11, left column, paragraph 3).
	S. Kim relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim and Cao to use proxy-based and pairwise-based loss methods, as disclosed by S. Kim. Doing so would achieve state-of-the-art accuracy through quick convergence while still accounting for data-to-data relations, as well as build resistance to noisy labels and outliers. See S. Kim, page 2, paragraph 2.
While W. Kim, Cao, and S. Kim fail to disclose the further limitations of the claim, Wang teaches [a] system comprising a processing device and a non-transitory machine-readable storage device storing instructions that are executable by the processing device to cause performance of operations: “According to still another aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, which has computer instructions stored thereon, where execution of the computer-readable instructions by a processor causes the processor to implement the image processing method as described above” (Wang, [0007]).
Wang relates to deep metric learning for images and is analogous to the claimed invention. The combination of W. Kim, Cao, and S. Kim teaches a computational method of processing images with neural networks. The claimed invention improves upon this method by including computer hardware to store and execute its instructions. Wang teaches a computer apparatus that can execute computational methods, applicable to methods using neural networks for deep metric learning in images. A person of ordinary skill in the art would have recognized that running a computer method on computer hardware would lead to the predictable result of the method’s algorithm being performed by said computer, and would improve the method by allowing it to be used for practical purposes (MPEP 2143 I. (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results).

Regarding claim 11, the rejection of claim 10 in view of W. Kim, Cao, S. Kim, and Wang is incorporated. W. Kim further discloses a method of generating a first set of embeddings corresponding to the first set of vectors based on the proxy-based loss function and the pairwise-based loss function; and generating a second set of embeddings corresponding to the second set of vectors based on the proxy-based loss function and the pairwise-based loss function:
“[T]he combined embedding function $B_m(x)$ for the learner m is defined as the following: [image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing $s_m$ with [image: media_image7.png]. As discussed regarding claim 1, each [image: media_image7.png] produces a unique set of vectors for the mth attention pipeline.

[image: media_image1.png] (W. Kim, page 2, Fig. 1(b)). This model supports at least three attention pipelines (three set[s] of embeddings corresponding to … set[s] of vectors).
“The loss for training aforementioned attention model is defined as: [image: media_image2.png] where $\{(x_i, c_i)\}$ is a set of all training samples and labels, $L_{\text{metric},(m)}(\cdot)$ is the loss for the isometric embedding for the m-th learner (loss)” (W. Kim, page 6, paragraph 2); “We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [image: media_image4.png]” (W. Kim, page 7, paragraph 1).
While W. Kim fails to disclose the further limitations of the claim, S. Kim further teaches a method of generating a … set of embeddings corresponding to the … set of vectors based on the proxy-based loss function and the pairwise-based loss function:
“The networks are trained to project data (set of vectors) onto an embedding space in which semantically similar data (e.g., images of the same class) are closely grouped together. Such a quality of the embedding space is given mainly by loss functions used for training the networks, and most of the losses are categorized into two classes: pair-based and proxy-based.” (S. Kim, page 1, left column, paragraph 1)
“We propose a novel metric learning loss that takes advantages of both pair-based and proxy-based methods” (S. Kim, page 2, Left column, paragraph 3); “Specifically, for each proxy, the loss aims to pull data of the same class close to the proxy and to push others away in the embedding space” (S. Kim, page 2, Left column, paragraph 2).
“Let x denote the embedding vector of the input, $p^+$ be the positive proxy, and $p^-$ be a negative proxy. The loss is then given by [image: media_image15.png] where X is a batch of embedding vectors (set of embeddings)”
	S. Kim relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Wang to use proxy-based and pairwise-based loss methods for feature embeddings, as disclosed by S. Kim. Doing so would achieve state-of-the-art accuracy through quick convergence while still accounting for data-to-data relations, as well as build resistance to noisy labels and outliers. See S. Kim, page 2, paragraph 2.

	Regarding claim 12, the rejection of claim 11 in view of W. Kim, Cao, S. Kim, and Wang is incorporated. W. Kim further teaches a method, wherein generating a feature representation comprises: generating, from the first and second sets of embeddings, a final embedding output that is representative of content information, geometry information, and spatial information of the input image:
“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... We are interested in another case where there are multiple embedding functions $b_m$ : X → Y with multiple $s_m$ (sets of embeddings) and a single g as the following: [image: media_image10.png]” (W. Kim, page 4, paragraphs 4-7); “With the attention-based ensemble, union of metric spaces by multiple $s_m$ is mapped by a single embedding function g” (W. Kim, page 5, paragraph 1); “the combined embedding function $B_m(x)$ for the learner m is defined as the following: [image: media_image16.png]” (W. Kim, page 5, paragraph 3). The outputs of g constitute a set of features from different embeddings (one per learner) mapped to the same space (final embedding output).
“[W]e design an architecture which has multiple attention modules for multiple learners. By attending to different locations for different learners, diverse feature embedding functions are trained” (W. Kim, page 2, paragraph 1); “For deep metric learning, ensemble concatenates the feature embeddings learned by multiple learners” (W. Kim, page 1, paragraph 2). Each learner represents different kinds of features from the inputs. The concatenated embedding is representative of this information as a whole.
	While W. Kim fails to disclose the further limitations of the claim, Cao teaches a method of deriving features that are representative of content information, geometry information, and spatial information of the input image: “A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (content information) ... Local features, on the other hand, comprise descriptors and geometry information about specific image regions” (Cao, page 1, paragraph 2); “The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2).
	Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Wang to use both local and global features, as disclosed by Cao. Both types are necessary for high image retrieval performance (Cao, page 1, paragraph 2).

Regarding claim 13, the rejection of claim 10 in view of W. Kim, Cao, S. Kim, and Wang is incorporated. Cao further discloses a method, wherein maintaining the global features and the local features comprises: encoding, using an encoder module of the architecture, the input image to an attribute range comprising a range that spans from low-level descriptors of the input image to high-level descriptors of the input image: “Our first contribution is a unified model to represent both local and global features, using a convolutional neural network (CNN) (encoder module), referred to as DELG (DEep Local and Global features). This allows for efficient inference by extracting an image’s (input image of the input dataset) global feature, detected keypoints and local descriptors within a single model” (Cao, page 2, paragraph 3); “A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (high-level descriptors) ... Local features, on the other hand, comprise descriptors and geometry information about specific image regions (low-level descriptors)” (Cao, page 1, paragraph 2). Together, low and high-level features encompass an attribute range. Low and high-level features are being interpreted according to the definition given in paragraph [0023] of the instant Specification.
Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Wang to identify global and local features using a neural network encoder, as disclosed by Cao. Using Cao’s method would allow for efficient inference by extracting local and global features at very low cost within a single model (Cao, page 15, paragraph 3).

Regarding claim 19, W. Kim teaches a method, comprising:
determining, for the input image from the input dataset, i) a first set of vectors that represent one or more global features that indicate at least data for the input image as a whole and ii) a second set of vectors that represent one or more local features that indicate at least geometry and spatial location data for a region from a plurality of regions for the input image, at least some regions from the plurality of regions having different local features:
“In deep metric learning, feature embedding function is modeled as a deep neural network. This feature embedding function embeds input images into feature embedding space with a certain desired condition” (W. Kim, page 1, paragraph 1).
“We call S(⋅) a spatial feature extractor and G(⋅) a global feature embedding function” (W. Kim, page 5, paragraph 3). S extracts features from images.
“$A'_m(\cdot)$ consists of a convolution layer of 480 kernels of size 1×1 to match the output of S(⋅) for the element-wise product” (W. Kim, page 6, paragraph 4). The output of S is a vector. In this case, S outputs a 480-dimensional vector.
“Note that, same feature extraction module is shared across different learners while individual learners have their own attention module $A_m(\cdot)$. The attention function $A_m(S(x))$ outputs an attention mask with same size as output of S(x). Attended feature output of $S(x) \circ A_m(S(x))$ (set of vectors that represent one or more features) is then fed into global feature embedding function G(⋅) to generate an embedding feature vector.” (W. Kim, page 5, paragraph 4). Each learner m generates a unique set of vectors by running the extracted features S through its own attention mechanism.

[image: media_image1.png] “In attention-based ensemble, single feature embedding function (G) is trained while each learner learns different attention modules (A1,A2,A3)” (W. Kim, page 2, Fig. 1). In this example, there are three learners (m = 3) producing three sets of vectors. If S is producing global and local features (see mapping of Cao regarding claim 1 below), this results in three unique sets of vectors, each representing global and local features.
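For illustration only, the data flow quoted from W. Kim (a shared extractor S, one attention module A_m per learner, an element-wise product, and a single shared embedding function G) can be sketched as follows. This is a minimal sketch under stated assumptions and not W. Kim's network: the toy functions `S`, `G`, `A1`, `A2`, and `A3`, the flat-list feature representation, and the fixed attention masks are all hypothetical stand-ins for trained layers.

```python
def hadamard(a, b):
    """Element-wise product of two equal-length feature lists."""
    return [x * y for x, y in zip(a, b)]

def attention_ensemble(x, S, attention_modules, G):
    """Return one embedding per learner: G(S(x) * A_m(S(x)))."""
    features = S(x)  # the same extracted features are shared by every learner
    outputs = []
    for A_m in attention_modules:
        mask = A_m(features)                 # mask has the same size as S(x)
        attended = hadamard(features, mask)  # attended feature for learner m
        outputs.append(G(attended))          # single shared embedding function
    return outputs

# Hypothetical stand-ins for the trained components.
def S(x):
    return [2.0 * v for v in x]

def G(f):
    return [sum(f)]

def A1(f):
    return [1.0, 0.0, 0.0]

def A2(f):
    return [0.0, 1.0, 0.0]

def A3(f):
    return [0.0, 0.0, 1.0]

per_learner = attention_ensemble([1.0, 2.0, 3.0], S, [A1, A2, A3], G)
# Concatenating the per-learner embeddings gives the ensemble embedding.
ensemble = [v for emb in per_learner for v in emb]
```

Because each toy mask keeps a different position of S(x), each of the three learners yields a different vector from the same input, which mirrors the three-learner arrangement of the cited figure.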
computing, using the input image, a proxy-based loss function and pairwise-based loss function: 
“The loss for training aforementioned attention model is defined as: [image: media_image2.png] where $\{(x_i, c_i)\}$ is a set of all training samples and labels, $L_{\text{metric},(m)}(\cdot)$ is the loss for the isometric embedding for the m-th learner (loss) … divergence loss $L_{\text{div}}$ is defined as the following: [image: media_image3.png]” (W. Kim, page 6, paragraph 2).
“The divergence loss encourages each learner to attend to the different part of the input image by increasing the distance between the points embedded by the input image” (W. Kim, page 6, paragraph 3).
“We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [image: media_image4.png]” (W. Kim, page 7, paragraph 1). As stated in paragraph [0070] of the instant Specification, contrastive loss is a pairwise-based loss.
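For illustration only, a pairwise contrastive loss of the general kind referenced above can be sketched as follows. This is the commonly used textbook form and is an assumption: W. Kim's exact equation appears only in the cited image and may differ in scaling or in how the margin is applied. The function name `contrastive_loss` and the `margin` default are hypothetical.

```python
import math

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Generic contrastive loss for one pair of embeddings.

    Same-class pairs are penalised by their squared distance (pulled together);
    different-class pairs are penalised only while closer than the margin
    (pushed apart). This is an assumed textbook form, not the cited equation.
    """
    d = math.dist(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

# A same-class pair far apart incurs a large loss.
loss_pos = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=True)
# A different-class pair already beyond the margin incurs no loss.
loss_neg_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=False)
# A different-class pair inside the margin is penalised.
loss_neg_near = contrastive_loss([0.0, 0.0], [0.0, 1.0], same_class=False, margin=2.0)
```

The loss depends only on the relation between two data points, which is what makes it pairwise-based rather than proxy-based.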
computing, using the proxy-based loss function and the pairwise-based loss function, a combined feature set using data from both a) the first set of vectors that represent one or more global features that indicate at least the geometry and spatial location data for the input image as a whole and b) the second set of vectors from that represent one or more local features that indicate at least the geometry and spatial location data for a region from a plurality of regions for the input image:
“The aim of the deep metric learning is to find an embedding function f : X → Y which maps samples x from a data space X to a feature embedding space Y so that f(xi) and f(xj) are closer in some metric when xi and xj are semantically similar” (W. Kim, page 2, paragraph 2).
“Let f : X → Y be an isometric embedding function between metric spaces X and Y” (W. Kim, page 3, paragraph 4); “Our goal is to approximate f with a deep neural network” (W. Kim, page 3, paragraph 5)
“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... If we combine them into one function b(x) = g(s(x)), x ∈ X, the combined function is also an isometric embedding b : X → Y between metric spaces X and Y” (W. Kim, page 4, paragraph 4); “We are interested in another case where there are multiple embedding functions $b_m$ : X → Y with multiple $s_m$ (sets of vectors) and a single g as the following: [image: media_image5.png]” (W. Kim, page 4, paragraph 7).
“With the attention-based ensemble, union of metric spaces by multiple $s_m$ is mapped by a single embedding function g” (W. Kim, page 5, paragraph 1).
“[T]he combined embedding function $B_m(x)$ for the learner m is defined as the following: [image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing $s_m$ with [image: media_image7.png]. As discussed above, each [image: media_image7.png] produces a unique set of vectors for the mth attention pipeline. The embedding function g maps the vectors from each attention channel, each having a unique feature embedding, to the same embedding space.
“The loss for training aforementioned attention model is defined as: [equation image: media_image2.png] where {(x_i, c_i)} is a set of all training samples and labels, L_{metric,(m)}(·) is the loss for the isometric embedding for the m-th learner (loss)” (W. Kim, page 6, paragraph 2); “We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [equation image: media_image4.png]” (W. Kim, page 7, paragraph 1). The aforementioned model is a neural network serving as the embedding function. As stated in paragraph [0070] of the instant Specification, contrastive loss is a pairwise-based loss. Note that this function takes outputs of B_m, the embedding function, as inputs. That means to minimize this function and train the network, embedding function outputs must be optimized.
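For illustration only, the dependence of a pairwise-based loss on the embedding outputs can be sketched as below. The sketch assumes the commonly used textbook form of contrastive loss (squared distance for same-label pairs, squared hinge on a margin for different-label pairs); it is not a transcription of the equation image cited from W. Kim, and the names `contrastive_loss` and `margin` are hypothetical.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a: Sequence[float], emb_b: Sequence[float],
                     same_label: bool, margin: float = 1.0) -> float:
    """Textbook contrastive loss for a single pair of embeddings.

    Same-label pairs are penalized by their squared distance (pulled
    together); different-label pairs are penalized only while they are
    closer than `margin` (pushed apart). Both terms are functions of the
    embedding outputs alone.
    """
    d = euclidean(emb_a, emb_b)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Positive pair far apart: large loss.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=True))   # 25.0
# Negative pair already beyond the margin: no loss.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=False))  # 0.0
```

Because the only inputs are the two embeddings and a label, reducing this quantity requires changing what the embedding function outputs, which is the point made in the paragraph above.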

[figure: media_image8.png]
“Fig. 3. The implementation of attention-based ensemble (ABE-M) using GoogLeNet” (W. Kim, page 7, Fig. 3). This illustrates the system of the paper, including a visualization of the embedding function being modified through the loss function.
…wherein the combined feature set corresponds to a final feature vector that is a single vector formed by merging at least one vector from the first set of vectors with at least one vector from the second set of vectors
“Attended feature output of S(x)◦A_m(S(x)) is then fed into global feature embedding function G(·) to generate an embedding feature vector.” (W. Kim, page 5, paragraph 1)
“[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing s_m with [equation image: media_image7.png]. As discussed above, the embedding function g maps the vectors from each attention channel, merging multiple sets of vectors together as they’re mapped to the same embedding space.

[figure: media_image9.png] (W. Kim, page 7, Fig. 3). As seen in this diagram of the system, the outputs of the global feature embedding function across all attention channels can be considered a final feature vector that is a single vector of dimension 512 / M.
generating, using the combined feature set, a feature representation that is based on a final embedding output that is representative of content information, geometry information, and spatial information of the input image: 
“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... We are interested in another case where there are multiple embedding functions b_m : X → Y with multiple s_m and a single g (combined feature function) as the following: [equation image: media_image10.png]” (W. Kim, page 4, paragraphs 4-7). The outputs of g (feature representation) constitute a set of features from different embeddings (one per learner) mapped to the same space (final embedding output).
“[W]e design an architecture which has multiple attention modules for multiple learners. By attending to different locations for different learners, diverse feature embedding functions are trained” (W. Kim, page 2, paragraph 1); “For deep metric learning, ensemble concatenates the feature embeddings learned by multiple learners” (W. Kim, page 1, paragraph 2). Each learner represents different kinds of features from the inputs. The concatenated embedding is representative of this information as a whole.
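As a hedged illustration of the structure described in the quoted passages, the sketch below computes a per-learner embedding as G(S(x) ◦ A_m(S(x))) and concatenates the M results into one vector. The functions `S`, `G`, and the attention masks are toy stand-ins chosen only so the example runs; in W. Kim they are neural-network layers, and every name in the code is hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def hadamard(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Element-wise product (the "◦" operator in the quoted passage)."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return [x * y for x, y in zip(a, b)]


def learner_embedding(x: Vector,
                      S: Callable[[Vector], Vector],
                      A_m: Callable[[Vector], Vector],
                      G: Callable[[Vector], Vector]) -> Vector:
    """Embedding of one learner m: G(S(x) ◦ A_m(S(x)))."""
    shared = S(x)                              # shared feature extractor
    attended = hadamard(shared, A_m(shared))   # learner-specific attention
    return G(attended)                         # single shared embedding function


def ensemble_embedding(x: Vector,
                       S: Callable[[Vector], Vector],
                       attentions: Sequence[Callable[[Vector], Vector]],
                       G: Callable[[Vector], Vector]) -> Vector:
    """Concatenate the per-learner embeddings into one vector."""
    combined: Vector = []
    for A_m in attentions:
        combined.extend(learner_embedding(x, S, A_m, G))
    return combined


# Toy stand-ins (assumptions for illustration only).
S = lambda x: [2.0 * v for v in x]
G = lambda z: [sum(z), max(z)]
attentions = [
    lambda z: [1.0, 0.0, 0.0],   # learner 1 attends to the first position
    lambda z: [0.0, 1.0, 1.0],   # learner 2 attends to the remaining positions
]

print(ensemble_embedding([1.0, 2.0, 3.0], S, attentions, G))
# [2.0, 2.0, 10.0, 6.0]
```

Each learner contributes a different slice of the concatenated vector because its attention mask selects different parts of the shared features, while the single function G places every slice in the same embedding space.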
training, using the feature representation, a machine-learning model to output a prediction about an image based on inferences derived using the feature representation:
“The loss for training aforementioned attention model (machine-learning model) is defined as: [equation image: media_image2.png] where {(x_i, c_i)} is a set of all training samples and labels, L_{metric,(m)}(·) is the loss for the isometric embedding for the m-th learner” (W. Kim, page 6, paragraph 2); “We use contrastive loss as our distance metric loss function which is defined as the following: [equation image: media_image4.png] … A pair (Bp(xi), Bq(xi)) represents feature embeddings (feature representation[s]) of a single image embedded by two different learners. We call it self pair from now on while positive and negative pairs refer to pairs of feature embeddings with same labels and different labels, respectively” (W. Kim, page 7, paragraph 1). The model is trained using feature representations.
“During testing, we compute the feature embeddings (feature representations) for all the test images from our network. For every test image, we then retrieve top K similar images from the test set excluding test image itself ... We evaluate the model (machine-learning model) after every 1000 iteration and report the results for the iteration with highest Recall@1.” (W. Kim, page 8, paragraph 2). The trained model is used to make predictions about similar images.
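The retrieval evaluation described in the quoted passage can be sketched as follows. This is a minimal illustration that assumes Euclidean distance as the similarity measure and counts a query as a hit when any of its top K neighbors shares its label; the function name `recall_at_k` and the toy data are hypothetical and are not taken from W. Kim.

```python
import math
from typing import List, Sequence


def recall_at_k(embeddings: Sequence[Sequence[float]],
                labels: Sequence[int], k: int) -> float:
    """Fraction of queries with a same-label image among their top-k neighbors.

    Every image is used as a query against all other images in the set;
    the query itself is excluded from its own candidate list.
    """
    if not embeddings:
        raise ValueError("at least one embedding is required")
    hits = 0
    for i, query in enumerate(embeddings):
        others: List[int] = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: math.dist(query, embeddings[j]))
        if any(labels[j] == labels[i] for j in others[:k]):
            hits += 1
    return hits / len(embeddings)


# Two tight clusters on a line; labels agree with the clusters.
embeddings = [[0.0], [0.1], [5.0], [5.1]]
print(recall_at_k(embeddings, [0, 0, 1, 1], k=1))  # 1.0
# Labels that disagree with the clusters: nearest neighbor is always wrong.
print(recall_at_k(embeddings, [0, 1, 0, 1], k=1))  # 0.0
```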
W. Kim relates to deep metric learning for images and is analogous to the claimed invention.
While W. Kim fails to disclose the further limitations of the claim, Cao teaches a method, comprising:
maintaining an input dataset of a plurality of images for a training process: “Training details. We use the training set of the Google Landmarks dataset (GLD) [39], containing 1.2M images from 15k landmarks, and divide it into two subsets ‘train’/‘val’ with 80%/20% split. The ‘train’ split is used for the actual learning, and the ‘val’ split is used for validating the learned classifier as training progresses” (Cao, page 8, paragraph 2)
maintaining, for the input dataset, a plurality of features derived from data values of the input dataset: “We use the training set (input dataset) of the Google Landmarks dataset (GLD) [39], containing 1.2M images” (Cao, page 8, paragraph 2); “Given an image, we apply a convolutional neural network backbone to obtain two feature maps: [equation image: media_image11.png], representing shallower and deeper activations respectively” (Cao, page 4, paragraph 6)
maintaining, for an input image of the plurality of images in the input dataset, global features from the plurality of features and local features from the plurality of features: “Our proposed DELG (DEep Local and Global features) model (left) jointly extracts deep local and global features. Global features can be used in the first stage of a retrieval system, to efficiently select the most similar images (bottom). Local features can then be employed to re-rank top results (top-right), increasing precision of the system.” (Cao, page 2, Fig. 1). For global features to be used for similar image selection, and for local features to be used for ranking results, they each must be maintain[ed] in some capacity.
determining, for the input image from the input dataset
i) a first set of vectors that represent one or more global features that indicate at least data for the content of the input image as a whole: “A global feature … summarizes the contents of an image … Global features can learn similarity across very different poses where local features would not be able to find correspondences” (Cao, page 1, paragraph 2); “These two components produce a global feature [equation image: media_image12.png] that summarizes the discriminative contents of the whole image” (Cao, page 5, paragraph 2)
ii) a second set of vectors that represent one or more local features that indicate at least geometry and spatial location data for a region from a plurality of regions for the input image, at least some regions from the plurality of regions having different local features:
“The local descriptors (local features) are obtained as L = T(S), where L ∈ R^{H_S × W_S × C_T}” (Cao, page 5, paragraph 5)
“Local features [28,7,64,39,34], on the other hand, comprise descriptors and geometry information about specific image regions (plurality of regions); they are especially useful to match images depicting rigid objects” (Cao, page 1, paragraph 2);
“The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2). 

[figure: media_image13.png]
“Fig. 5: Examples of correct local feature matches, for image pairs depicting the same object/scene” (Cao, page 18, Figure 5). Here, each line corresponds to a local feature in some part of the image. Different image regions have different local features.
generating, using the combined feature set, a feature representation that is based on a final embedding output that is representative of content information, geometry information, and spatial information of the input image: “A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (content information) ... Local features, on the other hand, comprise descriptors and geometry information about specific image regions” (Cao, page 1, paragraph 2); “The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2).
storing the machine-learning model in memory for use by a system to detect objects with at least some features from the plurality of features
“Our proposed DELG (DEep Local and Global features) model (left) jointly extracts deep local and global features” (Cao, page 2, Figure 1)
“For optimal performance, image retrieval requires semantic understanding of the types of objects that a user may be interested in, such that the system can distinguish between relevant objects versus clutter/background” (Cao, page 4, paragraph 2)

[table image: media_image14.png]
“Table 6: Feature extraction latency and database memory requirements for different image retrieval models … (C) DELG and DELG? (the machine-learning model) are compared with different configurations. As a reference, we also provide numbers for DELF in the last rows” (Cao, page 13, Table 5)
	Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified W. Kim to obtain local and global features from the input images, as disclosed by Cao. Both types are necessary for high image retrieval performance (Cao, page 1, paragraph 2).
	While W. Kim and Cao fail to disclose the further limitations of the claim, S. Kim teaches a method of computing, using the input image, a proxy-based loss function and a pairwise-based loss function:
“We propose a novel metric learning loss that takes advantages of both pair-based and proxy-based methods” (S. Kim, page 2, Left column, paragraph 3).
“Specifically, for each proxy, the loss aims to pull data of the same class close to the proxy and to push others away in the embedding space” (S. Kim, page 2, Left column, paragraph 2).
“We evaluate our method with Inception-BN backbone while varying the sizes of input images: {224 x 224; 256 x 256; 324 x 324; 448 x 448}. Table 7 also shows that the accuracy improves consistently as the sizes of the input images increase” (S. Kim, page 11, left column, paragraph 3).
S. Kim relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim and Cao to use proxy-based and pairwise-based loss methods, as disclosed by S. Kim. Doing so would achieve state-of-the-art accuracy through quick convergence while still accounting for data-to-data relations, as well as build resistance to noisy labels and outliers. See S. Kim, page 2, paragraph 2.
While W. Kim, Cao, and S. Kim fail to disclose the further limitations of the claim, Wang teaches [a] non-transitory machine-readable storage device storing instructions that are executable by a processing device to cause performance of operations: “According to still another aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, which has computer instructions stored thereon, where execution of the computer-readable instructions by a processor causes the processor to implement the image processing method as described above” (Wang, [0007]).
Wang relates to deep metric learning for images and is analogous to the claimed invention. The combination of W. Kim, Cao, and S. Kim teaches a computational method of processing images with neural networks. The claimed invention improves upon this method by including computer hardware to store and execute its instructions. Wang teaches a computer apparatus that can execute computational methods, applicable to methods using neural networks for deep metric learning in images. A person of ordinary skill in the art would have recognized that running a computer method on computer hardware would lead to the predictable result of the method’s algorithm being performed by said computer, and would improve the method by allowing it to be used for practical purposes (MPEP 2143 I. (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results).

Regarding claim 20, the rejection of claim 19 in view of W. Kim, Cao, S. Kim, and Wang is incorporated. W. Kim further discloses a method of generating a first set of embeddings corresponding to the first set of vectors based on the proxy-based loss function and the pairwise-based loss function; and generating a second set of embeddings corresponding to the second set of vectors based on the proxy-based loss function and the pairwise-based loss function:
“[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image: media_image6.png] where ◦ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing s_m with [equation image: media_image7.png]. As discussed regarding claim 1, each [equation image: media_image7.png] produces a unique set of vectors for the mth attention pipeline.

[figure: media_image1.png] (W. Kim, page 2, Fig. 1(b)). This model supports at least three attention pipelines (three set[s] of embeddings corresponding to … set[s] of vectors).
“The loss for training aforementioned attention model is defined as: [equation image: media_image2.png] where {(x_i, c_i)} is a set of all training samples and labels, L_{metric,(m)}(·) is the loss for the isometric embedding for the m-th learner (loss)” (W. Kim, page 6, paragraph 2); “We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [equation image: media_image4.png]” (W. Kim, page 7, paragraph 1).
While W. Kim fails to disclose the further limitations of the claim, S. Kim further teaches a method of generating a … set of embeddings corresponding to the … set of vectors based on the proxy-based loss function and the pairwise-based loss function:
“The networks are trained to project data (set of vectors) onto an embedding space in which semantically similar data (e.g., images of the same class) are closely grouped together. Such a quality of the embedding space is given mainly by loss functions used for training the networks, and most of the losses are categorized into two classes: pair-based and proxy-based.” (S. Kim, page 1, left column, paragraph 1)
“We propose a novel metric learning loss that takes advantages of both pair-based and proxy-based methods” (S. Kim, page 2, Left column, paragraph 3); “Specifically, for each proxy, the loss aims to pull data of the same class close to the proxy and to push others away in the embedding space” (S. Kim, page 2, Left column, paragraph 2).
“Let x denote the embedding vector of the input, p+ be the positive proxy, and p− be a negative proxy. The loss is then given by [equation image: media_image15.png] where X is a batch of embedding vectors (set of embeddings)”
	S. Kim relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Wang to use proxy-based and pairwise-based loss methods for feature embeddings, as disclosed by S. Kim. Doing so would achieve state-of-the-art accuracy through quick convergence while still accounting for data-to-data relations, as well as build resistance to noisy labels and outliers. See S. Kim, page 2, paragraph 2.
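To make the "pull toward the positive proxy, push away from negative proxies" behavior concrete, the sketch below implements one commonly cited proxy-based formulation (a Proxy-NCA-style loss using squared Euclidean distance). It is offered under the assumption that this form is representative; it is not a transcription of the equation image cited from S. Kim, and the function names are hypothetical.

```python
import math
from typing import Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between an embedding and a proxy."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def proxy_nca_loss(x: Sequence[float],
                   pos_proxy: Sequence[float],
                   neg_proxies: Sequence[Sequence[float]]) -> float:
    """Proxy-NCA-style loss for one embedding vector x.

    Equal to -log( exp(-d(x, p+)) / sum_{p-} exp(-d(x, p-)) ), written in
    log space. The loss falls as x moves toward its positive proxy and
    away from the negative proxies.
    """
    if not neg_proxies:
        raise ValueError("at least one negative proxy is required")
    neg_term = sum(math.exp(-sq_dist(x, p)) for p in neg_proxies)
    return sq_dist(x, pos_proxy) + math.log(neg_term)


pos, negs = [0.0, 0.0], [[2.0, 0.0]]
near = proxy_nca_loss([0.0, 0.0], pos, negs)  # x sits on its positive proxy
far = proxy_nca_loss([1.0, 0.0], pos, negs)   # x is halfway to the negative proxy
print(near < far)  # True
```

Unlike a pairwise loss, each embedding here is compared only against a small set of proxies rather than against other samples in the batch, which is the distinction between the two loss families drawn in the quoted passages.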

Claims 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Wonsik Kim et al. (Attention-based Ensemble for Deep Metric Learning, 2018, arXiv:1804.00382v2), hereafter referred to as W. Kim, in view of Cao et al. (Unifying Deep Local and Global Features for Image Search, Sep. 2020, arXiv:2001.05027v4), hereafter referred to as Cao, and further in view of Sungyeon Kim et al. (Proxy Anchor Loss for Deep Metric Learning, Mar. 2020, arXiv:2003.13911v1), hereafter referred to as S. Kim, Wang et al. (US 2019/0311223 A1, Image Processing Methods and Apparatus, and Electronic Devices), and Ng et al. (SOLAR: Second-Order Loss and Attention for Image Retrieval, Aug. 2020, arXiv:2001.08972v5).
	
Regarding claim 14, the rejection of claim 10 in view of W. Kim, Cao, S. Kim, and Wang is incorporated. Ng teaches a method, wherein determining the first set of vectors that represent one or more global features comprises: generating an enhanced set of global features in response to processing the global features by a first second-order attention block; and determining the first set of vectors from the enhanced set of global features:
“In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global (global features)” (Ng, page 1, Abstract). The following method is applicable to global features.
“From an input image [equation image: media_image17.png] processed through a Fully-Convolutional Network (FCN) denoted by θ, we obtain a feature map [equation image: media_image18.png] (features) where h, w and d are height, width and feature dimensionality, respectively” (Ng, page 4, paragraph 2).
“Finally, f^{so} map (enhanced set of ... features) is obtained from the first-order features f by the second-order attention [equation image: media_image19.png] where ψ is another 1 x 1 convolution to control the influence of the attention. Thus, a new feature f^{so}_{i,j} in the second-order map f^{so} (reshaped to h x w x d), is a function of features from all locations in f ... This is referred to as the Second-Order Attention (SOA) block (second-order attention block)” (Ng, page 5, paragraph 1). f^{so} maps a 2D image to a set of 3D feature values, the set of which can be considered a set of vectors.
	Ng relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, and Wang to enhance features with a second-order attention block, as disclosed by Ng. Second-order attention allows the learning of optimal relative contributions for individual locations in an image (Ng, page 4, paragraph 5), and has been shown to improve patch descriptors for image matching (Ng, page 1, paragraph 1).
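For illustration only, the second-order attention block described in the quoted passages can be sketched as follows. This is a minimal sketch and not Ng's implementation: the equation itself appears only as an image in the record, so the residual form (first-order feature plus a $\psi$-projected attention output) and the query/key/value projections `Wq`, `Wk`, `Wv` are assumptions borrowed from standard non-local attention. All function and parameter names are hypothetical.

```python
import math

def matvec(W, x):
    """Apply a matrix to one feature vector (a 1 x 1 convolution at one location)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def second_order_attention(f, Wq, Wk, Wv, Wpsi):
    """f is the feature map flattened to N = h*w vectors of length d.

    Returns a map of the same shape in which every output vector is a
    function of the features at all locations (assumed residual form).
    """
    q = [matvec(Wq, x) for x in f]
    k = [matvec(Wk, x) for x in f]
    v = [matvec(Wv, x) for x in f]
    f_so = []
    for i, x in enumerate(f):
        # Correlation of location i with every location j, normalized by softmax.
        z = softmax([sum(a * b for a, b in zip(q[i], kj)) for kj in k])
        # Attention-weighted sum of the value vectors over all locations.
        agg = [sum(z[j] * v[j][c] for j in range(len(f))) for c in range(len(v[0]))]
        # Wpsi plays the role of psi: it scales the influence of the attention.
        delta = matvec(Wpsi, agg)
        f_so.append([xi + di for xi, di in zip(x, delta)])
    return f_so
```

Under these assumptions, setting `Wpsi` to all zeros returns the first-order features unchanged, which is one way to read the statement that $\psi$ controls the influence of the attention.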

	Regarding claim 15, the rejection of claim 14 in view of W. Kim, Cao, S. Kim, Wang, and Ng is incorporated. Cao further teaches a method, wherein: the enhanced set of global features comprises second-order information from spatial locations in high-level descriptors of the input image: “A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (high-level descriptors)” (Cao, page 1, paragraph 2).
Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, Wang, and Ng to use global features associated with high-level descriptors, as disclosed by Cao. These features have high recall, can learn similarity across very different poses, and excel at high image retrieval performance with compact representations. See (Cao, page 1, paragraph 2) and (Cao, page 3, paragraph 3).
	While Cao fails to disclose the further limitations of the claim, Ng further teaches a method, wherein: the enhanced set of global features comprises second-order information from spatial locations in high-level descriptors of the input image: “In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global” (Ng, page 1, Abstract); “From an input image [equation image: media_image20.png] processed through a Fully-Convolutional Network (FCN) denoted by $\theta$, we obtain a feature map [equation image: media_image21.png]” (Ng, page 4, paragraph 2); “Therefore we propose to generate a map $f^{so}$ (enhanced set of ... features) with local features $f^{so}_{i,j}$ that reflect the correlations between all spatial locations from within $f^{so}$, hence the `second-order'” (Ng, page 4, paragraph 5). As evident from the abstract, this is applicable to global and local features f. Note that global or local features are extracted from images as feature map f. The local features encoding second-order information in $f^{so}$ differ from these extracted features f, and can be applied to either global or local features.
	Ng relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, Wang, and Ng to use enhanced features comprising second-order information from spatial locations, as disclosed by Ng. Features with second-order information allow the learning of optimal relative contributions for individual locations in an image (Ng, page 4, paragraph 5), and have been shown to improve patch descriptors for image matching.

	Regarding claim 16, the rejection of claim 15 in view of W. Kim, Cao, S. Kim, Wang, and Ng is incorporated. Ng further teaches a method, wherein determining the second set of vectors that represent one or more local features comprises: generating an enhanced set of local features in response to processing the local features by a second second-order attention block; and determining the second set of vectors from the enhanced set of local features: 
“In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local (local features) and global” (Ng, page 1, Abstract). The following method is applicable to local features.
“From an input image [equation image: media_image17.png] processed through a Fully-Convolutional Network (FCN) denoted by $\theta$, we obtain a feature map [equation image: media_image18.png] (features) where h, w and d are height, width and feature dimensionality, respectively” (Ng, page 4, paragraph 2).
“Finally, $f^{so}$ map (enhanced set of ... features) is obtained from the first-order features f by the second-order attention [equation image: media_image19.png] where $\psi$ is another 1 x 1 convolution to control the influence of the attention. Thus, a new feature $f^{so}_{i,j}$ in the second-order map $f^{so}$ (reshaped to h x w x d), is a function of features from all locations in f ... This is referred to as the Second-Order Attention (SOA) block (second-order attention block)” (Ng, page 5, paragraph 1). $f^{so}$ maps a 2D image to a set of 3D feature values, the set of which can be considered a set of vectors.
Ng relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, Wang, and Ng to enhance features with a second-order attention block, as disclosed by Ng. Second-order attention allows the learning of optimal relative contributions for individual locations in an image (Ng, page 4, paragraph 5), and has been shown to improve patch descriptors for image matching.

Regarding claim 17, the rejection of claim 16 in view of W. Kim, Cao, S. Kim, Wang, and Ng is incorporated. Cao further teaches a method, wherein: the enhanced set of local features comprises second-order information from spatial locations in local-level descriptors of the input image: “Local features, on the other hand, comprise descriptors and geometry information about specific image regions (local-level descriptors)” (Cao, page 1, paragraph 2).
Cao relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, Wang, and Ng to use local features associated with local-level descriptors, as disclosed by Cao. These features have high precision, reliably learn image similarity, and can perform spatial matching to produce reliable and interpretable scores. See (Cao, page 1, paragraph 2) and (Cao, page 2, paragraph 3).
Ng further teaches a method, wherein: the enhanced set of local features comprises second-order information from spatial locations in low-level descriptors of the input image: “In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global” (Ng, page 1, Abstract); “From an input image [equation image: media_image20.png] processed through a Fully-Convolutional Network (FCN) denoted by $\theta$, we obtain a feature map [equation image: media_image21.png]” (Ng, page 4, paragraph 2); “Therefore we propose to generate a map $f^{so}$ (enhanced set of ... features) with local features $f^{so}_{i,j}$ that reflect the correlations between all spatial locations from within $f^{so}$, hence the `second-order'” (Ng, page 4, paragraph 5). As evident from the abstract, this is applicable to global and local features f. Note that global or local features are extracted from images as feature map f. The local features encoding second-order information in $f^{so}$ differ from these extracted features f, and can be applied to either global or local features.
Ng relates to deep metric learning for images and is analogous to the claimed invention. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, S. Kim, Wang, and Ng to use enhanced features comprising second-order information from spatial locations, as disclosed by Ng. Features with second-order information allow the learning of optimal relative contributions for individual locations in an image (Ng, page 4, paragraph 5), and have been shown to improve patch descriptors for image matching.

Response to Arguments
Objections
	Previous objections to the specification have been withdrawn in light of the instant amendments.

	Previous objections to the abstract have been withdrawn in light of the instant amendments.

	Previous objections to the claims have been withdrawn in light of the instant amendments.

112 Rejections
	In light of the instant amendments, previous rejections under 35 U.S.C. 112 have been withdrawn.

101 Rejections
On pages 11-13 of the instant remarks, the Applicant argues that in light of recent statements by USPTO Director Squires, the claimed invention improves on existing technology and thus 101 rejections should be withdrawn:
“In Desjardins, the Appeals Review Panel indicated that the claimed solutions were patent
eligible given various improvements cited by the applicant from the specification and that,
"[w]hen evaluating the claim as a whole," were reflected in the claims. Id at p. 8-9.

The claimed solutions provide analogous technical advantages. For example, the
specification describes how the claimed features can provide an improvement in the functioning
of a computer:
"The techniques in this document can be used to obtain accurate data models that
are optimized for certain image processing tasks, but that requires a shorter duration to be
fully trained relative to prior training approaches. Using these data models, the disclosed
techniques can allow for improvements in processing outcomes for verification and
identification tasks as well as fast and accurate similarity searching across content that
spans multiple images." Specification at paragraph 10.
These advantages are reflected at least in the claim language of: "computing ... a
combined feature set using data from both a) the first set of vectors that represent one or more
global features that indicate at least the geometry and spatial location data for the input image as
a whole and b) the second set of vectors [[from]] that represent one or more local features that
indicate at least the geometry and spatial location data for a region from a plurality of regions for
the input image ... ; generating, using the combined feature set, a feature representation that is
based on a final embedding output that is representative of content information, geometry
information, and spatial information of the input image; [and] training, using the feature
representation, a machine-learning model to output a prediction about an image based on
inferences derived using the feature representation."

For at least the above reasons, Applicant respectfully requests withdrawal of the
rejection.”
	The Applicant’s arguments above are persuasive. Accordingly, rejections under 35 U.S.C. 101 have been withdrawn.

103 Rejections
On pages 13-14 of the instant remarks, the Applicant argues that the relied-upon prior art does not disclose a "final feature vector that is a single vector formed by merging at least one vector from the first set of vectors with at least one vector from the second set of vectors":
“W. Kim does not disclose or suggest the claimed "final feature vector that is a single vector
formed by merging at least one vector from the first set of vectors with at least one vector from
the second set of vectors."

Amended claim 1 recites:
"computing, using the proxy-based loss function and the pairwise-based loss
function, a combined feature set ... , wherein the combined feature set corresponds to a
final feature vector that is a single vector formed by merging at least one vector from the
first set of vectors with at least one vector from the second set of vectors"

The Office asserted that W. Kim allegedly discloses these features. Action at pp. 24-30.

Applicant respectfully disagrees.

The Office asserts that W. Kim discloses a single component s which computes a single
set of extracted features. Action at p. 90. W. Kim describes passing the single set of extracted
features computed by s to "multiple embedding functions bm." W. Kim, page 4 paragraph 5; see
also equation (4) and Fig. 1b of W. Kim. The multiple embedding functions correspond to
"multiple attention modules for multiple learners." W. Kim, page 2 paragraph 1; see also Fig. 1b
and caption: "each learner learns different attention modules (A1, A2, A3)". W. Kim describes
that this results in the generation of different "feature embeddings from [the] different learners"
W. Kim, page 2, paragraph 1.

However, this disclosure of W. Kim does not describe "merging at least one vector from
the first set of vectors with at least one vector from the second set of vectors", as recited in
amended claim 1. W. Kim describes that "a point in X [e.g., the output computed by s] can be
embedded into multiple points [e.g., feature embeddings] in Y by multiple learners." W. Kim,
page 4, paragraph 7. Rather than combining the feature embeddings, W. Kim describes that
"divergence loss pulls apart the feature embeddings of different learners using the same input."
W. Kim, page 5, Fig. 2 caption. Merely mapping different feature embeddings to the same
embedding space Y does not include, or even suggest, "merging at least one vector from the first
set of vectors with at least one vector from the second set of vectors" as recited in claim 1.

For at least the above reasons, Applicant respectfully requests withdrawal of the
rejection.”
	Regarding the argument that W. Kim fails to disclose this limitation, the Examiner respectfully disagrees. W. Kim discloses a system that generates several sets of vectors, one for each unique attention mechanism / learner, each in a unique embedding space (W. Kim, page 4, paragraph 7 & page 5, paragraph 3). Each of these sets is passed through a single global feature embedding function, i.e., the multiple sets of vectors are merged together into a unified embedding space. W. Kim makes it explicitly clear that the output of this global feature embedding function can be considered a single unified vector (W. Kim, page 7, Fig. 3). W. Kim’s disclosure is commensurate in scope with that of the claim language. See the 103 rejections section for more detail.
	Regarding the argument that W. Kim is pushing vectors apart rather than merging them together, the Examiner respectfully disagrees. The divergence loss used by W. Kim differentiates the embedding spaces of the different attention mechanisms, where vectors are embedded before being unified by the global feature embedding function (W. Kim, page 5, paragraph 4 & page 6, paragraph 3). The Examiner is not arguing that the attention mechanisms merge vectors together; rather, the global feature embedding function that receives the attention mechanism outputs merges the vectors together, as argued above.
	Thus, no rejections are withdrawn on these grounds.
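As a point of reference for the divergence loss discussed above, the following is a minimal sketch of a loss that pulls apart the embeddings that different learners produce for the same input. The hinge form, the squared Euclidean distance, and the `margin` parameter are assumptions made for illustration. They are not quoted from W. Kim, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def divergence_loss(embeddings, margin=1.0):
    """embeddings holds one vector per learner, all computed from the same input.

    Each pair of learners is penalized only while their embeddings are
    closer than the margin, so minimizing the loss pushes them apart.
    """
    loss = 0.0
    for p in range(len(embeddings)):
        for q in range(p + 1, len(embeddings)):
            loss += max(0.0, margin - squared_distance(embeddings[p], embeddings[q]))
    return loss
```

A loss of this kind acts on the per-learner embeddings only. Whether and how those embeddings are later merged is a separate step, which is the point in dispute between the Applicant and the Examiner.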

On pages 14-15 of the instant remarks, the Applicant argues that the relied upon prior art doesn’t disclose "feature representation that is based on a final embedding output that is representative of content information, geometry information, and spatial information of the input image":
“The combination of W. Kim with Cao does not disclose or suggest the claimed "feature
representation that is based on a final embedding output that is representative of content
information, geometry information, and spatial information of the input image".

Amended claim 1 recites:
"generating, using the combined feature set, a feature representation that is based
on a final embedding output that is representative of content information, geometry
information, and spatial information of the input image"

The Office asserted that W. Kim allegedly discloses these features. Action at pp. 30-31.
Applicant respectfully disagrees.

The cited portions of W. Kim describe "generat[ing] an embedding feature vector." W.
Kim at p. 6. However, the embedding feature vector of W. Kim does not describe "generating ... a
feature representation that is based on a final embedding output that is representative of content
information, geometry information, and spatial information of the input image," as recited in
amended claim 1.

The Office asserts that Cao describes "global features ... representative of at least
geometry and spatial location data for the image as a whole" and "local features ... representative
of at least geometry and spatial location data for a region of a plurality of regions in the image."
Action at p.90. However, Cao does not describe that geometry and spatial information is
represented in a final embedding output of an input image, as recited in amended claim 1.

Cao describes using both global and local features "for high image retrieval performance"
(Cao, page 1 paragraph 2). Cao describes "produc[ing] ... a global feature" (Cao, page 5
paragraph 2) and "extract[ing] ... local features" (Cao, page 5 paragraph 3) to perform "an
instance-level recognition problem" of distinguishing landmarks in images (Cao, page 6 Fig. 2
caption). However, the cited portions of Cao do not describe "a final embedding output that is
representative of content information, geometry information, and spatial information of the input
image", as recited in amended claim 1. Merely using global and local features to perform image
retrieval or image recognition does not amount to generating a final embedding output that is
representative of content information, geometry information, and spatial information of an input
image.

Therefore, the cited portions of W. Kim and Cao do not disclose at least the above-cited
portions of amended claim 1. The cited portions of S. Kim do not cure the deficiencies of any
combination of W. Kim and Cao; and in fact, the Office Action does not assert that S. Kim
discloses the above-cited features.”
In response to the Applicant's argument that Cao fails to disclose limitations of amended claim 1, the Examiner notes that one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Cao discloses extracting local and global features representative of content information, geometry information, and spatial information of an input image (Cao, page 1, paragraph 2 & page 3, paragraph 2).
While Cao fails to disclose generating a feature representation based on a final embedding representative of this information, this deficiency is remedied by W. Kim, which discloses generating a unified final embedding output representative of information from its constituent vector set inputs (W. Kim, page 4, paragraphs 4-7 & page 2, paragraph 1 & page 1, paragraph 2). Were the extracted local and global features of Cao to be used as inputs for W. Kim’s system, the final feature representation generated by W. Kim would be representative of content, geometry, and spatial information of the input image. The combination of W. Kim and Cao would have suggested as much to one of ordinary skill in the art, as both local and global features are required for high performance in image retrieval tasks (Cao, page 1, paragraph 2), such as the image retrieval testing performed by W. Kim (W. Kim, page 8, paragraph 2).
Thus, the combination of W. Kim and Cao discloses this limitation, and no rejections are withdrawn on these grounds. See the 103 rejections section for more detail.
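To make the disputed limitation concrete, the sketch below shows one way a single final vector could be formed from a global feature vector and a set of local feature vectors. It is an illustration under stated assumptions only: average pooling of the local vectors, concatenation, and L2 normalization are choices made here and are not taken from the claims, W. Kim, or Cao. All names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def merge_features(global_vec, local_vecs):
    """Form one vector from a global descriptor and several local descriptors.

    The local descriptors are average-pooled, both parts are normalized so
    neither dominates, and the concatenation is normalized again.
    """
    d = len(local_vecs[0])
    pooled = [sum(v[c] for v in local_vecs) / len(local_vecs) for c in range(d)]
    return l2_normalize(l2_normalize(global_vec) + l2_normalize(pooled))
```

For example, `merge_features([3.0, 4.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])` yields a single unit-length vector of length 5 that carries information from both inputs.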

On page 16 of the instant remarks, the Applicant argues that other independent claims and dependent claims are allowable in view of claim 1:
“Independent claims 10, and 19, although different in scope from independent claim 1 and
each other, are allowable for at least the same reasons as independent claim 1. The dependent
claims are allowable for at least the same reasons as their respective independent claims.”
	As argued above and in the 103 rejections section, amended claim 1 is rejected in view of W. Kim, Cao, and S. Kim. Independent claims 10 and 19 are substantially similar, and are rejected under this same rationale in addition to Wang disclosing basic computer hardware (Wang, [0007]). No rejections of dependent claims are withdrawn on these grounds.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
(Dai, Second-order Attention Network for Single Image Super-Resolution, 2019, IEEE) teaches the use of second-order attention blocks to modify feature vectors of an image.
(Xia, Second-order Non-local Attention Networks for Person Re-identification, 2019, arXiv:1909.00295v1) teaches using multiple losses and attention blocks to ultimately concatenate two sets of feature embeddings.

While not considered prior art, a scientific paper published by the applicants after the filing of the instant application is also considered pertinent to the applicant’s disclosure: (Ebrahimpour, Multi-Head Deep Metric Learning Using Global and Local Representations, 2022, Computer Vision Foundation).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Aaron P Gormley whose telephone number is (571)272-1372. The examiner can normally be reached Monday - Friday 12:00 PM - 8:00 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T Bechtold can be reached on (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AG/Examiner, Art Unit 2148                                                                                                                                                                                         
/MICHELLE T BECHTOLD/Supervisory Patent Examiner, Art Unit 2148
Prosecution Timeline

Nov 16, 2021: Application Filed
Apr 29, 2025: Non-Final Rejection mailed (§101, §103)
Jul 23, 2025: Response Filed
Aug 19, 2025: Final Rejection mailed (§101, §103)
Oct 07, 2025: Response after Non-Final Action
Nov 18, 2025: Request for Continued Examination
Nov 28, 2025: Response after Non-Final Action
Dec 17, 2025: Non-Final Rejection mailed (§101, §103; current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/944,595
Patent 12613937
IDENTITY RECOGNITION METHOD AND IDENTITY RECOGNITION SYSTEM
3y 7m to grant Granted Apr 28, 2026
17/537,475
Patent 12585955
Minimal Trust Data Sharing
4y 3m to grant Granted Mar 24, 2026
17/524,338
Patent 12579440
Training Artificial Neural Networks Using Context-Dependent Gating with Weight Stabilization
4y 4m to grant Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 3 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 50%
With Interview (-60.0%): -10%
Median Time to Grant: 3y 11m (~0m remaining)
PTA Risk: High
Based on 6 resolved cases by this examiner. Grant probability derived from career allowance rate.