Prosecution Insights
Last updated: April 19, 2026
Application No. 17/527,917

MULTI-HEAD DEEP METRIC MACHINE-LEARNING ARCHITECTURE

Status: Non-Final OA, §103
Filed: Nov 16, 2021
Examiner: GORMLEY, AARON PATRICK
Art Unit: 2148
Tech Center: 2100 — Computer Architecture & Software
Assignee: Objectvideo Labs LLC
OA Round: 3 (Non-Final)
Grant Probability: 60% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 4y 4m
With Interview: 0%

Examiner Intelligence

Career Allow Rate: 60% (grants 60% of resolved cases; 3 granted / 5 resolved; +5.0% vs TC avg)
Interview Lift: -60.0% across resolved cases with interview
Avg Prosecution: 4y 4m typical timeline; 30 currently pending
Total Applications: 35 across all art units (career history)

Statute-Specific Performance

§101: 30.2% (-9.8% vs TC avg)
§103: 36.0% (-4.0% vs TC avg)
§102: 8.4% (-31.6% vs TC avg)
§112: 21.5% (-18.5% vs TC avg)
Tech Center average used as the baseline estimate • Based on career data from 5 resolved cases
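The deltas above are internally consistent: subtracting each reported delta from the examiner's rate recovers the same Tech Center baseline for every statute. A minimal sketch, using only the numbers shown on this page (nothing here comes from USPTO data directly):

```python
# Statute-specific allowance rates for this examiner, paired with the
# reported delta versus the Tech Center (TC) 2100 average, as shown above.
examiner_stats = {
    "§101": {"rate": 30.2, "delta_vs_tc": -9.8},
    "§103": {"rate": 36.0, "delta_vs_tc": -4.0},
    "§102": {"rate": 8.4,  "delta_vs_tc": -31.6},
    "§112": {"rate": 21.5, "delta_vs_tc": -18.5},
}

def implied_tc_average(rate: float, delta_vs_tc: float) -> float:
    """The delta is examiner rate minus TC average, so TC avg = rate - delta."""
    return round(rate - delta_vs_tc, 1)

for statute, s in examiner_stats.items():
    print(statute, implied_tc_average(s["rate"], s["delta_vs_tc"]))
```

Every row implies the same ~40.0% Tech Center baseline, which is why a single average line suffices on the chart.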

Office Action

§103
DETAILED ACTION

This action is in response to the application filed 11/16/2021. Claims 1-17 and 19-21 are pending and have been examined.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/18/2025 has been entered.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors.
In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claim(s) 1-4, 9, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Wonsik Kim et al. (Attention-based Ensemble for Deep Metric Learning, 2018, arXiv:1804.00382v2), hereafter referred to as W. Kim, in view of Cao et al. (Unifying Deep Local and Global Features for Image Search, Sep. 2020, arXiv:2001.05027v4), hereafter referred to as Cao, and further in view of Sungyeon Kim et al. (Proxy Anchor Loss for Deep Metric Learning, Mar. 2020, arXiv:2003.13911v1), hereafter referred to as S. Kim.

Regarding claim 1, W. Kim teaches [a] method implemented using a machine-learning architecture, the method comprising: determining, for the input image from the input dataset, i) a first set of vectors that represent one or more global features that indicate at least data for the input image as a whole and ii) a second set of vectors that represent one or more local features that indicate at least geometry and spatial location data for a region from a plurality of regions for the input image, at least some regions from the plurality of regions having different local features:

“In deep metric learning, feature embedding function is modeled as a deep neural network. This feature embedding function embeds input images into feature embedding space with a certain desired condition” (W. Kim, page 1, paragraph 1).
“We call S(·) a spatial feature extractor and G(·) a global feature embedding function” (W. Kim, page 5, paragraph 3). S extracts features from images. “A_m'(·) consists of a convolution layer of 480 kernels of size 1×1 to match the output of S(·) for the element-wise product” (W. Kim, page 6, paragraph 4). The output of S is a vector. In this case, S outputs a 480-dimensional vector. “Note that, same feature extraction module is shared across different learners while individual learners have their own attention module A_m(·). The attention function A_m(S(x)) outputs an attention mask with same size as output of S(x). Attended feature output of S(x)∘A_m(S(x)) (set of vectors that represent one or more features) is then fed into global feature embedding function G(·) to generate an embedding feature vector.” (W. Kim, page 5, paragraph 4). Each learner m generates a unique set of vectors by running the extracted features S through its own attention mechanism.

[figure image] “In attention-based ensemble, single feature embedding function (G) is trained while each learner learns different attention modules (A1, A2, A3)” (W. Kim, page 2, Fig. 1). In this example, there are three learners (m = 3) producing three sets of vectors. If S is producing global and local features (see mapping of Cao regarding claim 1 below), this results in three unique sets of vectors, each representing global and local features.

computing, using the input image, a proxy-based loss function and pairwise-based loss function:

“The loss for training aforementioned attention model is defined as: [equation image] where {(x_i, c_i)} is a set of all training samples and labels, L_metric,(m)(·) is the loss for the isometric embedding for the m-th learner (loss) … divergence loss Ldiv is defined as the following: [equation image]” (W. Kim, page 6, paragraph 2). “The divergence loss encourages each learner to attend to the different part of the input image by increasing the distance between the points embedded by the input image” (W. Kim, page 6, paragraph 3). “We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [equation image]” (W. Kim, page 7, paragraph 1). As stated in paragraph [0070] of the instant Specification, contrastive loss is a pairwise-based loss.

computing, using the proxy-based loss function and the pairwise-based loss function, a combined feature set using data from both a) the first set of vectors that represent one or more global features that indicate at least the geometry and spatial location data for the input image as a whole and b) the second set of vectors that represent one or more local features that indicate at least the geometry and spatial location data for a region from a plurality of regions for the input image:

“The aim of the deep metric learning is to find an embedding function f : X → Y which maps samples x from a data space X to a feature embedding space Y so that f(xi) and f(xj) are closer in some metric when xi and xj are semantically similar” (W. Kim, page 2, paragraph 2). “Let f : X → Y be an isometric embedding function between metric spaces X and Y” (W. Kim, page 3, paragraph 4); “Our goal is to approximate f with a deep neural network” (W. Kim, page 3, paragraph 5). “In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... If we combine them into one function b(x) = g(s(x)), x ∈ X, the combined function is also an isometric embedding b : X → Y between metric spaces X and Y” (W. Kim, page 4, paragraph 4). “We are interested in another case where there are multiple embedding functions b_m : X → Y with multiple s_m (sets of vectors) and a single g as the following: [equation image]” (W. Kim, page 4, paragraph 7). “With the attention-based ensemble, union of metric spaces by multiple s_m is mapped by a single embedding function g” (W. Kim, page 5, paragraph 1). “[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image] where ∘ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing s_m with [expression image]. As discussed above, each [expression image] produces a unique set of vectors for the mth attention pipeline. The embedding function g maps the vectors from each attention channel, each having a unique feature embedding, to the same embedding space. “The loss for training aforementioned attention model is defined as: [equation image] where {(x_i, c_i)} is a set of all training samples and labels, L_metric,(m)(·) is the loss for the isometric embedding for the m-th learner (loss)” (W. Kim, page 6, paragraph 2); “We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [equation image]” (W. Kim, page 7, paragraph 1). The aforementioned model is a neural network serving as the embedding function. As stated in paragraph [0070] of the instant Specification, contrastive loss is a pairwise-based loss. Note that this function takes outputs of B_m, the embedding function, as inputs. That means to minimize this function and train the network, embedding function outputs must be optimized.

[figure image] “Fig. 3. The implementation of attention-based ensemble (ABE-M) using GoogLeNet” (W. Kim, page 7, Fig. 3). This illustrates the system of the paper, including a visualization of the embedding function being modified through the loss function.

…wherein the combined feature set corresponds to a final feature vector that is a single vector formed by merging at least one vector from the first set of vectors with at least one vector from the second set of vectors:

“Attended feature output of S(x)∘A_m(S(x)) is then fed into global feature embedding function G(·) to generate an embedding feature vector.” (W. Kim, page 5, paragraph 1). “[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image] where ∘ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing s_m with [expression image]. As discussed above, the embedding function g maps the vectors from each attention channel, merging multiple sets of vectors together as they’re mapped to the same embedding space. [figure image] (W. Kim, page 7, Fig. 3). As seen in this diagram of the system, the outputs of the global feature embedding function across all attention channels can be considered a final feature vector that is a single vector of dimension 512 / M.

generating, using the combined feature set, a feature representation that is based on a final embedding output that is representative of content information, geometry information, and spatial information of the input image:

“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... We are interested in another case where there are multiple embedding functions b_m : X → Y with multiple s_m and a single g (combined feature function) as the following: [equation image]” (W. Kim, page 4, paragraphs 4-7). The outputs of g (feature representation) constitute a set of features from different embeddings (one per learner) mapped to the same space (final embedding output). “[W]e design an architecture which has multiple attention modules for multiple learners. By attending to different locations for different learners, diverse feature embedding functions are trained” (W. Kim, page 2, paragraph 1); “For deep metric learning, ensemble concatenates the feature embeddings learned by multiple learners” (W. Kim, page 1, paragraph 2). Each learner represents different kinds of features from the inputs. The concatenated embedding is representative of this information as a whole.

training, using the feature representation, a machine-learning model to output a prediction about an image based on inferences derived using the feature representation:

“The loss for training aforementioned attention model (machine-learning model) is defined as: [equation image] where {(x_i, c_i)} is a set of all training samples and labels, L_metric,(m)(·) is the loss for the isometric embedding for the m-th learner” (W. Kim, page 6, paragraph 2); “We use contrastive loss as our distance metric loss function which is defined as the following: [equation image] … A pair (Bp(xi), Bq(xi)) represents feature embeddings (feature representation[s]) of a single image embedded by two different learners. We call it self pair from now on while positive and negative pairs refer to pairs of feature embeddings with same labels and different labels, respectively” (W. Kim, page 7, paragraph 1). The model is trained using feature representations.
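The combined embedding B_m(x) = G(S(x)∘A_m(S(x))) and the contrastive loss quoted above can be sketched in a few lines. This is a toy illustration with made-up layer shapes, random weights, and three learners; it is not W. Kim's GoogLeNet implementation:

```python
import numpy as np

# Toy stand-ins for W. Kim's modules: S is the shared spatial feature
# extractor, A_m the per-learner attention mask, and G the shared global
# embedding function. B_m(x) = G(S(x) * A_m(S(x))), element-wise product.
rng = np.random.default_rng(0)
W_s = rng.standard_normal((8, 16))                        # stand-in for S(.)
W_a = [rng.standard_normal((16, 16)) for _ in range(3)]   # one A_m per learner
W_g = rng.standard_normal((16, 4))                        # shared G(.)

def S(x):
    return np.tanh(x @ W_s)

def A(m, s):
    # Sigmoid so the mask acts as soft attention over feature channels.
    return 1.0 / (1.0 + np.exp(-(s @ W_a[m])))

def B(m, x):
    s = S(x)
    return (s * A(m, s)) @ W_g   # attended features fed into shared embedding

def contrastive_loss(e1, e2, same_label, margin=1.0):
    """Pairwise-based loss: pull same-label pairs together, push others apart."""
    d = np.linalg.norm(e1 - e2)
    return d**2 if same_label else max(0.0, margin - d)**2

x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
emb = np.concatenate([B(m, x1) for m in range(3)])   # ensemble concatenation
loss = contrastive_loss(B(0, x1), B(0, x2), same_label=False)
print(emb.shape, loss)
```

The concatenation in the last lines mirrors the "single vector of dimension 512 / M" point: each learner contributes one slice of the final feature vector.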
“During testing, we compute the feature embeddings (feature representations) for all the test images from our network. For every test image, we then retrieve top K similar images from the test set excluding test image itself ... We evaluate the model (machine-learning model) after every 1000 iteration and report the results for the iteration with highest Recall@1.” (W. Kim, page 8, paragraph 2). The trained model is used to make predictions about similar images. W. Kim relates to deep metric learning for images and is analogous to the claimed invention.

While W. Kim fails to disclose the further limitations of the claim, Cao teaches a method, comprising: maintaining an input dataset of a plurality of images for a training process:

“Training details. We use the training set of the Google Landmarks dataset (GLD) [39], containing 1.2M images from 15k landmarks, and divide it into two subsets ‘train’/‘val’ with 80%/20% split. The ‘train’ split is used for the actual learning, and the ‘val’ split is used for validating the learned classifier as training progresses” (Cao, page 8, paragraph 2).

maintaining, for the input dataset, a plurality of features derived from data values of the input dataset:

“We use the training set (input dataset) of the Google Landmarks dataset (GLD) [39], containing 1.2M images” (Cao, page 8, paragraph 2); “Given an image, we apply a convolutional neural network backbone to obtain two feature maps: [equation image], representing shallower and deeper activations respectively” (Cao, page 4, paragraph 6).

maintaining, for an input image of the plurality of images in the input dataset, global features from the plurality of features and local features from the plurality of features:

“Our proposed DELG (DEep Local and Global features) model (left) jointly extracts deep local and global features. Global features can be used in the first stage of a retrieval system, to efficiently select the most similar images (bottom). Local features can then be employed to re-rank top results (top-right), increasing precision of the system.” (Cao, page 2, Fig. 1). For global features to be used for similar image selection, and for local features to be used for ranking results, they each must be maintain[ed] in some capacity.

determining, for the input image from the input dataset i) a first set of vectors that represent one or more global features that indicate at least data for the content of the input image as a whole:

“A global feature … summarizes the contents of an image … Global features can learn similarity across very different poses where local features would not be able to find correspondences” (Cao, page 1, paragraph 2); “These two components produce a global feature [expression image] that summarizes the discriminative contents of the whole image” (Cao, page 5, paragraph 2).

ii) a second set of vectors that represent one or more local features that indicate at least geometry and spatial location data for a region from a plurality of regions for the input image, at least some regions from the plurality of regions having different local features:

“The local descriptors (local features) are obtained as L = T(S), where L ∈ R^(H_S × W_S × C_T)” (Cao, page 5, paragraph 5). “Local features [28,7,64,39,34], on the other hand, comprise descriptors and geometry information about specific image regions (plurality of regions); they are especially useful to match images depicting rigid objects” (Cao, page 1, paragraph 2); “The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2).

[figure image] “Fig. 5: Examples of correct local feature matches, for image pairs depicting the same object/scene” (Cao, page 18, Figure 5). Here, each line corresponds to a local feature in some part of the image. Different image regions have different local features.

generating, using the combined feature set, a feature representation that is based on a final embedding output that is representative of content information, geometry information, and spatial information of the input image:

“A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (content information) ... Local features, on the other hand, comprise descriptors and geometry information about specific image regions” (Cao, page 1, paragraph 2); “The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2).

storing the machine-learning model in memory for use by a system to detect objects with at least some features from the plurality of features:

“Our proposed DELG (DEep Local and Global features) model (left) jointly extracts deep local and global features” (Cao, page 2, Figure 1). “For optimal performance, image retrieval requires semantic understanding of the types of objects that a user may be interested in, such that the system can distinguish between relevant objects versus clutter/background” (Cao, page 4, paragraph 2). [table image] “Table 6: Feature extraction latency and database memory requirements for different image retrieval models … (C) DELG and DELG? (the machine-learning model) are compared with different configurations. As a reference, we also provide numbers for DELF in the last rows” (Cao, page 13, Table 5). Cao relates to deep metric learning for images and is analogous to the claimed invention.
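Cao's global/local split can be illustrated with a toy example: one pooled descriptor summarizing the whole image, plus per-location descriptors that keep their spatial coordinates. The pooling choice (generalized-mean) and all shapes below are assumptions for demonstration, not Cao's exact architecture:

```python
import numpy as np

# One backbone notionally yields a deeper map (pooled into a single global
# descriptor for the whole image) and a shallower map (whose per-location
# activations serve as local descriptors tied to spatial positions).
rng = np.random.default_rng(1)
deep = np.abs(rng.standard_normal((7, 7, 32)))     # deeper map, H_D x W_D x C_D
shallow = rng.standard_normal((14, 14, 16))        # shallower map, H_S x W_S x C_S

def global_descriptor(fmap, p=3.0):
    """Generalized-mean pooling over spatial dims -> one whole-image vector."""
    return np.mean(fmap**p, axis=(0, 1)) ** (1.0 / p)

def local_descriptors(fmap):
    """Each spatial cell keeps its (row, col) geometry plus a feature vector."""
    h, w, _ = fmap.shape
    return [((r, c), fmap[r, c]) for r in range(h) for c in range(w)]

g = global_descriptor(deep)          # first set: whole-image content
locs = local_descriptors(shallow)    # second set: region features + positions
print(g.shape, len(locs), locs[0][0])
```

The `(row, col)` tuples carried alongside each local vector are what makes spatial matching possible, which is the advantage of local features the quoted passage emphasizes.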
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified W. Kim to obtain local and global features from the input images, as disclosed by Cao. Both types are necessary for high image retrieval performance (Cao, page 1, paragraph 2).

While W. Kim and Cao fail to disclose the further limitations of the claim, S. Kim teaches a method of computing, using the input image, a proxy-based loss function and a pairwise-based loss function:

“We propose a novel metric learning loss that takes advantages of both pair-based and proxy-based methods“ (S. Kim, page 2, left column, paragraph 3). “Specifically, for each proxy, the loss aims to pull data of the same class close to the proxy and to push others away in the embedding space” (S. Kim, page 2, left column, paragraph 2). “We evaluate our method with Inception-BN backbone while varying the sizes of input images: {224 x 224; 256 x 256; 324 x 324; 448 x 448}. Table 7 also shows that the accuracy improves consistently as the sizes of the input images increase” (S. Kim, page 11, left column, paragraph 3). S. Kim relates to deep metric learning for images and is analogous to the claimed invention.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim and Cao to use proxy-based and pairwise-based loss methods, as disclosed by S. Kim. Doing so would achieve state-of-the-art accuracy through quick convergence while still accounting for data-to-data relations, as well as build resistance to noisy labels and outliers. See S. Kim, page 2, paragraph 2.

Regarding claim 2, the rejection of claim 1 in view of W. Kim, Cao, and S. Kim is incorporated. W. Kim further discloses a method of generating a first set of embeddings corresponding to the first set of vectors based on the proxy-based loss function and the pairwise-based loss function; and generating a second set of embeddings corresponding to the second set of vectors based on the proxy-based loss function and the pairwise-based loss function:

“[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image] where ∘ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing s_m with [expression image]. As discussed regarding claim 1, each [expression image] produces a unique set of vectors for the mth attention pipeline. [figure image] (W. Kim, page 2, Fig. 1(b)). This model supports at least three attention pipelines (three set[s] of embeddings corresponding to … set[s] of vectors). “The loss for training aforementioned attention model is defined as: [equation image] where {(x_i, c_i)} is a set of all training samples and labels, L_metric,(m)(·) is the loss for the isometric embedding for the m-th learner (loss)” (W. Kim, page 6, paragraph 2); “We use contrastive loss (pairwise-based loss) as our distance metric loss function which is defined as the following: [equation image]” (W. Kim, page 7, paragraph 1).

While W. Kim fails to disclose the further limitations of the claim, S. Kim further teaches a method of generating a … set of embeddings corresponding to the … set of vectors based on the proxy-based loss function and the pairwise-based loss function:

“The networks are trained to project data (set of vectors) onto an embedding space in which semantically similar data (e.g., images of the same class) are closely grouped together. Such a quality of the embedding space is given mainly by loss functions used for training the networks, and most of the losses are categorized into two classes: pair-based and proxy-based.” (S. Kim, page 1, left column, paragraph 1). “We propose a novel metric learning loss that takes advantages of both pair-based and proxy-based methods“ (S. Kim, page 2, left column, paragraph 3); “Specifically, for each proxy, the loss aims to pull data of the same class close to the proxy and to push others away in the embedding space” (S. Kim, page 2, left column, paragraph 2). “Let x denote the embedding vector of the input, p+ be the positive proxy, and p− be a negative proxy. The loss is then given by [equation image] where X is a batch of embedding vectors (set of embeddings).” S. Kim relates to deep metric learning for images and is analogous to the claimed invention.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, and S. Kim to use proxy-based and pairwise-based loss methods for feature embeddings, as disclosed by S. Kim. Doing so would achieve state-of-the-art accuracy through quick convergence while still accounting for data-to-data relations, as well as build resistance to noisy labels and outliers. See S. Kim, page 2, paragraph 2.

Regarding claim 3, the rejection of claim 2 in view of W. Kim, Cao, and S. Kim is incorporated. W. Kim further teaches a method, wherein generating a feature representation comprises: generating, from the first and second sets of embeddings, a final embedding output that is representative of content information, geometry information, and spatial information of the input image:

“In addition to the classical ensemble, we can consider the ensemble of two-step embedding function. Consider a function s : X → Z ... And we consider the isometric embedding g : Z → Y ... We are interested in another case where there are multiple embedding functions b_m : X → Y with multiple s_m (sets of embeddings) and a single g as the following: [equation image]” (W. Kim, page 4, paragraphs 4-7); “With the attention-based ensemble, union of metric spaces by multiple s_m is mapped by a single embedding function g” (W. Kim, page 5, paragraph 1); “the combined embedding function Bm(x) for the learner m is defined as the following: [equation image]” (W. Kim, page 5, paragraph 3). The outputs of g constitute a set of features from different embeddings (one per learner) mapped to the same space (final embedding output). “[W]e design an architecture which has multiple attention modules for multiple learners. By attending to different locations for different learners, diverse feature embedding functions are trained” (W. Kim, page 2, paragraph 1); “For deep metric learning, ensemble concatenates the feature embeddings learned by multiple learners” (W. Kim, page 1, paragraph 2). Each learner represents different kinds of features from the inputs. The concatenated embedding is representative of this information as a whole.

While W. Kim fails to disclose the further limitations of the claim, Cao teaches a method of deriving features that are representative of content information, geometry information, and spatial information of the input image:

“A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (content information) ... Local features, on the other hand, comprise descriptors and geometry information about specific image regions” (Cao, page 1, paragraph 2); “The key advantage of local features over global ones for retrieval is the ability to perform spatial matching (spatial information)” (Cao, page 3, paragraph 2). Cao relates to deep metric learning for images and is analogous to the claimed invention.
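S. Kim's proxy-based idea quoted above (pull same-class embeddings toward a learnable class proxy, push others away) can be sketched as follows. This follows the general proxy-anchor form rather than reproducing S. Kim's exact equation, and the margin and scaling constants are illustrative:

```python
import numpy as np

def proxy_anchor_loss(X, labels, proxies, alpha=32.0, delta=0.1):
    """Proxy-anchor-style loss sketch.

    X: (N, D) L2-normalized embeddings; labels: (N,) class ids;
    proxies: (C, D) one normalized proxy per class. Same-class similarities
    are pulled up (positive term), other-class similarities pushed down.
    """
    sims = X @ proxies.T                         # cosine similarities s(x, p)
    pos_terms, neg_terms = [], []
    for p in range(proxies.shape[0]):
        pos = sims[labels == p, p]               # same-class similarities
        neg = sims[labels != p, p]               # other-class similarities
        if pos.size:                             # pull toward the proxy
            pos_terms.append(np.log1p(np.exp(-alpha * (pos - delta)).sum()))
        if neg.size:                             # push away from the proxy
            neg_terms.append(np.log1p(np.exp(alpha * (neg + delta)).sum()))
    return np.mean(pos_terms) + np.mean(neg_terms)

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # L2-normalize embeddings
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proxies = rng.standard_normal((2, 4))
proxies /= np.linalg.norm(proxies, axis=1, keepdims=True)
print(proxy_anchor_loss(X, labels, proxies))
```

Because each proxy stands in for a whole class, the gradient touches all batch embeddings at once, which is the fast-convergence property the rejection cites as the motivation to combine.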
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, and S. Kim to use both local and global features, as disclosed by Cao. Both types are necessary for high image retrieval performance (Cao, page 1, paragraph 2).

Regarding claim 4, the rejection of claim 1 in view of W. Kim, Cao, and S. Kim is incorporated. Cao further discloses a method, wherein maintaining the global features and the local features comprises: encoding, using an encoder module of the architecture, the input image to an attribute range comprising a range that spans from low-level descriptors of the input image to high-level descriptors of the input image:

“Our first contribution is a unified model to represent both local and global features, using a convolutional neural network (CNN) (encoder module), referred to as DELG (DEep Local and Global features). This allows for efficient inference by extracting an image’s (input image of the input dataset) global feature, detected keypoints and local descriptors within a single model” (Cao, page 2, paragraph 3); “A global feature, also commonly referred to as ‘global descriptor’ or ‘embedding’, summarizes the contents of an image (high-level descriptors) ... Local features, on the other hand, comprise descriptors and geometry information about specific image regions (low-level descriptors)” (Cao, page 1, paragraph 2). Together, low- and high-level features encompass an attribute range. Low- and high-level features are being interpreted according to the definition given in paragraph [0023] of the instant Specification. Cao relates to deep metric learning for images and is analogous to the claimed invention.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of W. Kim, Cao, and S. Kim to identify global and local features using a neural network encoder, as disclosed by Cao. Using Cao’s method would allow for efficient inference by extracting local and global features at very low cost within a single model (Cao, page 15, paragraph 3).

Regarding claim 9, the rejection of claim 1 in view of W. Kim, Cao, and S. Kim is incorporated. W. Kim further teaches a method, wherein: the input dataset comprises a plurality of images; and the data values of the input dataset are image pixel values for at least one image:

“We follow earlier works [29,38] for preprocessing and unless stated otherwise, we use the input image size of 224×224. All training and testing images (plurality of images) are scaled such that their longer side is 256, keeping the aspect ratio fixed, and padding the shorter side to get 256×256 images.” (W. Kim, page 7, paragraph 3). While the units aren’t outright specified, 224×224 refers to 224 pixels by 224 pixels.

Regarding claim 21, the rejection of claim 1 in view of W. Kim, Cao, and S. Kim is incorporated. W. Kim further teaches a method, wherein the combined feature set comprises a concatenated feature set:

“Ensemble is a widely used technique of training multiple learners to get a combined model, which performs better than individual models. For deep metric learning, ensemble concatenates the feature embeddings learned by multiple learners which often leads to better embedding space under given constraints on the distances between image pairs” (W. Kim, page 1, paragraph 2). “we present M-way attention-based ensemble (ABE-M) which learns feature embedding with M diverse attention masks” (W. Kim, page 2, paragraph 1). “With the attention-based ensemble, union of metric spaces by multiple s_m is mapped by a single embedding function g” (W. Kim, page 5, paragraph 1). g maps multiple sets of inputs from different embeddings to the same space, concatenating them into the same space. “[T]he combined embedding function B_m(x) for the learner m is defined as the following: [equation image] where ∘ denotes element-wise product” (W. Kim, page 5, paragraph 3). This is W. Kim’s instantiation of g(), replacing

Prosecution Timeline

Nov 16, 2021
Application Filed
Apr 22, 2025
Non-Final Rejection — §103
Jul 23, 2025
Response Filed
Aug 15, 2025
Final Rejection — §103
Oct 07, 2025
Response after Non-Final Action
Nov 18, 2025
Request for Continued Examination
Nov 28, 2025
Response after Non-Final Action
Dec 03, 2025
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12585955
Minimal Trust Data Sharing
2y 5m to grant • Granted Mar 24, 2026
Patent 12579440
Training Artificial Neural Networks Using Context-Dependent Gating with Weight Stabilization
2y 5m to grant • Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 2 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 60%
With Interview: 0% (-60.0%)
Median Time to Grant: 4y 4m
PTA Risk: High
Based on 5 resolved cases by this examiner. Grant probability derived from career allow rate.
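The headline projections appear to follow directly from the counts shown on this page. A sketch of the arithmetic; the interviewed-case count is not shown anywhere above and is assumed here purely for illustration:

```python
# 3 grants out of 5 resolved cases gives the 60% career allow rate, which is
# reused as the grant probability. Zero grants among interviewed resolved
# cases (assumed count below) gives the 0% with-interview figure, and the
# "interview lift" is the difference between the two rates.

granted, resolved = 3, 5
career_allow_rate = 100.0 * granted / resolved            # 60.0%

interview_granted, interview_resolved = 0, 1              # assumed counts
with_interview_rate = 100.0 * interview_granted / max(interview_resolved, 1)
interview_lift = with_interview_rate - career_allow_rate  # -60.0 points

print(career_allow_rate, with_interview_rate, interview_lift)
```

With only five resolved cases, each of these percentages moves in 20-point steps, which is worth keeping in mind when weighing the projections.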
