Last updated: May 04, 2026
Application No. 17/328,779
SYSTEMS AND METHODS FOR HIERARCHICAL MULTI-LABEL CONTRASTIVE LEARNING

Non-Final OA §103
Filed
May 24, 2021
Priority
Mar 17, 2021 — continuation of 63/162,405
Examiner
TRAN, AMY NMN
Art Unit
2126
Tech Center
2100 — Computer Architecture & Software
Assignee
Salesforce Com Inc.
OA Round
3 (Non-Final)
Interview Optional

— +50.5% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 38% grant rate with +50.5% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.
Based on 29 resolved cases, 2023–2026
Examiner Intelligence

TRAN, AMY NMN View full profile →
Grants only 38% of cases
Career Allowance Rate
11 granted / 29 resolved
-17.1% vs TC avg
Strong +50% interview lift
Without
With
+50.5%
Interview Lift
resolved cases with interview
Typical timeline
4y 9m
Avg Prosecution
23 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
32.4%
-7.6% vs TC avg
§103
44.4%
+4.4% vs TC avg
§102
5.9%
-34.1% vs TC avg
§112
15.6%
-24.4% vs TC avg
Black line = Tech Center average estimate • Based on career data from 29 resolved cases
Office Action

§103
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 07/22/2025 has been entered.
				Status of Claims
The amendments filed on 07/22/2025 has been entered. The status of the claims is as follows:
Claims 1-20 remain pending in the application.
Claims 1, 3, 6, 8-9, 11, 14 and 17-20 are amended.
Response to Arguments
In reference to the Claim Rejections under 35 U.S.C 103:
Applicant’s arguments, see Remarks pg. 10-13, filed on 07/22/2025, with respect to the rejection(s) of claim(s) under 35 U.S.C 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Goyal & Ghosh (“Hierarchical Class-Based Curriculum Loss”) and further in view of Maksym Bekuzarov (“Losses explained: Contrastive Loss”).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-5, 9-13 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Ge et al. (“Deep Metric Learning with Hierarchical Triplet Loss”) (hereafter referred to as “Ge”) in view of Yuan et al (“Hard-Aware Deeply Cascaded Embedding”) and further in view of Goyal & Ghosh (“Hierarchical Class-Based Curriculum Loss”) (hereafter referred to as “Goyal”)and Maksym Bekuzarov (“Losses explained: Contrastive Loss”) (hereafter referred to as “Bekuzarov”).
	As per claim 1, Ge explicitly discloses:
A computer-implemented method for training a machine learning model for computer  vision, the method comprising: receiving a set of image samples including at least one image sample that is associated with a set of hierarchical labels at a plurality of levels of image categories; (Ge, Page 6, Section 4, Figure 2: “(a) A toy example of the hierarchical tree H. Different colors represent different image classes in CUB-200-2011 [31]. The leaves are the image classes in the training set. Then they are merged recursively until to the root node. (b) The training data distribution of 100 classes visualized by using t-SNE [16] to reduce the dimension of triplet embedding from 512 to 2.” 
    PNG
    media_image1.png
    300
    802
    media_image1.png
    Greyscale
, Page 4, Figure 1: “(a) Caltech-UCSD Bird Species Dataset [31]. Images in each row are from the same class. There are four classes in different colors - red, green, blue and yellow.”, and Page 4, Section 3.1, ¶[1]: “”) [Examiner’s note: the training dataset i.e., the Caltech-UCSD Bird Species Dataset, a set of hierarchical labels at a plurality of levels i.e., level L0, L1, L2 and L3 of the hierarchical tree H]
selecting, for the at least one image sample, a plurality of positive image samples corresponding to the plurality of levels in the set of hierarchical labels, each positive image sample having a same label at a respective level with the at least one image sample, and at least one negative image sample having completely different labels from those of the at least one image sample; (Ge, Page 7, Section 4.2, ¶[2]: “We randomly select l’ nodes at the 0-th level of the constructed hierarchical tree H. Each node represents an original class, and collecting classes at the 0-th level aims to preserve the diversity of training samples in a mini-batch… Finally, t images for each class are randomly collected, resulting in n (n = l’ mt) images in a mini-batch M.”, Page 7, Section 4.2, ¶[3]: “                        
                            
                                
                                    A
                                
                                
                                    l
                                    '
                                    m
                                
                                
                                    2
                                
                            
                        
                     indicates randomly selecting two classes - a positive class and a negative class, from all l’m classes in the mini-batch.                         
                            
                                
                                    A
                                
                                
                                    t
                                
                                
                                    2
                                
                            
                        
                     means selecting two samples - a anchor sample (                        
                            
                                
                                    x
                                
                                
                                    a
                                
                                
                                    z
                                
                            
                        
                    ) and a positive sample (                        
                            
                                
                                    x
                                
                                
                                    p
                                
                                
                                    z
                                
                            
                        
                    ), from the positive class, and                         
                            
                                
                                    C
                                
                                
                                    t
                                
                                
                                    1
                                
                            
                        
                     means randomly selecting a negative sample (                        
                            
                                
                                    x
                                
                                
                                    n
                                
                                
                                    z
                                
                            
                        
                    ) from the negative class.”, Pg. 4, Section 3.1: “During the neural network training, training samples are selected and formed into triplets, each of which Tz = (xa; xp; xn) are consisted of an anchor sample xa, a positive sample xp and a negative sample xn. The labels of the triplet Tz satisfy ya = yp # yn. Triplet loss aims to pull samples belonging to the same class into nearby points on a manifold surface, and push samples with different labels apart from each other.”) [Examiner’s note: plurality of levels in the set of hierarchical labels i.e., the plurality levels of hierarchical tree H, labels yp # yn means that positive images have similar labels negative images have different labels]
generating a first training dataset including a plurality of positive sample pairs formed by the at least one image sample and the plurality of positive image samples (Ge, Page 2, ¶[2]: “These loss functions are calculated on correlated samples, with a common goal of encouraging samples from the same class to be closer, and pushing samples of different classes apart from each other, in a projected feature space. The correlated samples are grouped into contrastive pairs, triplets or quadruplets, which form the training samples for these loss functions on deep metric learning.”, Page 4, Section 3.1, ¶[1]: “During the neural network training, training samples are selected and formed into triplets, each of which Tz = (xa, xp, xn) are consisted of an anchor sample xa, a positive sample xp and a negative sample xn.”) [Examiner’s note: the contrastive outputs i.e., the contrastive triplet T = (xa, xp, xn)]
	Ge fails to disclose:
wherein the plurality of levels includes a first level of image categories and a second level of image sub-categories to the first level of image categories;
training a machine learning model using the first training dataset while enforcing that a first maximum contrastive loss computed in a training epoch using the at least one image sample and first positive image samples corresponding to the second level of the plurality of levels of hierarchical labels is no greater than a second maximum contrastive loss computed in the same training epoch using the same at least one image sample and second positive image samples at the first level of the plurality of levels of hierarchical labels when the second level is lower than the first level;
generating a second training dataset including the first training dataset and at least one negative sample pair formed by the at least one image sample and the at least one negative image sample; and
training the machine learning model using the second training dataset while enforcing that samples of positive sample pairs of the plurality of positive sample pairs are closer in label space than samples of the at least one negative sample pair; and
in response to an input image at inference of the trained machine learning model, generating, by the trained machine learning model, an image category classification output classifying the input image to an image category of the image categories or an image retrieval output identifying an image from an image gallery that has a same image identifier as the input image
	However, Goyal explicitly discloses:
wherein the plurality of levels includes a first level of image categories and a second level of image sub-categories to the first level of image categories; (Goyal, Pg. 2, ¶[1]: “We extend our hierarchically constrained loss function to incorporate a class-based curriculum learning paradigm, implicitly providing higher weights to simpler classes. With the hierarchical constraints, the model ensures that the classes higher in the hierarchy are selected to provide training examples until the model learns to identify them correctly, before moving on to classes deeper in the hierarchy”, Pg. 7, Section 4.1: “We evaluate our loss function on two real world image data sets… Each diatom can correspond to one or many of the categories arranged in a hierarchy. Overall, there are 399 categories in this data set arranged into a hierarchy of height 4 containing 47 categories”) 
training a machine learning model using the first training dataset while enforcing that a first maximum contrastive loss computed in a training epoch using the at least one image sample and first positive image samples corresponding to the second level of the plurality of levels of hierarchical labels is no greater than a second maximum contrastive loss computed in the same training epoch using the same at least one image sample and second positive image samples at the first level of the plurality of levels of hierarchical labels when the second level is lower than the first level; (Goyal, Pg. 4, Section 3.1: “Consider the learning framework with training set T… with N training examples and input image features… We represent the labels as yi ϵ {-1, 1}C where C is the number of classes and yi, j= 1 means that the ith example belongs to jth class… Let the set of classes C be arranged in a hierarchy H defined by a hierarchy mapping function h : C → 2C which maps a category c ∈ C to its children categories. We use the function m: C → M to denote the mapping from a category c to its level in the hierarchy. We now define the following hierarchical constraint on a generic loss function l, the satisfaction of which would yield the loss function… The constraint implies that the loss increases monotonically with the level of the hierarchy i.e. loss of higher (i.e. closer to the root) levels in the hierarchy is lesser than that of the lower levels (i.e. closer to the leaves). The intuition is that identifying categories in higher level is easier than categories in lower level as they are coarser.”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge and Goyal. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Goyal teaches Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels to train a machine learning model in real world scenarios. One of ordinary skill would have motivation to combine Ge and Goyal because incorporate hierarchical relationships into the training objective results in more semantically meaningful and robust predictions.
However, Yuan explicitly discloses:
generating a second training dataset including the first training dataset and at least one negative sample pair formed by the at least one image sample and the at least one negative image sample; and (Yuan, Pg. 4, Col. 2, Section 3.2, ¶[2]: “Here, we use a toy dataset with positive pairs as illustrated in Figure 3(a) and negative pairs as illustrated in Figure 3(b), together with the model with K = 3 illustrated in Figure 2 to schematically the process of hard example mining. Cascade Model-1 will forward all pairs in P0 and N0, and try to push all positive points towards the anchor point while pushing all negative points away from the anchor point, and form P1, N1 (points in the 2nd and 3rd tier) by selecting hard samples according to its loss. Similarly, P2 and N2 (points in the 3rd tier) are formed by Cascade Model-2.”) [Examiner’s note: Cascade Model-1 uses P0 (positive pairs) and N0(negative pairs) -> This is the first training dataset. After processing, hard positive and hard negative samples are selected based on loss, these include harder positive pairs (P1) and harder negative pairs (N1). These harder samples are forwarded to Cascade Model-2 -> meaning they form a second training dataset that include the first dataset (P0, N0) and the additional hard negative pairs formed from newer combinations.]
training the machine learning model using the second training dataset while enforcing that samples of positive sample pairs of the plurality of positive sample pairs are closer in label space than samples of the at least one negative sample pair; and (Yuan, Pg. 1, Section 1: “Although deep metric embedding is modified into different forms for various tasks, it shares the same objective to learn an embedding space that pulls similar images closer and pushes dissimilar images far away. Typically, the target embedding space is learned with a convolutional neural network equipped with contrastive/triplet loss.”, Pg. 4, Col. 2, Section 3.2: “Figure 2 to schematically the process of hard example mining. Cascade Model-1 will forward all pairs in P0 and N0, and try to push all positive points towards the anchor point while pushing all negative points away from the anchor point, and form P1, N1 (points in the 2nd and 3rd tier) by selecting hard samples according to its loss. Similarly, P2 and N2 (points in the 3rd tier) are formed by Cascade Model-2.”)”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge and Yuan. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Yuan teaches training a cascade of embedding models with increasing depth, where each level handles easy pairs of image and passes only the harder positive pairs to the next deeper model. One of ordinary skill would have motivation to combine Ge and Yuan to ensure that at a lower level, same-label pairs with small distances (low contrastive loss) are resolved, and only more difficult pairs (those with larger loss) are forwarded – effectively enforcing that the hardest positive pair at a shallow level is no harder than those handled at deeper levels.
However, Bekuzarov explicitly discloses:
in response to an input image at inference of the trained machine learning model, generating, by the trained machine learning model, an image category classification output classifying the input image to an image category of the image categories or an image retrieval output identifying an image from an image gallery that has a same image identifier as the input image (Bekuzarov, Pg. 5, ¶[4]: “Indeed, this is how Face Verification can be implemented — a CNN (convolutional neural network) is trained to map input train images of different people to vectors of real numbers (also called “feature-vectors” or “embeddings”) — for example, 128-d vectors, in such a way, that these embeddings of photos of the same person are very close to each other”, Pg. 16: “The input to the entire system is a pair of images (X1, X2) and a label Y . The images are passed through the functions, yielding two outputs G(X1) and G(X2).”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge and Bekuzarov. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Bekuzarov teaches using contrastive loss to train a machine learning model in face verification and face recognition tasks. One of ordinary skill would have motivation to combine Ge and Bekuzarov because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E): “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of the ordinary skill in the art.
	As per claim 2, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of claim 1 (as shown in the rejection above). 
	Ge in view of Yuan, Goyal and Bekuzarov further discloses:
wherein the set of hierarchical labels takes a form of a tree structure according to the plurality of levels, and wherein the tree structure has a root corresponding to a broadest label of the set of hierarchical labels. (Ge, Page 6, Section 4, Figure 2: “A toy example of the hierarchical tree H. Different colors represent different image classes in CUB-200-2011 [31]. The leaves are the image classes in the training set. Then they are merged recursively until to the root node 
    PNG
    media_image1.png
    300
    802
    media_image1.png
    Greyscale
”) [Examiner’s note: the set of hierarchical labels takes a form of a tree structure according to the plurality of levels i.e., L0, L1, L2 and L3, the root node is L3]
As per claim 3, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of claim 2 (as shown in the rejection above). 
	Ge in view of Yuan, Goyal and Bekuzarov further discloses:
randomly selecting the at least one image sample an anchor image sample from the set of image samples; (Ge, Page 9, Algorithm 1, Line 3: “Sample anchors randomly and their neighborhoods according to H;”)
determining an anchor set of hierarchical labels in the tree structure associated with anchor image sample; (Ge, Page 8, Figure 3: “(a)Sampling strategy of each mini-batch. The images in red stand for anchors and the images in blue stand for the nearest neighbors. (b) Train CNNs with the hierarchical triplet loss. (c) Online update of the hierarchical tree.” 
    PNG
    media_image2.png
    317
    855
    media_image2.png
    Greyscale
) [Examiner’s note: In part (a) of figure 3, it shows the sampling strategy where images marked in red are selected as anchor samples, and those marked in blue are selected as the nearest neighbors. The hierarchical tree structure depicted in part (c) is associated with updating the tree based on the loss from the loss function. This process involves determining the anchor set (the red-marked images) and associating these with hierarchical labels in the tree structure.]
randomly selecting, for the anchor image sample at a third level from the plurality of levels, a third positive image sample, from the plurality of positive image samples, that shares common label ancestry from the root up to the third level with the anchor image sample; and (Ge, Page 8, ¶[1]: “                        
                            
                                
                                    A
                                
                                
                                    t
                                
                                
                                    2
                                
                            
                        
                     means selecting two samples - a anchor sample (                        
                            
                                
                                    x
                                
                                
                                    a
                                
                                
                                    z
                                
                            
                        
                    ) and a positive sample (                        
                            
                                
                                    x
                                
                                
                                    n
                                
                                
                                    z
                                
                            
                        
                    ), from the positive class,”, Page 7, Section 4.2, ¶[2]: “We randomly select l’ nodes at the 0-th level of the constructed hierarchical tree H. Each node represents an original class, and collecting classes at the 0-th level aims to preserve the diversity of training samples in a mini-batch, which is important for training deep networks with batch normalization [9]. Then m - 1 nearest classes at the 0-th level are selected for each of the l’ nodes, based on the distance between classes computed in the feature space.”) [Examiner’s note: a positive sample is selected from the same positive class with an anchor sample, wherein the positive class is from level 0 (i.e., shares common label L0) of the hierarchical tree]
forming a first positive pair of the plurality of positive sample pairs from the anchor image sample and the third positive image sample (Ge, Page 4, Section 3.1: “During the neural network training, training samples are selected and formed into triplets, each of which Tz = (xa, xp, xn) are consisted of an anchor sample xa, a positive sample xp and a negative sample xn… 
    PNG
    media_image3.png
    38
    190
    media_image3.png
    Greyscale
 denotes the hinge loss function, and is the violate margin that requires the distance of 
    PNG
    media_image4.png
    45
    138
    media_image4.png
    Greyscale
negative pairs to be larger than the distance of 
    PNG
    media_image5.png
    43
    137
    media_image5.png
    Greyscale
positive pairs.”) [Examiner’s note: the positive pair i.e., 
    PNG
    media_image5.png
    43
    137
    media_image5.png
    Greyscale
]
As per claim 4, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of claim 3 (as shown in the rejection above). 
	Ge in view of Yuan, Goyal and Bekuzarov further discloses:
randomly selecting, for the anchor image sample at another level from the plurality of levels, another positive image sample, from the plurality of positive image samples, until positive image samples according to the plurality of levels have been sampled.  (Ge, Page 7, Section 4.2,  ¶[3]: “                        
                            
                                
                                    A
                                
                                
                                    l
                                    '
                                    m
                                
                                
                                    2
                                
                            
                        
                     indicates randomly selecting two classes - a positive class and a negative class, from all l’m classes in the mini-batch.                         
                            
                                
                                    A
                                
                                
                                    t
                                
                                
                                    2
                                
                            
                        
                     means selecting two samples - a anchor sample (                        
                            
                                
                                    x
                                
                                
                                    a
                                
                                
                                    z
                                
                            
                        
                    ) and a positive sample (                        
                            
                                
                                    x
                                
                                
                                    n
                                
                                
                                    z
                                
                            
                        
                    ), from the positive class,”, Page 7, Section 4.2, ¶[2]: “We randomly select l’ nodes at the 0-th level of the constructed hierarchical tree H. Each node represents an original class, and collecting classes at the 0-th level aims to preserve the diversity of training samples in a mini-batch, which is important for training deep networks with batch normalization [9]. Then m - 1 nearest classes at the 0-th level are selected for each of the l’ nodes, based on the distance between classes computed in the feature space.”, and Page 6, Figure 2: “The leaves are the image classes in the training set. Then they are merged recursively until to the root node.
    PNG
    media_image2.png
    317
    855
    media_image2.png
    Greyscale
”) [Examiner’s note: classes are selected randomly from all l’m classes (i.e., all classes of all nodes l’)]
As per claim 5, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of claim 4 (as shown in the rejection above). 
	Ge in view of Yuan, Goyal and Bekuzarov further discloses:
randomly selecting, from the set of image samples, another anchor image until each image sample of the set of image samples [[have]] has been sampled in a training epoch. (Ge, Page 9, Algorithm 1, Lines 1-3: “while not converge do … Sample anchors randomly and their neighborhoods according to H ;”) [Examiner’s note: the while do loop discloses the training epoch, and the anchors are sampled randomly according to the hierarchical tree in the loop]
As per claim 9, Ge explicitly discloses:
A computer-implemented method for training a machine learning model for computer  vision, the method comprising: a memory storing a plurality of processor-executable instructions for training a machine learning model; and (Ge, Pg. 8, ¶[4]: “All our experiments are implemented using Caffe [10]
and run on an NVIDIA TITAN X(Maxwell) GPU with 12GB memory.”)
one or more hardware processors reading the plurality of processor-executable instructions to perform operations comprising: (Ge, Pg. 8, ¶[4]: “All our experiments are implemented using Caffe [10] and run on an NVIDIA TITAN X(Maxwell) GPU with 12GB memory.”)
receiving a set of image samples including at least one image sample that is associated with a set of hierarchical labels at a plurality of levels of image categories; (Ge, Page 6, Section 4, Figure 2: “(a) A toy example of the hierarchical tree H. Different colors represent different image classes in CUB-200-2011 [31]. The leaves are the image classes in the training set. Then they are merged recursively until to the root node. (b) The training data distribution of 100 classes visualized by using t-SNE [16] to reduce the dimension of triplet embedding from 512 to 2.” 
    PNG
    media_image1.png
    300
    802
    media_image1.png
    Greyscale
, Page 4, Figure 1: “(a) Caltech-UCSD Bird Species Dataset [31]. Images in each row are from the same class. There are four classes in different colors - red, green, blue and yellow.”, and Page 4, Section 3.1, ¶[1]: “”) [Examiner’s note: the training dataset i.e., the Caltech-UCSD Bird Species Dataset, a set of hierarchical labels at a plurality of levels i.e., level L0, L1, L2 and L3 of the hierarchical tree H]
selecting, for the at least one image sample, a plurality of positive image samples corresponding to the plurality of levels in the set of hierarchical labels, each positive image sample having a same label at a respective level with the at least one image sample, and at least one negative image sample having completely different labels from those of the at least one image sample; (Ge, Page 7, Section 4.2, ¶[2]: “We randomly select l’ nodes at the 0-th level of the constructed hierarchical tree H. Each node represents an original class, and collecting classes at the 0-th level aims to preserve the diversity of training samples in a mini-batch… Finally, t images for each class are randomly collected, resulting in n (n = l’ mt) images in a mini-batch M.”, Page 7, Section 4.2, ¶[3]: “                        
                            
                                
                                    A
                                
                                
                                    l
                                    '
                                    m
                                
                                
                                    2
                                
                            
                        
                     indicates randomly selecting two classes - a positive class and a negative class, from all l’m classes in the mini-batch.                         
                            
                                
                                    A
                                
                                
                                    t
                                
                                
                                    2
                                
                            
                        
                     means selecting two samples - a anchor sample (                        
                            
                                
                                    x
                                
                                
                                    a
                                
                                
                                    z
                                
                            
                        
                    ) and a positive sample (                        
                            
                                
                                    x
                                
                                
                                    p
                                
                                
                                    z
                                
                            
                        
                    ), from the positive class, and                         
                            
                                
                                    C
                                
                                
                                    t
                                
                                
                                    1
                                
                            
                        
                     means randomly selecting a negative sample (                        
                            
                                
                                    x
                                
                                
                                    n
                                
                                
                                    z
                                
                            
                        
                    ) from the negative class.”, Pg. 4, Section 3.1: “During the neural network training, training samples are selected and formed into triplets, each of which Tz = (xa; xp; xn) are consisted of an anchor sample xa, a positive sample xp and a negative sample xn. The labels of the triplet Tz satisfy ya = yp # yn. Triplet loss aims to pull samples belonging to the same class into nearby points on a manifold surface, and push samples with different labels apart from each other.”) [Examiner’s note: plurality of levels in the set of hierarchical labels i.e., the plurality levels of hierarchical tree H, labels yp # yn means that positive images have similar labels negative images have different labels]
generating a first training dataset including a plurality of positive sample pairs formed by the at least one image sample and the plurality of positive image samples (Ge, Page 2, ¶[2]: “These loss functions are calculated on correlated samples, with a common goal of encouraging samples from the same class to be closer, and pushing samples of different classes apart from each other, in a projected feature space. The correlated samples are grouped into contrastive pairs, triplets or quadruplets, which form the training samples for these loss functions on deep metric learning.”, Page 4, Section 3.1, ¶[1]: “During the neural network training, training samples are selected and formed into triplets, each of which Tz = (xa, xp, xn) are consisted of an anchor sample xa, a positive sample xp and a negative sample xn.”) [Examiner’s note: the contrastive outputs i.e., the contrastive triplet T = (xa, xp, xn)]
	Ge fails to disclose:
wherein the plurality of levels includes a first level of image categories and a second level of image sub-categories to the first level of image categories;
training a machine learning model using the first training dataset while enforcing that a first maximum contrastive loss computed in a training epoch using the at least one image sample and first positive image samples corresponding to the second level of the plurality of levels of hierarchical labels is no greater than a second maximum contrastive loss computed in the same training epoch using the same at least one image sample and second positive image samples at the first level of the plurality of levels of hierarchical labels when the second level is lower than the first level;
generating a second training dataset including the first training dataset and at least one negative sample pair formed by the at least one image sample and the at least one negative image sample; and
training the machine learning model using the second training dataset while enforcing that samples of positive sample pairs of the plurality of positive sample pairs are closer in label space than samples of the at least one negative sample pair; and
in response to an input image at inference of the trained machine learning model, generating, by the trained machine learning model, an image category classification output classifying the input image to an image category of the image categories or an image retrieval output identifying an image from an image gallery that has a same image identifier as the input image
	However, Goyal explicitly discloses:
wherein the plurality of levels includes a first level of image categories and a second level of image sub-categories to the first level of image categories; (Goyal, Pg. 2, ¶[1]: “We extend our hierarchically constrained loss function to incorporate a class-based curriculum learning paradigm, implicitly providing higher weights to simpler classes. With the hierarchical constraints, the model ensures that the classes higher in the hierarchy are selected to provide training examples until the model learns to identify them correctly, before moving on to classes deeper in the hierarchy”, Pg. 7, Section 4.1: “We evaluate our loss function on two real world image data sets… Each diatom can correspond to one or many of the categories arranged in a hierarchy. Overall, there are 399 categories in this data set arranged into a hierarchy of height 4 containing 47 categories”) 
training a machine learning model using the first training dataset while enforcing that a first maximum contrastive loss computed in a training epoch using the at least one image sample and first positive image samples corresponding to the second level of the plurality of levels of hierarchical labels is no greater than a second maximum contrastive loss computed in the same training epoch using the same at least one image sample and second positive image samples at the first level of the plurality of levels of hierarchical labels when the second level is lower than the first level; (Goyal, Pg. 4, Section 3.1: “Consider the learning framework with training set T… with N training examples and input image features… We represent the labels as yi ϵ {-1, 1}C where C is the number of classes and yi, j= 1 means that the ith example belongs to jth class… Let the set of classes C be arranged in a hierarchy H defined by a hierarchy mapping function h : C → 2C which maps a category c ∈ C to its children categories. We use the function m: C → M to denote the mapping from a category c to its level in the hierarchy. We now define the following hierarchical constraint on a generic loss function l, the satisfaction of which would yield the loss function… The constraint implies that the loss increases monotonically with the level of the hierarchy i.e. loss of higher (i.e. closer to the root) levels in the hierarchy is lesser than that of the lower levels (i.e. closer to the leaves). The intuition is that identifying categories in higher level is easier than categories in lower level as they are coarser.”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge and Goyal. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Goyal teaches Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels to train a machine learning model in real world scenarios. One of ordinary skill would have motivation to combine Ge and Goyal because incorporate hierarchical relationships into the training objective results in more semantically meaningful and robust predictions.
However, Yuan explicitly discloses:
generating a second training dataset including the first training dataset and at least one negative sample pair formed by the at least one image sample and the at least one negative image sample; and (Yuan, Pg. 4, Col. 2, Section 3.2, ¶[2]: “Here, we use a toy dataset with positive pairs as illustrated in Figure 3(a) and negative pairs as illustrated in Figure 3(b), together with the model with K = 3 illustrated in Figure 2 to schematically the process of hard example mining. Cascade Model-1 will forward all pairs in P0 and N0, and try to push all positive points towards the anchor point while pushing all negative points away from the anchor point, and form P1, N1 (points in the 2nd and 3rd tier) by selecting hard samples according to its loss. Similarly, P2 and N2 (points in the 3rd tier) are formed by Cascade Model-2.”) [Examiner’s note: Cascade Model-1 uses P0 (positive pairs) and N0(negative pairs) -> This is the first training dataset. After processing, hard positive and hard negative samples are selected based on loss, these include harder positive pairs (P1) and harder negative pairs (N1). These harder samples are forwarded to Cascade Model-2 -> meaning they form a second training dataset that include the first dataset (P0, N0) and the additional hard negative pairs formed from newer combinations.]
training the machine learning model using the second training dataset while enforcing that samples of positive sample pairs of the plurality of positive sample pairs are closer in label space than samples of the at least one negative sample pair; and (Yuan, Pg. 1, Section 1: “Although deep metric embedding is modified into different forms for various tasks, it shares the same objective to learn an embedding space that pulls similar images closer and pushes dissimilar images far away. Typically, the target embedding space is learned with a convolutional neural network equipped with contrastive/triplet loss.”, Pg. 4, Col. 2, Section 3.2: “Figure 2 to schematically the process of hard example mining. Cascade Model-1 will forward all pairs in P0 and N0, and try to push all positive points towards the anchor point while pushing all negative points away from the anchor point, and form P1, N1 (points in the 2nd and 3rd tier) by selecting hard samples according to its loss. Similarly, P2 and N2 (points in the 3rd tier) are formed by Cascade Model-2.”)”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge and Yuan. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Yuan teaches training a cascade of embedding models with increasing depth, where each level handles easy pairs of image and passes only the harder positive pairs to the next deeper model. One of ordinary skill would have motivation to combine Ge and Yuan to ensure that at a lower level, same-label pairs with small distances (low contrastive loss) are resolved, and only more difficult pairs (those with larger loss) are forwarded – effectively enforcing that the hardest positive pair at a shallow level is no harder than those handled at deeper levels.
However, Bekuzarov explicitly discloses:
in response to an input image at inference of the trained machine learning model, generating, by the trained machine learning model, an image category classification output classifying the input image to an image category of the image categories or an image retrieval output identifying an image from an image gallery that has a same image identifier as the input image (Bekuzarov, Pg. 5, ¶[4]: “Indeed, this is how Face Verification can be implemented — a CNN (convolutional neural network) is trained to map input train images of different people to vectors of real numbers (also called “feature-vectors” or “embeddings”) — for example, 128-d vectors, in such a way, that these embeddings of photos of the same person are very close to each other”, Pg. 16: “The input to the entire system is a pair of images (X1, X2) and a label Y . The images are passed through the functions, yielding two outputs G(X1) and G(X2).”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge and Bekuzarov. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Bekuzarov teaches using contrastive loss to train a machine learning model in face verification and face recognition tasks. One of ordinary skill would have motivation to combine Ge and Bekuzarov because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E): “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of the ordinary skill in the art.
	As per claim 10, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of Claim 9 (as shown in the rejections above).
	Ge in view of Yuan, Goyal and Bekuzarov further discloses:
wherein the set of hierarchical labels takes a form of a tree structure according to the plurality of levels, and wherein the tree structure has a root corresponding to a broadest label of the set of hierarchical labels.  (Ge, Page 6, Section 4, Figure 2: “A toy example of the hierarchical tree H. Different colors represent different image classes in CUB-200-2011 [31]. The leaves are the image classes in the training set. Then they are merged recursively until to the root node 
    PNG
    media_image1.png
    300
    802
    media_image1.png
    Greyscale
”) [Examiner’s note: the set of hierarchical labels takes a form of a tree structure according to the plurality of levels i.e., L0, L1, L2 and L3, the root node is L3]
As per claim 11, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of Claim 10 (as shown in the rejections above).
	Ge in view of Yuan, Goyal and Bekuzarov further discloses:
wherein the one or more hardware processors read the plurality of processor-executable instructions to further perform: randomly selecting the at least one image sample as an anchor image sample from the set of image samples; (Ge, Page 9, Algorithm 1, Line 3: “Sample anchors randomly and their neighborhoods according to H;”)
determining an anchor set of hierarchical labels in the tree structure associated with the anchor image sample; (Ge, Page 8, Figure 3: “(a)Sampling strategy of each mini-batch. The images in red stand for anchors and the images in blue stand for the nearest neighbors. (b) Train CNNs with the hierarchical triplet loss. (c) Online update of the hierarchical tree.” 
    PNG
    media_image2.png
    317
    855
    media_image2.png
    Greyscale
) [Examiner’s note: In part (a) of figure 3, it shows the sampling strategy where images marked in red are selected as anchor samples, and those marked in blue are selected as the nearest neighbors. The hierarchical tree structure depicted in part (c) is associated with updating the tree based on the loss from the loss function. This process involves determining the anchor set (the red-marked images) and associating these with hierarchical labels in the tree structure.]
randomly selecting, for the anchor image sample at a third level from the plurality of levels, a third positive image sample, from the plurality of positive image samples, that shares common label ancestry from the root up to the third level with the anchor image sample; and (Ge, Page 8, ¶[1]: “                        
                            
                                
                                    A
                                
                                
                                    t
                                
                                
                                    2
                                
                            
                        
                     means selecting two samples - a anchor sample (                        
                            
                                
                                    x
                                
                                
                                    a
                                
                                
                                    z
                                
                            
                        
                    ) and a positive sample (                        
                            
                                
                                    x
                                
                                
                                    n
                                
                                
                                    z
                                
                            
                        
                    ), from the positive class,”, Page 7, Section 4.2, ¶[2]: “We randomly select l’ nodes at the 0-th level of the constructed hierarchical tree H. Each node represents an original class, and collecting classes at the 0-th level aims to preserve the diversity of training samples in a mini-batch, which is important for training deep networks with batch normalization [9]. Then m - 1 nearest classes at the 0-th level are selected for each of the l’ nodes, based on the distance between classes computed in the feature space.”) [Examiner’s note: a positive sample is selected from the same positive class with an anchor sample, wherein the positive class is from level 0 (i.e., shares common label L0) of the hierarchical tree]
forming a first positive pair of the plurality of positive sample pairs from the anchor image sample and the third positive image sample (Ge, Page 4, Section 3.1: “During the neural network training, training samples are selected and formed into triplets, each of which Tz = (xa, xp, xn) are consisted of an anchor sample xa, a positive sample xp and a negative sample xn… 
    PNG
    media_image3.png
    38
    190
    media_image3.png
    Greyscale
 denotes the hinge loss function, and is the violate margin that requires the distance of 
    PNG
    media_image4.png
    45
    138
    media_image4.png
    Greyscale
negative pairs to be larger than the distance of 
    PNG
    media_image5.png
    43
    137
    media_image5.png
    Greyscale
positive pairs.”) [Examiner’s note: the positive pair i.e., 
    PNG
    media_image5.png
    43
    137
    media_image5.png
    Greyscale
]
As per claim 12, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of Claim 11 (as shown in the rejections above).
	Ge in view of Yuan, Goyal and Bekuzarov further discloses:
wherein the one or more hardware processors read the plurality of processor-executable instructions to further perform: randomly selecting, for the anchor image sample at another level from the plurality of levels, another positive image sample, from the plurality of positive image samples, until positive image samples according to the plurality of levels have been sampled.  (Ge, Page 7, Section 4.2,  ¶[3]: “                        
                            
                                
                                    A
                                
                                
                                    l
                                    '
                                    m
                                
                                
                                    2
                                
                            
                        
                     indicates randomly selecting two classes - a positive class and a negative class, from all l’m classes in the mini-batch.                         
                            
                                
                                    A
                                
                                
                                    t
                                
                                
                                    2
                                
                            
                        
                     means selecting two samples - a anchor sample (                        
                            
                                
                                    x
                                
                                
                                    a
                                
                                
                                    z
                                
                            
                        
                    ) and a positive sample (                        
                            
                                
                                    x
                                
                                
                                    n
                                
                                
                                    z
                                
                            
                        
                    ), from the positive class,”, Page 7, Section 4.2, ¶[2]: “We randomly select l’ nodes at the 0-th level of the constructed hierarchical tree H. Each node represents an original class, and collecting classes at the 0-th level aims to preserve the diversity of training samples in a mini-batch, which is important for training deep networks with batch normalization [9]. Then m - 1 nearest classes at the 0-th level are selected for each of the l’ nodes, based on the distance between classes computed in the feature space.”, and Page 6, Figure 2: “The leaves are the image classes in the training set. Then they are merged recursively until to the root node.
    PNG
    media_image2.png
    317
    855
    media_image2.png
    Greyscale
”) [Examiner’s note: classes are selected randomly from all l’m classes (i.e., all classes of all nodes l’)]
As per claim 13, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of Claim 12 (as shown in the rejections above).
	Ge in view of Yuan, Goyal and Bekuzarov further discloses:
wherein the one or more hardware processors read the plurality of processor-executable instructions to further perform: randomly selecting, from the set of image samples, another anchor image until each image sample of the set of image samples [[have]] has been sampled in a training epoch.  (Ge, Page 9, Algorithm 1, Lines 1-3: “while not converge do … Sample anchors randomly and their neighborhoods according to H ;”) [Examiner’s note: the while do loop discloses the training epoch, and the anchors are sampled randomly according to the hierarchical tree in the loop]
	As per claim 17, Ge explicitly discloses:
A processor-readable non-transitory storage medium storing a plurality of processor-executable instructions for training a machine learning model for computer vision, the plurality of processor-executable instructions being executed by one or more processors to perform operations comprising: (Ge, Pg. 8, ¶[4]: “Implementation Details. All our experiments are implemented using Caffe [10] and run on an NVIDIA TITAN X(Maxwell) GPU with 12GB memory.”) 
receiving a set of image samples including at least one image sample that is associated with a set of hierarchical labels at a plurality of levels of image categories; (Ge, Page 6, Section 4, Figure 2: “(a) A toy example of the hierarchical tree H. Different colors represent different image classes in CUB-200-2011 [31]. The leaves are the image classes in the training set. Then they are merged recursively until to the root node. (b) The training data distribution of 100 classes visualized by using t-SNE [16] to reduce the dimension of triplet embedding from 512 to 2.” 
    PNG
    media_image1.png
    300
    802
    media_image1.png
    Greyscale
, Page 4, Figure 1: “(a) Caltech-UCSD Bird Species Dataset [31]. Images in each row are from the same class. There are four classes in different colors - red, green, blue and yellow.”, and Page 4, Section 3.1, ¶[1]: “”) [Examiner’s note: the training dataset i.e., the Caltech-UCSD Bird Species Dataset, a set of hierarchical labels at a plurality of levels i.e., level L0, L1, L2 and L3 of the hierarchical tree H]
selecting, for the at least one image sample, a plurality of positive image samples corresponding to the plurality of levels in the set of hierarchical labels, each positive image sample having a same label at a respective level with the at least one image sample, and at least one negative image sample having completely different labels from those of the at least one image sample; (Ge, Page 7, Section 4.2, ¶[2]: “We randomly select l’ nodes at the 0-th level of the constructed hierarchical tree H. Each node represents an original class, and collecting classes at the 0-th level aims to preserve the diversity of training samples in a mini-batch… Finally, t images for each class are randomly collected, resulting in n (n = l’ mt) images in a mini-batch M.”, Page 7, Section 4.2, ¶[3]: “                        
                            
                                
                                    A
                                
                                
                                    l
                                    '
                                    m
                                
                                
                                    2
                                
                            
                        
                     indicates randomly selecting two classes - a positive class and a negative class, from all l’m classes in the mini-batch.                         
                            
                                
                                    A
                                
                                
                                    t
                                
                                
                                    2
                                
                            
                        
                     means selecting two samples - a anchor sample (                        
                            
                                
                                    x
                                
                                
                                    a
                                
                                
                                    z
                                
                            
                        
                    ) and a positive sample (                        
                            
                                
                                    x
                                
                                
                                    p
                                
                                
                                    z
                                
                            
                        
                    ), from the positive class, and                         
                            
                                
                                    C
                                
                                
                                    t
                                
                                
                                    1
                                
                            
                        
                     means randomly selecting a negative sample (                        
                            
                                
                                    x
                                
                                
                                    n
                                
                                
                                    z
                                
                            
                        
                    ) from the negative class.”, Pg. 4, Section 3.1: “During the neural network training, training samples are selected and formed into triplets, each of which Tz = (xa; xp; xn) are consisted of an anchor sample xa, a positive sample xp and a negative sample xn. The labels of the triplet Tz satisfy ya = yp # yn. Triplet loss aims to pull samples belonging to the same class into nearby points on a manifold surface, and push samples with different labels apart from each other.”) [Examiner’s note: plurality of levels in the set of hierarchical labels i.e., the plurality levels of hierarchical tree H, labels yp # yn means that positive images have similar labels negative images have different labels]
generating a first training dataset including a plurality of positive sample pairs formed by the at least one image sample and the plurality of positive image samples (Ge, Page 2, ¶[2]: “These loss functions are calculated on correlated samples, with a common goal of encouraging samples from the same class to be closer, and pushing samples of different classes apart from each other, in a projected feature space. The correlated samples are grouped into contrastive pairs, triplets or quadruplets, which form the training samples for these loss functions on deep metric learning.”, Page 4, Section 3.1, ¶[1]: “During the neural network training, training samples are selected and formed into triplets, each of which Tz = (xa, xp, xn) are consisted of an anchor sample xa, a positive sample xp and a negative sample xn.”) [Examiner’s note: the contrastive outputs i.e., the contrastive triplet T = (xa, xp, xn)]
	Ge fails to disclose:
wherein the plurality of levels includes a first level of image categories and a second level of image sub-categories to the first level of image categories;
training a machine learning model using the first training dataset while enforcing that a first maximum contrastive loss computed in a training epoch using the at least one image sample and first positive image samples corresponding to the second level of the plurality of levels of hierarchical labels is no greater than a second maximum contrastive loss computed in the same training epoch using the same at least one image sample and second positive image samples at the first level of the plurality of levels of hierarchical labels when the second level is lower than the first level;
generating a second training dataset including the first training dataset and at least one negative sample pair formed by the at least one image sample and the at least one negative image sample; and
training the machine learning model using the second training dataset while enforcing that samples of positive sample pairs of the plurality of positive sample pairs are closer in label space than samples of the at least one negative sample pair; and
in response to an input image at inference of the trained machine learning model, generating, by the trained machine learning model, an image category classification output classifying the input image to an image category of the image categories or an image retrieval output identifying an image from an image gallery that has a same image identifier as the input image
	However, Goyal explicitly discloses:
wherein the plurality of levels includes a first level of image categories and a second level of image sub-categories to the first level of image categories; (Goyal, Pg. 2, ¶[1]: “We extend our hierarchically constrained loss function to incorporate a class-based curriculum learning paradigm, implicitly providing higher weights to simpler classes. With the hierarchical constraints, the model ensures that the classes higher in the hierarchy are selected to provide training examples until the model learns to identify them correctly, before moving on to classes deeper in the hierarchy”, Pg. 7, Section 4.1: “We evaluate our loss function on two real world image data sets… Each diatom can correspond to one or many of the categories arranged in a hierarchy. Overall, there are 399 categories in this data set arranged into a hierarchy of height 4 containing 47 categories”) 
training a machine learning model using the first training dataset while enforcing that a first maximum contrastive loss computed in a training epoch using the at least one image sample and first positive image samples corresponding to the second level of the plurality of levels of hierarchical labels is no greater than a second maximum contrastive loss computed in the same training epoch using the same at least one image sample and second positive image samples at the first level of the plurality of levels of hierarchical labels when the second level is lower than the first level; (Goyal, Pg. 4, Section 3.1: “Consider the learning framework with training set T… with N training examples and input image features… We represent the labels as yi ϵ {-1, 1}C where C is the number of classes and yi, j= 1 means that the ith example belongs to jth class… Let the set of classes C be arranged in a hierarchy H defined by a hierarchy mapping function h : C → 2C which maps a category c ∈ C to its children categories. We use the function m: C → M to denote the mapping from a category c to its level in the hierarchy. We now define the following hierarchical constraint on a generic loss function l, the satisfaction of which would yield the loss function… The constraint implies that the loss increases monotonically with the level of the hierarchy i.e. loss of higher (i.e. closer to the root) levels in the hierarchy is lesser than that of the lower levels (i.e. closer to the leaves). The intuition is that identifying categories in higher level is easier than categories in lower level as they are coarser.”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge and Goyal. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Goyal teaches Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels to train a machine learning model in real world scenarios. One of ordinary skill would have motivation to combine Ge and Goyal because incorporate hierarchical relationships into the training objective results in more semantically meaningful and robust predictions.
However, Yuan explicitly discloses:
generating a second training dataset including the first training dataset and at least one negative sample pair formed by the at least one image sample and the at least one negative image sample; and (Yuan, Pg. 4, Col. 2, Section 3.2, ¶[2]: “Here, we use a toy dataset with positive pairs as illustrated in Figure 3(a) and negative pairs as illustrated in Figure 3(b), together with the model with K = 3 illustrated in Figure 2 to schematically the process of hard example mining. Cascade Model-1 will forward all pairs in P0 and N0, and try to push all positive points towards the anchor point while pushing all negative points away from the anchor point, and form P1, N1 (points in the 2nd and 3rd tier) by selecting hard samples according to its loss. Similarly, P2 and N2 (points in the 3rd tier) are formed by Cascade Model-2.”) [Examiner’s note: Cascade Model-1 uses P0 (positive pairs) and N0(negative pairs) -> This is the first training dataset. After processing, hard positive and hard negative samples are selected based on loss, these include harder positive pairs (P1) and harder negative pairs (N1). These harder samples are forwarded to Cascade Model-2 -> meaning they form a second training dataset that include the first dataset (P0, N0) and the additional hard negative pairs formed from newer combinations.]
training the machine learning model using the second training dataset while enforcing that samples of positive sample pairs of the plurality of positive sample pairs are closer in label space than samples of the at least one negative sample pair; and (Yuan, Pg. 1, Section 1: “Although deep metric embedding is modified into different forms for various tasks, it shares the same objective to learn an embedding space that pulls similar images closer and pushes dissimilar images far away. Typically, the target embedding space is learned with a convolutional neural network equipped with contrastive/triplet loss.”, Pg. 4, Col. 2, Section 3.2: “Figure 2 to schematically the process of hard example mining. Cascade Model-1 will forward all pairs in P0 and N0, and try to push all positive points towards the anchor point while pushing all negative points away from the anchor point, and form P1, N1 (points in the 2nd and 3rd tier) by selecting hard samples according to its loss. Similarly, P2 and N2 (points in the 3rd tier) are formed by Cascade Model-2.”)”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge and Yuan. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Yuan teaches training a cascade of embedding models with increasing depth, where each level handles easy pairs of image and passes only the harder positive pairs to the next deeper model. One of ordinary skill would have motivation to combine Ge and Yuan to ensure that at a lower level, same-label pairs with small distances (low contrastive loss) are resolved, and only more difficult pairs (those with larger loss) are forwarded – effectively enforcing that the hardest positive pair at a shallow level is no harder than those handled at deeper levels.
However, Bekuzarov explicitly discloses:
in response to an input image at inference of the trained machine learning model, generating, by the trained machine learning model, an image category classification output classifying the input image to an image category of the image categories or an image retrieval output identifying an image from an image gallery that has a same image identifier as the input image (Bekuzarov, Pg. 5, ¶[4]: “Indeed, this is how Face Verification can be implemented — a CNN (convolutional neural network) is trained to map input train images of different people to vectors of real numbers (also called “feature-vectors” or “embeddings”) — for example, 128-d vectors, in such a way, that these embeddings of photos of the same person are very close to each other”, Pg. 16: “The input to the entire system is a pair of images (X1, X2) and a label Y . The images are passed through the functions, yielding two outputs G(X1) and G(X2).”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge and Bekuzarov. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Bekuzarov teaches using contrastive loss to train a machine learning model in face verification and face recognition tasks. One of ordinary skill would have motivation to combine Ge and Bekuzarov because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E): “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of the ordinary skill in the art.
	As per claim 18, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of Claim 17 (as shown in the rejections above).
	Ge in view of Yuan, Goyal and Bekuzarov further discloses:
wherein the operations comprise: randomly selecting the at least one image sample as an anchor image sample from the set of image samples; (Ge, Page 9, Algorithm 1, Line 3: “Sample anchors randomly and their neighborhoods according to H;”)
determining an anchor set of hierarchical labels in the tree structure associated with the anchor image sample; (Ge, Page 8, Figure 3: “(a)Sampling strategy of each mini-batch. The images in red stand for anchors and the images in blue stand for the nearest neighbors. (b) Train CNNs with the hierarchical triplet loss. (c) Online update of the hierarchical tree.” 
    PNG
    media_image2.png
    317
    855
    media_image2.png
    Greyscale
) [Examiner’s note: In part (a) of figure 3, it shows the sampling strategy where images marked in red are selected as anchor samples, and those marked in blue are selected as the nearest neighbors. The hierarchical tree structure depicted in part (c) is associated with updating the tree based on the loss from the loss function. This process involves determining the anchor set (the red-marked images) and associating these with hierarchical labels in the tree structure.]
randomly selecting, for the anchor image sample at a third level from the plurality of levels, a third positive image sample, from the plurality of positive image samples, that shares common label ancestry from the root up to the third level with the anchor image sample; (Ge, Page 8, ¶[1]: “                        
                            
                                
                                    A
                                
                                
                                    t
                                
                                
                                    2
                                
                            
                        
                     means selecting two samples - a anchor sample (                        
                            
                                
                                    x
                                
                                
                                    a
                                
                                
                                    z
                                
                            
                        
                    ) and a positive sample (                        
                            
                                
                                    x
                                
                                
                                    n
                                
                                
                                    z
                                
                            
                        
                    ), from the positive class,”, Page 7, Section 4.2, ¶[2]: “We randomly select l’ nodes at the 0-th level of the constructed hierarchical tree H. Each node represents an original class, and collecting classes at the 0-th level aims to preserve the diversity of training samples in a mini-batch, which is important for training deep networks with batch normalization [9]. Then m - 1 nearest classes at the 0-th level are selected for each of the l’ nodes, based on the distance between classes computed in the feature space.”) [Examiner’s note: a positive sample is selected from the same positive class with an anchor sample, wherein the positive class is from level 0 (i.e., shares common label L0) of the hierarchical tree]
forming a first positive pair of the plurality of positive sample pairs from the anchor image sample and the third positive image sample; (Ge, Page 4, Section 3.1: “During the neural network training, training samples are selected and formed into triplets, each of which Tz = (xa, xp, xn) are consisted of an anchor sample xa, a positive sample xp and a negative sample xn… 
    PNG
    media_image3.png
    38
    190
    media_image3.png
    Greyscale
 denotes the hinge loss function, and is the violate margin that requires the distance of 
    PNG
    media_image4.png
    45
    138
    media_image4.png
    Greyscale
negative pairs to be larger than the distance of 
    PNG
    media_image5.png
    43
    137
    media_image5.png
    Greyscale
positive pairs.”) [Examiner’s note: the positive pair i.e., 
    PNG
    media_image5.png
    43
    137
    media_image5.png
    Greyscale
]
randomly selecting, for the anchor image sample at another level from the plurality of levels, another positive image sample, from the plurality of positive image samples, until positive image samples according to the plurality of levels have been sampled; and (Ge, Page 7, Section 4.2,  ¶[3]: “                        
                            
                                
                                    A
                                
                                
                                    l
                                    '
                                    m
                                
                                
                                    2
                                
                            
                        
                     indicates randomly selecting two classes - a positive class and a negative class, from all l’m classes in the mini-batch.                         
                            
                                
                                    A
                                
                                
                                    t
                                
                                
                                    2
                                
                            
                        
                     means selecting two samples - a anchor sample (                        
                            
                                
                                    x
                                
                                
                                    a
                                
                                
                                    z
                                
                            
                        
                    ) and a positive sample (                        
                            
                                
                                    x
                                
                                
                                    n
                                
                                
                                    z
                                
                            
                        
                    ), from the positive class,”, Page 7, Section 4.2, ¶[2]: “We randomly select l’ nodes at the 0-th level of the constructed hierarchical tree H. Each node represents an original class, and collecting classes at the 0-th level aims to preserve the diversity of training samples in a mini-batch, which is important for training deep networks with batch normalization [9]. Then m - 1 nearest classes at the 0-th level are selected for each of the l’ nodes, based on the distance between classes computed in the feature space.”, and Page 6, Figure 2: “The leaves are the image classes in the training set. Then they are merged recursively until to the root node.
    PNG
    media_image2.png
    317
    855
    media_image2.png
    Greyscale
”) [Examiner’s note: classes are selected randomly from all l’m classes (i.e., all classes of all nodes l’)]
randomly selecting, from the set of image samples, another anchor image until each image sample of the set of image samples have been sampled in a training epoch. (Ge, Page 9, Algorithm 1, Lines 1-3: “while not converge do … Sample anchors randomly and their neighborhoods according to H ;”) [Examiner’s note: the while do loop discloses the training epoch, and the anchors are sampled randomly according to the hierarchical tree in the loop]
Claim(s) 6-7, 14-15 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Ge et al. (“Deep Metric Learning with Hierarchical Triplet Loss”) (hereafter referred to as “Ge”) in view of Yuan et al (“Hard-Aware Deeply Cascaded Embedding”), Goyal & Ghosh (“Hierarchical Class-Based Curriculum Loss”) (hereafter referred to as “Goyal”), Maksym Bekuzarov (“Losses explained: Contrastive Loss”) (hereafter referred to as “Bekuzarov”) and further in view of Xu et al. (“Hierarchical Semantic Aggregation for Contrastive Representation Learning”) (hereafter referred to as “Xu”).
As per claim 6, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of claim 3 (as shown in the rejection above). 
Ge in view of Yuan, Goyal and Bekuzarov further discloses:
computing a first pair loss corresponding to the first positive pair based on a distance between the anchor representation and the first positive representation in a feature space. (Ge, Page 4, Col. 1: “Symmetrically, for positive sample xp, we also have: 
    PNG
    media_image6.png
    197
    571
    media_image6.png
    Greyscale
The overall loss is the combination of the two losses, which is equipped with two positive samples in the query encoder fq, and the corresponding two positive samples in the key encoder fk. Each sample is accompanied with a random data augmentation as described in [5], and is pulled together with all positive samples (also undergo a random data augmentation) from the other encoder.”)
Ge in view of Yuan, Goyal and Bekuzarov fails to disclose:
generating, by an encoder, an anchor representation and a first positive representation from the anchor image sample and the third positive image sample, respectively; and
However, Xu explicitly discloses:
generating, by an encoder, an anchor representation and a first positive representation from the anchor image sample and the third positive image sample, respectively; and (Xu, Page 3, Col. 2, Section 3.2: “For positive samples, we simply make use of k nearest neighbors to search semantically similar images in the embedding space. Specially, given unlabeled training set X = {x1, x2, …, xn} and a query encoder fq, we obtain the corresponding embedding representation V = {v1, v2,…, vn} where vi = fq (xi)… Given an anchor sample xa, we compute the cosine similarity with all other images, and select the top k samples with the highest similarity as positives Ω = {x1, x2,…, xk}”, Page 4, Col. 1, ¶[1]: “We simply adjust the contrastive loss in Eq. 1 to allow for multiple positives per anchor. Given an anchor sample xa and its nearest neighborhood set Ω, we randomly select a positive sample xp ϵ Ω, and the loss function can be reformulated as:… where each anchor sample qa encoded with fq, is pulled with two samples ka and kq encoded with fk, and pushed away with all other samples in the key encoder fk.”, and Page 4, Col. 1, ¶[2]: “The overall loss is the combination of the two losses, which is equipped with two positive samples in the query encoder fq, and the corresponding two positive samples in the key encoder fk. Each sample is accompanied with a random data augmentation as described in [5], and is pulled together with all positive samples (also undergo a random data augmentation) from the other encoder.”) [Examiner’s note: an anchor representation i.e., each anchor sample qa encoded with fq, the positive representation i.e., positive samples in the query encoder fq]
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge, Yuan, Goyal, Bekuzarov and Xu. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Bekuzarov teaches using contrastive loss to train a machine learning model in face verification and face recognition tasks. Xu tackles the representation inefficiency of contrastive learning and propose a hierarchical training strategy. Yuan teaches training a cascade of embedding models with increasing depth, where each level handles easy pairs of image and passes only the harder positive pairs to the next deeper model. Goyal teaches Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels to train a machine learning model in real world scenarios. One of ordinary skill would have motivation to combine Ge, Yuan, Goyal, Bekuzarov and Xu because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if variations are predictable to one of ordinary skill in the art.

As per claim 7, the combination of Ge, Yuan, Goyal, Bakuzarov and Xu discloses all the limitations of claim 6 (as shown in the rejection above). 
Xu further discloses:
computing a loss objective based at least in part on summing pair losses over positive image samples at each level and over the plurality of levels.  (Xu, Page 5, Col. 1: “When there are L losses corresponding to L intermediate stages, the final losses of the whole network can be computed as: 
    PNG
    media_image7.png
    84
    397
    media_image7.png
    Greyscale
” ) [Examiner’s note: the training objective is being interpreted as Ltotal as this is the total losses corresponding to L levels (i.e., stages)]
As per claim 14, the combination of Ge, Yuan, Goyal and Bakuzarov discloses all the limitations of Claim 11 (as shown in the rejections above).
Ge in view of Yuan, Goyal and Bekuzarov further discloses:
computing a first pair loss corresponding to the first positive pair based on a distance between the anchor representation and the first positive representation in a feature space. (Ge, Page 4, Col. 1: “Symmetrically, for positive sample xp, we also have: 
    PNG
    media_image6.png
    197
    571
    media_image6.png
    Greyscale
The overall loss is the combination of the two losses, which is equipped with two positive samples in the query encoder fq, and the corresponding two positive samples in the key encoder fk. Each sample is accompanied with a random data augmentation as described in [5], and is pulled together with all positive samples (also undergo a random data augmentation) from the other encoder.”)
Ge in view of Yuan, Goyal and Bekuzarov fails to disclose:
generating, by an encoder, an anchor representation and a first positive representation from the anchor image sample and the third positive image sample, respectively; and
However, Xu explicitly discloses:
generating, by an encoder, an anchor representation and a first positive representation from the anchor image sample and the third positive image sample, respectively; and (Xu, Page 3, Col. 2, Section 3.2: “For positive samples, we simply make use of k nearest neighbors to search semantically similar images in the embedding space. Specially, given unlabeled training set X = {x1, x2, …, xn} and a query encoder fq, we obtain the corresponding embedding representation V = {v1, v2,…, vn} where vi = fq (xi)… Given an anchor sample xa, we compute the cosine similarity with all other images, and select the top k samples with the highest similarity as positives Ω = {x1, x2,…, xk}”, Page 4, Col. 1, ¶[1]: “We simply adjust the contrastive loss in Eq. 1 to allow for multiple positives per anchor. Given an anchor sample xa and its nearest neighborhood set Ω, we randomly select a positive sample xp ϵ Ω, and the loss function can be reformulated as:… where each anchor sample qa encoded with fq, is pulled with two samples ka and kq encoded with fk, and pushed away with all other samples in the key encoder fk.”, and Page 4, Col. 1, ¶[2]: “The overall loss is the combination of the two losses, which is equipped with two positive samples in the query encoder fq, and the corresponding two positive samples in the key encoder fk. Each sample is accompanied with a random data augmentation as described in [5], and is pulled together with all positive samples (also undergo a random data augmentation) from the other encoder.”) [Examiner’s note: an anchor representation i.e., each anchor sample qa encoded with fq, the positive representation i.e., positive samples in the query encoder fq]
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge, Yuan, Goyal, Bekuzarov and Xu. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Bekuzarov teaches using contrastive loss to train a machine learning model in face verification and face recognition tasks. Xu tackles the representation inefficiency of contrastive learning and propose a hierarchical training strategy. Yuan teaches training a cascade of embedding models with increasing depth, where each level handles easy pairs of image and passes only the harder positive pairs to the next deeper model. Goyal teaches Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels to train a machine learning model in real world scenarios. One of ordinary skill would have motivation to combine Ge, Yuan, Goyal, Bekuzarov and Xu because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if variations are predictable to one of ordinary skill in the art.
As per claim 15, the combination of Ge, Yuan, Goyal, Bakuzarov and Xu discloses all the limitations of Claim 14 (as shown in the rejections above).
Xu further discloses:
wherein the one or more hardware processors read the plurality of processor-executable instructions to further perform: computing a loss objective based at least in part on summing pair losses over positive image samples at each level and over the plurality of levels.  (Xu, Page 5, Col. 1: “When there are L losses corresponding to L intermediate stages, the final losses of the whole network can be computed as: 
    PNG
    media_image7.png
    84
    397
    media_image7.png
    Greyscale
” ) [Examiner’s note: the training objective is being interpreted as Ltotal as this is the total losses corresponding to L levels (i.e., stages)]
As per claim 19, the combination of Ge, Yuan, Goyal and Bekuzarov discloses all the limitations of claim 17 (as shown in the rejection above). 
Ge in view of Yuan, Goyal and Bekuzarov further discloses:
computing a first pair loss corresponding to the first positive pair based on a distance between the anchor representation and the first positive representation in a feature space. (Ge, Page 4, Col. 1: “Symmetrically, for positive sample xp, we also have: 
    PNG
    media_image6.png
    197
    571
    media_image6.png
    Greyscale
The overall loss is the combination of the two losses, which is equipped with two positive samples in the query encoder fq, and the corresponding two positive samples in the key encoder fk. Each sample is accompanied with a random data augmentation as described in [5], and is pulled together with all positive samples (also undergo a random data augmentation) from the other encoder.”)
Ge in view of Yuan, Goyal and Bekuzarov fails to disclose:
generating, by an encoder, an anchor representation and a first positive representation from the anchor image sample and the third positive image sample, respectively; and
However, Xu explicitly discloses:
generating, by an encoder, an anchor representation and a first positive representation from the anchor image sample and the third positive image sample, respectively; and (Xu, Page 3, Col. 2, Section 3.2: “For positive samples, we simply make use of k nearest neighbors to search semantically similar images in the embedding space. Specially, given unlabeled training set X = {x1, x2, …, xn} and a query encoder fq, we obtain the corresponding embedding representation V = {v1, v2,…, vn} where vi = fq (xi)… Given an anchor sample xa, we compute the cosine similarity with all other images, and select the top k samples with the highest similarity as positives Ω = {x1, x2,…, xk}”, Page 4, Col. 1, ¶[1]: “We simply adjust the contrastive loss in Eq. 1 to allow for multiple positives per anchor. Given an anchor sample xa and its nearest neighborhood set Ω, we randomly select a positive sample xp ϵ Ω, and the loss function can be reformulated as:… where each anchor sample qa encoded with fq, is pulled with two samples ka and kq encoded with fk, and pushed away with all other samples in the key encoder fk.”, and Page 4, Col. 1, ¶[2]: “The overall loss is the combination of the two losses, which is equipped with two positive samples in the query encoder fq, and the corresponding two positive samples in the key encoder fk. Each sample is accompanied with a random data augmentation as described in [5], and is pulled together with all positive samples (also undergo a random data augmentation) from the other encoder.”) [Examiner’s note: an anchor representation i.e., each anchor sample qa encoded with fq, the positive representation i.e., positive samples in the query encoder fq]
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge, Yuan, Goyal, Bekuzarov and Xu. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Bekuzarov teaches using contrastive loss to train a machine learning model in face verification and face recognition tasks. Xu tackles the representation inefficiency of contrastive learning and propose a hierarchical training strategy. Yuan teaches training a cascade of embedding models with increasing depth, where each level handles easy pairs of image and passes only the harder positive pairs to the next deeper model. Goyal teaches Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels to train a machine learning model in real world scenarios. One of ordinary skill would have motivation to combine Ge, Yuan, Goyal, Bekuzarov and Xu because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if variations are predictable to one of ordinary skill in the art.
Claim(s) 8, 16 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ge et al. (“Deep Metric Learning with Hierarchical Triplet Loss”) (hereafter referred to as “Ge”) in view of Yuan et al (“Hard-Aware Deeply Cascaded Embedding”) (hereafter referred to as “Yuan”), Goyal & Ghosh (“Hierarchical Class-Based Curriculum Loss”) (hereafter referred to as “Goyal”)and Maksym Bekuzarov (“Losses explained: Contrastive Loss”) (hereafter referred to as “Bekuzarov”), Xu et al. (“Hierarchical Semantic Aggregation for Contrastive Representation Learning”) (hereafter referred to as “Xu”) and further in view of Ramadiansyah & Rahadianti (“Proxy-based Losses and Pair-based Losses for Face Image Retrieval”)
 	As per claim 8, the combination of Ge, Yuan, Goyal, Bakuzarov and Xu discloses all the limitations of claim 6 (as shown in the rejections above).
	Ge in view of Yuan, Goyal, Bakuzarov and Xu further discloses:
at the respective level subject to a condition that the respective maximum pair loss is no less than another maximum pair loss corresponding to a lower level; and (Ge, Page 2, ¶[2]: “These loss functions are calculated on correlated samples, with a common goal of encouraging samples from the same class to be closer, and pushing samples of different classes apart from each other, in a projected feature space.”, Page 8, ¶[2]: “In our hierarchical triplet loss, a sample xa is encouraged to push the nearby points with different semantic meanings apart from itself.”, Page 8, Figure 3: 
    PNG
    media_image2.png
    317
    855
    media_image2.png
    Greyscale
) [Examiner’s note: The concept of the pair loss of lower level being greater than the pair loss of upper level is disclosed by Ge when Ge illustrates that the loss function encourages samples from the same class (lower loss value) to be closer (upper level) and pushing samples of different classes (higher loss value) apart (lower level) ]
computing a loss objective based at least in part on summing maximum pair losses over positive image samples at each level and among the plurality of levels.  (Xu, Page 5, Col. 1: “When there are L losses corresponding to L intermediate stages, the final losses of the whole network can be computed as: 
    PNG
    media_image7.png
    84
    397
    media_image7.png
    Greyscale
” ) [Examiner’s note: the training objective is being interpreted as Ltotal as this is the total losses corresponding to L levels (i.e., stages)]
	Ge in view of Yuan, Goyal, Bakuzarov and Xu fails to disclose:
determining, at each level from the plurality of levels, a respective maximum pair loss among positive pairs
	However, Ramadiansyah & Rahadianti explicitly discloses:
determining, at each level from the plurality of levels, a respective maximum pair loss among positive pairs (Ramadiansyah & Rahadianti, Page 178, Col. 1, Section 1: “The two input images (anchor and pair) are passed through the ConvNet to generate a fixed-length feature vector and are calculated by using Contrastive loss as illustrated in Eq. 1 as follows: 
    PNG
    media_image8.png
    84
    650
    media_image8.png
    Greyscale
where a is the anchor image or test image, p is positive image that belongs to same class with anchor image and n is negative image that has different class with anchor image.”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge, Yuan, Goyal, Bekuzarov, Xu and Ramadiansyah. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Goyal teaches Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels to train a machine learning model in real world scenarios. Bekuzarov teaches using contrastive loss to train a machine learning model in face verification and face recognition tasks. Xu tackles the representation inefficiency of contrastive learning and propose a hierarchical training strategy. Yuan teaches training a cascade of embedding models with increasing depth, where each level handles easy pairs of image and passes only the harder positive pairs to the next deeper model. Ramadiansyah teaches proxy-based losses and pair-based losses for face image retrieval. One of ordinary skill would have motivation to combine Ge, Yuan, Goyal, Bekuzarov, Xu and Ramadiansyah because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if variations are predictable to one of ordinary skill in the art.
As per claim 16, the combination of Ge, Yuan, Goyal, Bakuzarov and Xu discloses all the limitations of claim 14 (as shown in the rejections above).
	Ge in view of Yuan, Goyal, Bakuzarov and Xu further discloses:
at the respective level subject to a condition that the respective maximum pair loss is no less than another maximum pair loss corresponding to a lower label level; and (Ge, Page 2, ¶[2]: “These loss functions are calculated on correlated samples, with a common goal of encouraging samples from the same class to be closer, and pushing samples of different classes apart from each other, in a projected feature space.”, Page 8, ¶[2]: “In our hierarchical triplet loss, a sample xa is encouraged to push the nearby points with different semantic meanings apart from itself.”, Page 8, Figure 3: 
    PNG
    media_image2.png
    317
    855
    media_image2.png
    Greyscale
) [Examiner’s note: The concept of the pair loss of lower level being greater than the pair loss of upper level is disclosed by Ge when Ge illustrates that the loss function encourages samples from the same class (lower loss value) to be closer (upper level) and pushing samples of different classes (higher loss value) apart (lower level) ]
computing a loss objective based at least in part on summing maximum pair losses over positive image samples at each level and among the plurality of levels.  (Xu, Page 5, Col. 1: “When there are L losses corresponding to L intermediate stages, the final losses of the whole network can be computed as: 
    PNG
    media_image7.png
    84
    397
    media_image7.png
    Greyscale
” ) [Examiner’s note: the training objective is being interpreted as Ltotal as this is the total losses corresponding to L levels (i.e., stages)]
	Ge in view of Yuan, Goyal, Bakuzarov and Xu fails to disclose:
determining, at each level from the plurality of levels, a respective maximum pair loss among positive pairs
	However, Ramadiansyah & Rahadianti explicitly discloses:
determining, at each level from the plurality of levels, a respective maximum pair loss among positive pairs (Ramadiansyah & Rahadianti, Page 178, Col. 1, Section 1: “The two input images (anchor and pair) are passed through the ConvNet to generate a fixed-length feature vector and are calculated by using Contrastive loss as illustrated in Eq. 1 as follows: 
    PNG
    media_image8.png
    84
    650
    media_image8.png
    Greyscale
where a is the anchor image or test image, p is positive image that belongs to same class with anchor image and n is negative image that has different class with anchor image.”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge, Yuan, Goyal, Bekuzarov, Xu and Ramadiansyah. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Goyal teaches Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels to train a machine learning model in real world scenarios. Bekuzarov teaches using contrastive loss to train a machine learning model in face verification and face recognition tasks. Xu tackles the representation inefficiency of contrastive learning and propose a hierarchical training strategy. Yuan teaches training a cascade of embedding models with increasing depth, where each level handles easy pairs of image and passes only the harder positive pairs to the next deeper model. Ramadiansyah teaches proxy-based losses and pair-based losses for face image retrieval. One of ordinary skill would have motivation to combine Ge, Yuan, Goyal, Bekuzarov, Xu and Ramadiansyah because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if variations are predictable to one of ordinary skill in the art.
As per claim 20, the combination of Ge, Yuan, Goyal, Bakuzarov and Xu discloses all the limitations of claim 19 (as shown in the rejections above).
	Ge in view of Yuan, Goyal, Bakuzarov and Xu further discloses:
at the respective level subject to a condition that the respective maximum pair loss is no less than another maximum pair loss corresponding to a lower level; and (Ge, Page 2, ¶[2]: “These loss functions are calculated on correlated samples, with a common goal of encouraging samples from the same class to be closer, and pushing samples of different classes apart from each other, in a projected feature space.”, Page 8, ¶[2]: “In our hierarchical triplet loss, a sample xa is encouraged to push the nearby points with different semantic meanings apart from itself.”, Page 8, Figure 3: 
    PNG
    media_image2.png
    317
    855
    media_image2.png
    Greyscale
) [Examiner’s note: The concept of the pair loss of lower level being greater than the pair loss of upper level is disclosed by Ge when Ge illustrates that the loss function encourages samples from the same class (lower loss value) to be closer (upper level) and pushing samples of different classes (higher loss value) apart (lower level) ]
computing a loss objective based at least in part on summing maximum pair losses over positive image samples at each level and among the plurality of levels.  (Xu, Page 5, Col. 1: “When there are L losses corresponding to L intermediate stages, the final losses of the whole network can be computed as: 
    PNG
    media_image7.png
    84
    397
    media_image7.png
    Greyscale
” ) [Examiner’s note: the training objective is being interpreted as Ltotal as this is the total losses corresponding to L levels (i.e., stages)]
	Ge in view of Yuan, Goyal, Bakuzarov and Xu fails to disclose:
determining, at each level from the plurality of levels, a respective maximum pair loss among positive pairs
	However, Ramadiansyah & Rahadianti explicitly discloses:
determining, at each level from the plurality of levels, a respective maximum pair loss among positive pairs (Ramadiansyah & Rahadianti, Page 178, Col. 1, Section 1: “The two input images (anchor and pair) are passed through the ConvNet to generate a fixed-length feature vector and are calculated by using Contrastive loss as illustrated in Eq. 1 as follows: 
    PNG
    media_image8.png
    84
    650
    media_image8.png
    Greyscale
where a is the anchor image or test image, p is positive image that belongs to same class with anchor image and n is negative image that has different class with anchor image.”)
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ge, Yuan, Goyal, Bekuzarov, Xu and Ramadiansyah. Ge teaches a novel hierarchical triplet loss capable of automatically collecting informative training samples via a defined hierarchical tree. Goyal teaches Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels to train a machine learning model in real world scenarios. Bekuzarov teaches using contrastive loss to train a machine learning model in face verification and face recognition tasks. Xu tackles the representation inefficiency of contrastive learning and propose a hierarchical training strategy. Yuan teaches training a cascade of embedding models with increasing depth, where each level handles easy pairs of image and passes only the harder positive pairs to the next deeper model. Ramadiansyah teaches proxy-based losses and pair-based losses for face image retrieval. One of ordinary skill would have motivation to combine Ge, Yuan, Goyal, Bekuzarov, Xu and Ramadiansyah because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if variations are predictable to one of ordinary skill in the art.
Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMY TRAN whose telephone number is (571)270-0693. The examiner can normally be reached Monday - Friday 7:30 am - 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AMY TRAN/Examiner, Art Unit 2126                                                                                                                                                                                                        
/DAVID YI/Supervisory Patent Examiner, Art Unit 2126
Read full office action
Prosecution Timeline

Show 5 earlier events
Apr 15, 2025
Final Rejection — §103
Jul 21, 2025
Applicant Interview (Telephonic)
Jul 22, 2025
Examiner Interview Summary
Jul 22, 2025
Request for Continued Examination
Jul 29, 2025
Response after Non-Final Action
Feb 02, 2026
Non-Final Rejection — §103
Apr 30, 2026
Applicant Interview (Telephonic)
Apr 30, 2026
Examiner Interview Summary
Precedent Cases

Applications granted by this same examiner with similar technology

17/226,399
Patent 12602582
DYNAMIC DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS
5y 0m to grant Granted Apr 14, 2026
17/137,588
Patent 12468932
IDENTIFYING RELATED MESSAGES IN A NATURAL LANGUAGE INTERACTION
4y 10m to grant Granted Nov 11, 2025
16/996,310
Patent 12462185
SCENE GRAMMAR BASED REINFORCEMENT LEARNING IN AGENT TRAINING
5y 2m to grant Granted Nov 04, 2025
17/111,611
Patent 12423589
TRAINING DECISION TREE-BASED PREDICTIVE MODELS
4y 9m to grant Granted Sep 23, 2025
16/261,092
Patent 12288074
GENERATING AND PROVIDING PROPOSED DIGITAL ACTIONS IN HIGH-DIMENSIONAL ACTION SPACES USING REINFORCEMENT LEARNING MODELS
6y 3m to grant Granted Apr 29, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

3-4
Expected OA Rounds
38%
Grant Probability
88%
With Interview (+50.5%)
4y 9m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 29 resolved cases by this examiner. Grant probability derived from career allowance rate.