Last updated: May 29, 2026
Application No. 15/237,970
METHOD AND SYSTEM FOR MULTI-LABEL PREDICTION

Final Rejection §103
Filed
Aug 16, 2016
Examiner
LO, ANN J
Art Unit
2159
Tech Center
2100 — Computer Architecture & Software
Assignee
Yahoo Assets LLC
OA Round
12 (Final)
Interview Optional

— +26.7% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 45% grant rate with +26.7% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.
Based on 224 resolved cases, 2023–2026
Examiner Intelligence

LO, ANN J View full profile →
Grants 45% of resolved cases
Career Allowance Rate
100 granted / 224 resolved
-10.4% vs TC avg
Strong +27% interview lift
Without
With
+26.7%
Interview Lift
resolved cases with interview
Typical timeline
4y 7m
Avg Prosecution
3 currently pending
Career history
238
Total Applications
across all art units
Statute-Specific Performance

§101
1.2%
-38.8% vs TC avg
§103
73.7%
+33.7% vs TC avg
§102
23.6%
-16.4% vs TC avg
§112
1.1%
-38.9% vs TC avg
Black line = Tech Center average estimate • Based on career data from 224 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .


Response to Arguments
Applicant’s arguments with respect to claim(s) rejections under 35 USC 103 have been fully considered but they are not persuasive. Applicant argues that Bhatia does not teach the limitation “regression matrix learned in accordance with a random regression model”. Examiner respectfully disagrees.  First, the specification does not provide antecedent basis nor does it show the subject matter claimed, “a random regression model” Since the specification is silent on the claimed limitation of “a random regression model, it is noted that the aspect of “random” is stated in relation to random matrix, random ensembles that satisfy RIP, very low-dimensional random embedding in relation to RIP.  Thus, the interpretation for “random regression model” is taken from Applicants spec [0040] to be “low-dimensional embedding” Bhatia teaches this matter in combination with Hsu, Bhatia discloses clusters above as shown and states “learns a separate embedding per cluster.” Bhatia, analogously to Hsu’s regression matrix A of labels (“We use A to compress  … the labels Y”), discloses on Page 4 Section 2.1, “label matrix Y” for which an “embedding matrix Z” is found. Bhatia then states “However, Y can still be accurately modeled using a low-dimensional non-linear manifold That is, instead of preserving distances (or inner products) of a given label vector to all the training points, we attempt to preserve the distance to only a few nearest neighbors.. That is, we wish to find a L-dimensional embedding matrix Z … which minimizes the following objective.” here Bhatia discloses reducing a label matrix Y to a matrix Z using regression (i.e., minimizing an objective function), and specifically non-linear regression. This is further noted below Eq. 2 where it is stated: “we wish to learn a multi-regression model to predict the embeddings Z” and they state that this is “somewhat similar to a few existing methods for non-linear dimensionality reduction that also seek to preserve distances to a few near neighbors.”) 

Specification
The specification is objected to as failing to provide proper antecedent basis for the claimed subject matter.  See 37 CFR 1.75(d)(1) and MPEP § 608.01(o).  Correction of the following is required: “a random regression model”

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-21 are rejected under 35 U.S.C. 103 as being unpatentable over Hsu et al. (“Multi-Label Prediction via Compressed Sensing”; hereinafter “Hsu”) in view of 
Niculescu-Mizil et al., (US 2016/0328466, hereinafter “Niculescu-Mizil”), further in view of Bhatia et al. (“Locally Non-linear Embeddings for Extreme Multi-label Learning”; hereinafter “Bhatia”), further in view of Eskimez et al. (“WISE: Web-based Interactive Speech Emotion Classification”; hereinafter “Eskimez”), and further in view of Murphy-Chutorian et al. (US 2015/0081703 A1; hereinafter “Murphy-Chutorian”).
	As per Claim 1, Hsu teaches a method implemented [on a computing device having at least one processor, storage, and a communication platform connected to a network] for multi-label prediction (Hsu, Page 1 Abstract, discloses:  “We consider multi-label prediction problems with large output spaces under the assumption of output sparsity
– that the target (label) vectors have small support.”), the method comprising:
generating a label space from an initial label space having a dimension representing features and an initial dimension representing labels [by separately training the label space based on each of the plurality of clusters to reduce the initial dimension representing labels] (Hsu, Page 3 Section 3.1, discloses:  “Let A : Rd → Rm be a linear compression function, where m ≤ d (but hopefully m ≪ d). We use A to compress (i.e. reduce the dimension of) the labels Y, and learn a predictor H : X → A(Y) of these compressed labels. Since A is linear, we simply represent A ∈ Rm×d as a matrix. Specifically, given a sample {(xi, yi)}ni=1, we form a compressed sample {(xi, Ayi)}ni=1 and then learn a predictor H of E[Ay|x] with the objective of minimizing the ℓ22-error Ex||H(x) − E[Ay|x]||22”.  
Here, Hsu discloses an initial label space having a dimension representing features (“xi”) and an initial dimension representing labels (“yi”)).
wherein a corresponding set of low-dimensional label vectors and a corresponding regression matrix [learned in accordance with a non-linear regression model or a random regression model] are generated [for each of the clusters], (Hsu, Page 3 Section 3.1 as shown above, discloses a corresponding set of low-dimensional (“compress (i.e. reduce the dimension of)”) label vectors (“Ayi”).  This is performed by a corresponding matrix (“A”; “we simply represent A ∈ Rm×d as a matrix”).  This matrix is a “regression matrix”, as “regression” is disclosed by Hsu in Page 4 Figure 1:

    PNG
    media_image1.png
    246
    693
    media_image1.png
    Greyscale

As shown above in Algorithm 1, the matrix A is applied to a “regression learning algorithm”, and is thus a “regression matrix”.
Examiner also notes that Hsu states this matrix A satisfies Restricted Isometry Property (RIP) in Section 4.1:  “Several valid reconstruction algorithms are known for compression matrices that satisfy a restricted isometry property. Definition. A matrix A ∈ Rm×d satisfies the (k, δ)-restricted isometry property ((k, δ)-RIP), δ ∈ (0, 1), if … “ and in Section 4.2:  “Let Ak = {(k + f(k), δ)-RIP matrices} for some function f : N → N, and let A ∈ Ak have m rows.”  Examiner notes this is similar to the “regression matrix” described in Spec [0039].)
and wherein the generated label space maintains the dimension representing features yet has labels having a lower dimension as compared with the initial dimension representing labels [with the least relevant labels being filtered out] (Hsu, as shown above, in Page 3 Section 3.1, discloses:  “Specifically, given a sample {(xi, yi)}ni=1, we form a compressed sample {(xi, Ayi)}ni=1”, wherein the xi is unchanged.  See above in algorithms 1 and 2:  “input training data S C X x Rd”, and “test point x e X”.  It is evident that the features have the same dimension.  However, the labels have a lower dimension (“We use A to compress (i.e. reduce the dimension of) the labels Y”), and so Ayi has smaller dimension than yi.)
receiving a text data point from a user (Hsu, Page 11 Section 7.1 discloses receiving data from users for text data points (web pages):  “Text data. The second data set was collected by Tsoumakas et al. [11] from del.icio.us, a social bookmarking service in which users assign descriptive textual tags to web pages.”)
generating a first feature vector from the text data point (Hsu, Page 11 Section 7.1, discloses that both the web page, and the label, are vectorized:  “Each web page is represented as a boolean bag-of-words vector, with the vocabulary chosen using a combination of frequency thresholding and χ2 feature ranking. See [11] for details. Each binary label vector (in both data sets) indicates the labels of the corresponding data point.”)
projecting the first feature vector to the label space (Hsu, Page 3 Section 3.1 as shown above, discloses:  “learn a predictor H : X → A(Y) of these compressed labels”.  Here, H projects the feature space into the label space.)
converting [the first set] of labels to a second set of labels of the initial label space (Examiner notes:  Bhatia teaches “the” first set in a subsequent limitation, and actually teaches this entire limitation below.  However, Examiner also points out Hsu teaches a similar limitation, to take a set of labels and convert back to the initial label space.  Hsu, Top of Page 2, discloses:  “Rather, we learn to predict compressed label vectors, and then use sparse reconstruction algorithms to recover uncompressed labels from these predictions.”)
However, Hsu does not explicitly teach on a computing device having at least one processor, storage, and a communication platform connected to a network; filtering out, from labels extracted from training data, duplicate labels and least relevant labels; clustering the filtered training data into a plurality of clusters; wherein a corresponding regression matrix learned in accordance with a random regression model is generated for each of the clusters; generating a label space from an initial label space having a dimension representing features and an initial dimension representing labels by separately training the label space based on each of the plurality of clusters to reduce the initial dimension representing labels; identifying k-closest label vectors associated with the first feature vector; determining a first set of labels associated with the first feature vector from the label space, wherein each of the first set of labels occurs for a pre-determined number of times in the k-closest label vectors; converting the first set of labels to a second set of labels of the initial label space; and displaying the second set of labels together with the data point to annotate the data point with the second set of labels for auto-displaying further content upon detection of a click, wherein the second set of labels are displayed in a different display format compared to the data point.
Niculescu-Mizil teaches on a computing device having at least one processor, storage, and a communication platform connected to a network (Niculescu-Mizil, Para [0026-0028], discloses:  “Each computer program may be tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein … A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus … Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.)
filtering out, from labels extracted from training data, duplicate labels and least relevant labels (Niculescu-Mizil, Para [0023], discloses: “In an embodiment of the present principles, one or more label filters are applied to the original set of labels, prior to the use of the multi-label classifier. These label filters eliminate from consideration labels in a label set that are unlikely to be relevant (corresponds to claimed “filtering out, from labels extracted from training data […] least relevant labels”).”
Niculescu-Mizil is analogous art because it is in the field of endeavor of multi label classification.	It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to apply the removal of irrelevant labels as taught by Niculescu-Mizil to either the training data or the original label space of Lin, the benefit being that “[…] label filters can reduce the prediction times of multi-label classifiers by orders of magnitude without significantly impacting the performance of the multi-label classifiers”, as recited by Niculescu-Mizil at ¶ [0022].
However, the combination of Hsu and Niculescu-Mizil does not teach clustering the filtered training data into a plurality of clusters; separately training the label space based on each of the plurality of clusters to reduce the initial dimension representing labels; wherein a corresponding regression matrix learned in accordance with a random regression model is generated for each of the clusters; identifying k-closest label vectors associated with the first feature vector; determining a first set of labels associated with the first feature vector from the label space, wherein each of the first set of labels occurs for a pre-determined number of times in the k-closest label vectors; converting the first set of labels to a second set of labels of the initial label space; and displaying the second set of labels together with the data point to annotate the data point with the second set of labels for auto-displaying further content upon detection of a click, wherein the second set of labels are displayed in a different display format compared to the data point.
Bhatia teaches clustering the filtered training data into a plurality of clusters (Bhatia, Page 2 Penultimate Paragraph, discloses:  “However, kNN classifiers are known to be slow at prediction. X1 therefore clusters the training data into C clusters, learns a separate embedding per cluster and performs kNN classification within the test point’s cluster alone.”)
wherein a corresponding regression matrix learned in accordance with a random regression model is generated for each of the clusters (Bhatia discloses clusters above as shown and states “learns a separate embedding per cluster.” Bhatia, analogously to Hsu’s regression matrix A of labels (“We use A to compress  … the labels Y”), discloses on Page 4 Section 2.1, “label matrix Y” for which an “embedding matrix Z” is found. Bhatia then states “However, Y can still be accurately modeled using a low-dimensional non-linear manifold That is, instead of preserving distances (or inner products) of a given label vector to all the training points, we attempt to preserve the distance to only a few nearest neighbors.. That is, we wish to find a L-dimensional embedding matrix Z … which minimizes the following objective.” here Bhatia discloses reducing a label matrix Y to a matrix Z using regression (i.e., minimizing an objective function), and specifically non-linear regression. This is further noted below Eq. 2 where it is stated: “we wish to learn a multi-regression model to predict the embeddings Z” and they state that this is “somewhat similar to a few existing methods for non-linear dimensionality reduction that also seek to preserve distances to a few near neighbors.”) [*Note: here showing the interpretation for “random regression model”  taken from Applicants spec [0040] to be “low-dimensional embedding”]
generating a label space from an initial label space having a dimension representing features and an initial dimension representing labels by separately training the label space based on each of the plurality of clusters to reduce the initial dimension representing labels (Bhatia, Paragraph spanning Pages 1-2, discloses:  “Embedding based approaches make training and prediction tractable by reducing the effective number of labels. Given a set of n training points {(xi, yi)ni=1} with d-dimensional feature vectors xi ∈ Rd and L-dimensional label vectors yi ∈ {0, 1}L, state-of-the-art embedding approaches project the label vectors onto a lower                 
                    
                        
                            L
                        
                        ^
                    
                
            -dimensional linear subspace as zi = Uyi, based on a low-rank assumption.”  Here, Bhatia discloses a dimension representing features (“d”) and an initial dimension representing labels (“L”), as well as reducing the dimension representing labels from “L” to “                
                    
                        
                            L
                        
                        ^
                    
                
            ”.  Bhatia discloses that previous methods of doing this are inefficient on Page 2 Para 3:  “Embedding approaches also have some limitations. They are slow at training and prediction even for a small embedding dimension                 
                    
                        
                            L
                        
                        ^
                    
                
            .”  Thus, Bhatia proposes accomplishing this by using k-Nearest-Neighbor on Page 2 Para 5:  “This paper develops the X1 algorithm which extends embedding methods in multiple ways to address these limitations. First, instead of projecting onto a linear low-rank subspace, X1 learns embeddings which non-linearly capture label correlations by preserving the pairwise distances between only the closest (rather than all) label vectors, i. e. d(zi, zj) ≈ d(yi, yj) if i ∈ kNN(j) where d is a distance metric.”  Then, Bhatia discloses that this is accomplished by separately training the label space based on each of the plurality of clusters in Page 2 Para 7:  “However, kNN classifiers are known to be slow at prediction. X1 therefore clusters the training data into C clusters, learns a separate embedding per cluster and performs kNN classification within the test point’s cluster alone.”  Here, Bhatia discloses “learns a separate embedding per cluster” and is thus separately training the label space.)
identifying k-closest label vectors associated with the first feature vector; (Bhatia, Page 2 Para 5, discloses:  “Regressors V are trained to predict zi = Vxi. During prediction, rather than using a decompression matrix, X1 uses a k-nearest neighbour (kNN) classifier in the learnt embedding space, thus leveraging the fact that nearest neighbour distances have been preserved during training. Thus, for a novel point x, the predicted label vector is obtained as 
    PNG
    media_image2.png
    18
    138
    media_image2.png
    Greyscale

Here, Bhatia discloses identifying the KNN label vectors (yi) to the first feature vector (x)).
determining a first set of labels associated with the first feature vector from the label space, wherein each of the first set of labels occurs for a pre-determined number of times in the k-closest label vectors; (Bhatia, Page 3 Section 2, discloses:

    PNG
    media_image3.png
    56
    624
    media_image3.png
    Greyscale

Thus, above Bhatia teaches that each label vector y in the nearest neighbors consists of a list of 0 and 1, wherein the label is set to 1 if it is turned on for that neighbor.  Bhatia then discloses in Algorithm 2:

    PNG
    media_image4.png
    117
    310
    media_image4.png
    Greyscale

Here, in Line 5, Bhatia discloses taking the “Top p” labels.  Thus, each of Bhatia’s predicted labels occur at least as many times as necessary to qualify as being in the “top p” labels.  The number of occurrences necessary is “pre-determined”, since “p” used to figure this out is pre-determined in Bhatia’s algorithm.)
converting the first set of labels to a second set of labels of the initial label space (Bhatia, Top of Page 2, discloses:  “Labels for a novel point x are predicted by post-processing y = U†Vx where U† is a decompression matrix which lifts the embedded label vectors back to the original label space.”  As Examiner notes above, Hsu also does similar.)
Bhatia is analogous art, as it is in the field of multi-label classification.  It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to apply the kNN with clustering as taught by Bhatia to the lower dimension label space generation of Hsu, the benefit of kNN being improvements in both efficiency and accuracy: “The superiority of X1’s proposed embeddings over traditional low-rank embeddings can be determined in two ways. First, as can be seen in Figure 1, the relative approximation error in learning X1’s embeddings is significantly smaller as compared to the low-rank approximation error. Second, X1 can improve over state-of-the-art embedding methods’ prediction accuracy by as much as 35% (absolute) on the challenging WikiLSHTC data set,” as recited by Bhatia Page 2 Para 3-5, and the benefit of using clustering for the kNN is further improvements in efficiency:  “This reduces X1’s prediction costs…for determining the cluster membership of the test point, embedding it and then performing kNN classification respectively, where NC is the number of points in the cluster to which the test point was assigned. X1 can therefore be more than two orders of magnitude faster at prediction
than LEML and other embedding methods on the WikiLSHTC data set,” as recited by Bhatia Page 2 Para 7.  Bhatia also discloses the following about incorporating KNN in Page 2 Para 5:  “Our use of the kNN classifier is also motivated by the observation that kNN outperforms discriminative methods in acutely low training data regimes [17] as in the case of tail labels.”
However, the combination of Hsu, Niculescu-Mizil, and Bhatia does not explicitly teach displaying the second set of labels together with the text data point to annotate the text data point with the second set of labels for auto-displaying further content upon detection of a click, wherein the second set of labels are displayed in a different text color or text font compared to the text data point.
Eskimez teaches displaying the second set of labels together with the text data point to annotate the text data point with the second set of labels for auto-displaying further content upon detection of a click, wherein the second set of labels are displayed in a different text color or text font compared to the text data point. (Recall above that Hsu discloses a text data point.  Eskimez, Page 4 Figure 2, discloses:

    PNG
    media_image5.png
    807
    1051
    media_image5.png
    Greyscale

Eskimez, Page 3, discloses:  “The user can request labels from the automated emotion
classifier by clicking on the “request label” button. The system then shows suggested labels to the user.”  
Thus, Eskimez discloses displaying the set of second labels together with the data point (Eskimez’ data point is audio speech, but recall above Hsu discloses a text web page), in order to annotate the data point with a label. The second set of labels are placed on a display that also includes a button called “Request analysis” that, upon clicking, displays the further content of suggested labels for the user, that correspond to the second set of labels below.  Regarding auto-displaying further content upon detection of a click, Examiner notes that the Specification [0051] states  “In some other embodiments, presenting unit 512 displays the one or more labels ynew in an annotation format that allows auto-displaying further content upon detecting a mouse move or click.”  Above, the annotation “format” is that the second set of labels are displayed on the same interface that also includes a button called “Request analysis”, which displays suggestions to the user (further content) upon a click.  Thus, the labels are displayed in a format that allows auto-displaying further content upon detecting a mouse click on the display on which the second set of labels are shown.)
Eskimez suggests, but does not explicitly teach wherein the second set of labels are displayed in a different text color or text font compared to the text data point.  (Eskimez, Figure 2 shown above, shows that the second set of labels are in a white color text, with a unique icon next to it.  Thus the second set of labels are each displayed in a unique format.  However, while the default color of text is typically black, Eskimez’ data point is not in text form, so it is not explicitly disclosed here that the labels are a different text color or text font from Eskimez (or Hsu, who above discloses labelling text web pages, but does not explicitly recite a text font or color).
Eskimez is analogous art because it is in the field of endeavor of active learning for machine learning classification.  It would have been obvious before the effective filing date of the claimed invention to combine the multi label classification of Hsu, Niculescu-Mizil, and Bhatia with the user-assisted labeling of Eskimez.  One of ordinary skill in the art would be motivated to do so in order to perform training in a less costly manner (Eskimez, Abstract:  “However, manual classification of emotions from speech is time consuming  … The user can then adjust the emotion label if the system classification of the emotion does not agree with the user’s perception, and this updated label is then fed back into the system to retrain the models. In this way, WISE enables the emotion classification models to be adapted over time … We evaluate the benefit of the user feedback enabled by WISE in situations where manually classifying emotions in a large dataset is costly, yet trained models alone will not be able to accurately classify the data.”)
Murphy-Chutorian teaches wherein the second set of labels are displayed in a different text color or text font compared to the text data point.  (Murphy-Chutorian, Para [0061], discloses:  “In some implementations, suggested labels are provided for display in the user interface, by system 102 in an arrangement or manner that is based on a ranking of suggested labels according to one or more suggested criteria. For example, if label "#AutoBackUp" is associated more frequently with given photos than the label "#Email," system 102 may cause "#AutoBackUp" to display in a higher position in a list of suggested labels in a user interface. In some implementations, system 102 may indicate a higher rank for a given suggested label by changing the manner of presentation of the given suggested label (e.g., highlight, underline, change color, or the like).”  Here, Murphy-Chutorian discloses changing the font (“underline”) or color (“change color”) of a suggested label.  If some of the labels’ font or color are different from others, as suggested by Murphy-Chutorian, then at least some of them would then be different from the original text color of any text data point.  Recall above that Hsu discloses a text data point (web page)).
Murphy-Chutorian is analogous art because the problem faced by Murphy (labeling of data) is reasonably pertinent to the problem faced by the inventor and also the combination of Hsu, Niculescu-Mizil, Bhatia, and Eskimez.  It would have been obvious before the effective filing date of the claimed invention to combine the multi label classification with user assisted labeling of Hsu, Niculescu-Mizil,  Bhatia, and Eskimez with the suggested label font adjustment of Murphy-Chutorian.  One of ordinary skill in the art would be motivated to do so in order to achieve better labeling results by pushing the user toward better label choices (Murphy [0061]:  “In some implementations, system 102 may indicate a higher rank for a given suggested label by changing the manner of presentation of the given suggested label (e.g., highlight, underline, change color, or the like”)).

	As per Claim 2, the combination of Hsu, Niculescu-Mizil, Bhatia, Eskimez, and Murphy-Chutorian teaches the method of claim 1. Hsu teaches 
wherein generating the label space further comprises: 
obtaining a plurality of data samples from at least a knowledge base (Hsu, Page 11 Section 7.1 discloses receiving data from users:  “The first data set was collected by the ESP Game [26], an online game in which players ultimately provide word tags for a diverse set of web images.  The set contains nearly 68000 images, with about 22000 unique labels”)
generating a plurality of feature vectors respectively associated with the plurality of data samples; (Hsu, Page 11 Section 7.1, discloses:  “We represented each image as a bag-of-features vector in a manner similar to [27].”)
extracting one or more labels associated with the plurality of feature vectors (Hsu, Page 11 Section 7.1, discloses:  “We retained just the 1000 most frequent labels: the least frequent of these occurs 39 times in the data, and the most frequent occurs about 12000 times. Each image contains about four labels on average.”)
However, Hsu does not teach generating a first label matrix based on the plurality of feature vectors and the one or more labels; transforming the first label matrix to a second label matrix; training one or more parameters associated with the second label matrix; and generating the label space based on the second label matrix and the trained one or more parameters.
Bhatia teaches generating a first label matrix based on the plurality of feature vectors and the one or more labels; (Bhatia, Page 3 Section 2, discloses:  “Let D = {(x1, y1) . . . (xn, yn)} be the given training data set, xi ∈ X ⊆ Rd be the input feature vector, yi ∈ Y ⊆ {0, 1}L be the corresponding label vector, and yij = 1 iff the j-th label is turned on for xi. Let X = [x1, . . . , xn] be the data matrix and Y = [y1, . . . , yn] be the label matrix.”  Here, Bhatia discloses a first label matrix Y based on the feature vectors and labels (“yij = 1 iff the j-th label is turned on for xi”)).
transforming the first label matrix to a second label matrix (Bhatia, Page 3 Section 2 Para 2, discloses:  “Our algorithm is an embedding-style algorithm, i.e., during training we map the label vectors yi to L-hat-dimensional vectors zi ∈ R L-hat and learn a set of regressors V ∈ R L-hat ×d s.t. zi ≈ V xi, ∀i.”  Here, Bhatia discloses that the vectors yi that make up the first label vector Y are transformed into vectors zi.  This is also shown in Page 4 Section 2.1:  “However, Y can still be accurately modeled using a low-dimensional non-linear manifold. That is, instead of preserving distances (or inner products) of a given label vector to all the training points, we attempt to preserve the distance to only a few nearest neighbors. That is, we wish to find a L-hat-dimensional embedding matrix Z = [z1, . . . , zn] ∈ R L-hat ×n which minimizes the following objective”.  Here, Bhatia shows that the first label matrix Y is transformed into a second label matrix Z.)
training one or more parameters associated with the second label matrix (Bhatia, Page 4 Between Eq 2 and 3:  “Now, given the embeddings Z = [z1, . . . , zn] ∈ R L-hat ×n, we wish to learn a multi-regression model to predict the embeddings Z using the input features.”  Here, Bhatia states to “learn a multi-regression model” to predict Z, and learning a model includes training parameters.)
and generating the label space based on the second label matrix and the trained one or more parameters (As shown above in Bhatia Page 3 Section 2, the matrix Z (that has been learned based on the parameters), represents a label space that is smaller than original label space Y, reducing the dimension from L to L-hat.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Hsu and Niculescu-Mizli with those of Bhatia for at least the reasons recited in the rejection to Claim 1.

As per Claim 3, the combination of Hsu, Niculescu-Mizil, Bhatia, Eskimez, and Murphy-Chutorian teaches the method of claim 2. Bhatia teaches wherein each element of the first label matrix indicates a relation as to whether one of the plurality of feature vectors is annotated by one of the one or more labels (Bhatia, Page 3 Section 2, discloses:  “Let D = {(x1, y1) . . . (xn, yn)} be the given training data set, xi ∈ X ⊆ Rd be the input feature vector, yi ∈ Y ⊆ {0, 1}L be the corresponding label vector, and yij = 1 iff the j-th label is turned on for xi.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Hsu and Niculescu-Mizli with those of Bhatia for at least the reasons recited in the rejection to Claim 1.

As per Claim 4, the combination of Hsu, Niculescu-Mizil, Bhatia, Eskimez, and Murphy-Chutorian teaches the method of claim 2 as well as transforming the first label matrix to a second label matrix (see Rejection to Claim 2). Hsu teaches wherein transforming the first label matrix to a second label matrix further comprises: performing dimensionality reduction on the first label matrix based on random projection (Hsu, Page 12 Section 7.3, discloses:  “The compression functions we used were generated by selecting m random rows of the 1024×1024 Hadamard matrix, for m ∈ {100, 200, 300, 400}.”  Here, Hsu discloses dimensionality reduction (“compression”) based on random projection, using random rows of the Hadamard matrix.)
However, Hsu does not explicitly teach wherein a first dimension of the first label matrix representing a number of labels is reduced to a pre-determined value in the second label matrix
Bhatia teaches wherein a first dimension of the first label matrix representing a number of labels is reduced to a pre-determined value in the second label matrix. (Bhatia, as shown above, reduces the dimension of the first label matrix L to the lower dimension of the second label matrix L-hat.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Hsu and Niculescu-Mizli with those of Bhatia for at least the reasons recited in the rejection to Claim 1.

As per Claim 5, the combination of Hsu, Niculescu-Mizil, Bhatia, Eskimez, and Murphy-Chutorian teaches the method of claim 2. Bhatia teaches wherein the one or more parameters associated with the second label matrix is trained by a least square regression model. (Bhatia, Page 4 Section 2.1, discloses:  “That is, we wish to find a bL-dimensional embedding matrix Z = [z1, . . . , zn] ∈ R L-hat×n which minimizes the following objective:

    PNG
    media_image6.png
    52
    389
    media_image6.png
    Greyscale

As shown above, Bhatia discloses the minimization of the squared error to learn the parameters of the second label matrix Z.  This is a least squares regression model.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Hsu and Niculescu-Mizli with those of Bhatia for at least the reasons recited in the rejection to Claim 1.

As per Claim 6, the combination of Hsu, Niculescu-Mizil, Bhatia, Eskimez, and Murphy-Chutorian teaches the method of claim 2. Bhatia teaches wherein the first feature vector is projected to the label space using the one or more parameters associated with the second label matrix. (Bhatia, Page 3 Section 2, discloses:  “Let D = {(x1, y1) . . . (xn, yn)} be the given training data set, xi ∈ X ⊆ Rd be the input feature vector, yi ∈ Y ⊆ {0, 1}L be the corresponding label vector, and yij = 1 iff the j-th label is turned on for xi.”  Here, Bhatia discloses projecting the feature vector to the initial label space Y.  Bhatia above, shows that the first label matrix Y is replaced with the dimension reduced second label matrix Z, as shown in Page 4 Section 2.1:  “However, Y can still be accurately modeled using a low-dimensional non-linear manifold. That is, instead of preserving distances (or inner products) of a given label vector to all the training points, we attempt to preserve the distance to only a few nearest neighbors. That is, we wish to find a L-hat-dimensional embedding matrix Z = [z1, . . . , zn] ∈ R bL ×n which minimizes the following objective”.  Thus, the feature vector x is projected to the label space using the parameters associated with second matrix Z instead of first matrix Y.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Hsu and Niculescu-Mizli with those of Bhatia for at least the reasons recited in the rejection to Claim 1.

As per Claim 7, the combination of Hsu, Niculescu-Mizil, Bhatia, Eskimez, and Murphy-Chutorian teaches the method of claim 1. Bhatia teaches wherein determining a first set of labels associated with the first feature vector from the label space further comprises: 
selecting a pre-determined number of candidates from the label space using k-nearest neighbor learning; (Bhatia, Page 4 Algorithm 2, discloses:

    PNG
    media_image7.png
    125
    311
    media_image7.png
    Greyscale

(Here, in Line 3, Bhatia discloses a pre-determined number of candidates (“nearest neighbors”) from the label space (“Nz <- ñ nearest neighbors of z in ZT”)).
computing an empirical distribution for each of the pre-determined number of candidates; (Bhatia, Page 4 Algorithm 2 above, discloses in Line 4:  “Empirical label dist. for points e Nz.”) 
and determining the first set of labels based on the computed empirical distributions (Bhatia, Page 4 Algorithm 2 above, discloses in Line 5:  “ypred <- Topp(Pz)”.  Here, Bhatia discloses that the predicted label vector is based on the top values of the empirical distribution.) 

	Claims 8-14 are system claims corresponding to method Claims 1-7, and are rejected for similar reasons.

	Claims 15-19 and 20 are non-transitory machine-readable medium claims corresponding to method Claims 1-5 and 7, respectively, and are rejected for similar reasons.

	

As per Claim 21, the combination of Hsu, Niculescu-Mizil, Bhatia, Eskimez, and Murphy-Chutorian teaches the method of claim 1. Niculescu-Mizil teaches further comprising: automatically tagging a document with the determined first set of labels; (Niculescu-Mizil, Para [0070], discloses:  “At block 370, the subset of labels or tags is sent to a classifier. The classifier evaluates the subset of labels and tags to determine which labels or tags are relevant to the objects being queried. The classifier then assigns the relevant labels or tags to each of the queried objects.”)
and providing the tagged document to the user (Niculescu-Mizil, Para [0071], discloses: “At block 380, based on the labels or tags assigned to each of the objects, objects relevant to a user are determined. In an embodiment, user-oriented is created from this determination, wherein the user-oriented content includes content consisting of at least one object determined to be relevant to a user. In an embodiment, the user-oriented content is displayed to a user”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Niculescu-Mizli with those of Hsu for at least the reasons recited in the rejection to Claim 1.



Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANN J LO whose telephone number is (571)272-9767. The examiner can normally be reached Monday-Friday, 9 AM to 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Cordelia Zecher can be reached at 571-272-7771. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ANN J LO/Supervisory Patent Examiner, Art Unit 2159
Read full office action
Prosecution Timeline

Show 35 earlier events
Jul 18, 2024
Response after Non-Final Action
Jul 18, 2024
Response after Non-Final Action
Apr 29, 2025
Response after Non-Final Action
Jun 27, 2025
Request for Continued Examination
Jul 07, 2025
Response after Non-Final Action
Jul 30, 2025
Non-Final Rejection mailed — §103
Oct 30, 2025
Response Filed
Apr 01, 2026
Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/772,925
Patent 12614108
METHOD FOR RECOMMENDING INFORMATION, RECOMMENDATION SERVER, AND STORAGE MEDIUM
4y 0m to grant Granted Apr 28, 2026
17/442,987
Patent 12602228
NEUROMORPHIC PROCESSOR AND NEUROMORPHIC PROCESSING METHOD
4y 6m to grant Granted Apr 14, 2026
17/486,770
Patent 12602597
AUTOMATIC DISCOVERY OF MACHINE LEARNING MODEL FEATURES
4y 6m to grant Granted Apr 14, 2026
17/987,874
Patent 12566146
INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, NON-TRANSITORY COMPUTER READABLE MEDIA STORING PROGRAM, AND X-RAY ANALYSIS APPARATUS
3y 3m to grant Granted Mar 03, 2026
17/696,380
Patent 12541710
COMPUTERIZED SYSTEMS AND METHODS FOR USER ACTION PREDICTION
3y 10m to grant Granted Feb 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

13-14
Expected OA Rounds
45%
Grant Probability
71%
With Interview (+26.7%)
4y 7m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 224 resolved cases by this examiner. Grant probability derived from career allowance rate.
METHOD AND SYSTEM FOR MULTI-LABEL PREDICTION

Interview Optional

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Precedent Cases

Applications granted by this same examiner with similar technology

Strategy Recommendation AI-generated — please review before filing

Prosecution Projections

Ready to respond to this office action?

Sign in with your work email