DETAILED ACTION
This action is responsive to claims filed on 2 December 2022.
Claims 1-20 are pending for examination.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 14 is objected to because of the following informalities: “The method of claim 11” in line 1 should be “The method of claim 13”. Appropriate correction is required.
Claim 15 is objected to because of the following informalities: "The method of claim 15" in line 1 should be "The method of claim 1"; and "one learning model" in line 1 should be "one machine learning model". Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 4-13, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Jiang et al. (NPL: "BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online E-Commerce Search", hereinafter "Jiang"), in view of Ji et al. (U.S. Pre-Grant Publication No. 2011/0029517, hereinafter "Ji").
Regarding claim 1 and analogous claim 20, Jiang teaches A computer-implemented method of training a machine learning model to rank in-use digital objects, a given in-use digital object generated using a respective in-use search query, the method executable by a processor, the method comprising:
receiving, by the processor, a second plurality of training digital objects, a given one of the second plurality of training digital objects being associated with: (i) a respective training search query used for generating the given one of the second plurality of training digital objects; and (ii) a respective first assessor-generated label indicative of how relevant, to the respective training search query, the given one of the second plurality of training digital objects is as perceived by a respective human assessor that has assigned the first respective assessor-generated label ([C. Model Structure, pg. 215] BERT embeds a rich hierarchy of linguistic signals: surface information at the bottom, syntactic information in the middle, and semantic information at the top [24]. Large model capacity and expressive power make BERT an ideal choice for the teacher model. To better align with the e-commerce relevance classification task, we fine-tune BERT on close to 400k (ii) a respective first assessor-generated label indicative of how relevant, to the respective training search query, the given one of the second plurality of training digital objects is as perceived by a respective human assessor that has assigned the first respective assessor-generated label human labeled query, receiving, by the processor, a second plurality of training digital objects, a given one of the second plurality of training digital objects being associated with: (i) a respective training search query used for generating the given one of the second plurality of training digital objects item pairs. Concretely, we combine the query and item title tokens into a single sentence separated by the special token [SEP], prefixed by the special token [CLS], and padded at the end by [PAD]. We choose token length of 128 which covers more than 95% of all our query, title pairs. 
The model proceeds by converting each token into 768 dimensional embedding, thus a total input size of 98304 per example (query, item pair). At the end, the model tries to minimize the cross entropy loss between the true label (relevant or not) and the output probability.);
training, in a second training phase following the first training phase, based on the second plurality of training digital objects, the machine learning model for determining a respective synthetic assessor-generated label of the given in-use digital object, the respective synthetic assessor-generated label being indicative of how relevant, to the respective in-use search query, the given in-use digital object is as perceived by the respective human assessor if the given in-use digital object is presented to the respective human assessor; applying, by the processor, the machine learning model to the first plurality of training digital objects to augment the given one of the first plurality of training digital objects with the respective synthetic assessor-generated label, thereby generating a first augmented plurality of training digital objects ([C. Model Structure, pg. 216] Single-BERT model often suffers from high variance on its test prediction scores. One way to mitigate that is through ensembling. By neutralizing individual model variance against one another, ensembling also postpones over-fitting on test data. We modify the stacking (ensemble learning) method described in [25] to fit the distillation framework: 1. instead of base learners, training, in a second training phase following the first training phase, based on the second plurality of training digital objects, the machine learning model for determining a respective synthetic assessor-generated label of the given in-use digital object first train multiple base teachers with the same training set D, 2. 
collect the respective synthetic assessor-generated label being indicative of how relevant, to the respective in-use search query, the given in-use digital object is as perceived by the respective human assessor if the given in-use digital object is presented to the respective human assessor prediction scores using the base teachers and applying, by the processor, the machine learning model to the first plurality of training digital objects to augment the given one of the first plurality of training digital objects with the respective synthetic assessor-generated label, thereby generating a first augmented plurality of training digital objects generate new meta data D’ (also called “distilled data” or “transfer set”), 3. train a single meta learner based on the simulated label in D’.; [C. Model Structure, pg. 216] Homogeneous Base Teacher Stacking. Here we first train two BERT-Base models on the same human labeled data, which we name BERT1 and BERT2. While the training methodologies are identical, the two models can differ by virtue of random initialization and data shuffling. In the 2nd stage, prediction scores of BERT1 and BERT2 on user log data are combined by simple averaging to obtain D’. Finally student model is trained directly on the averaged scores in D’. Heterogeneous Base Teacher Stacking. Here we add prediction scores from ERNIE [5], a close competitor of BERT. The main difference is that ERNIE used whole phrase masking in its pre-training step. This helps widening the gap between the two base learners. Now we simply average the prediction scores of all three BERT1, BERT2, and ERNIE together in stage 2, while the rest remains the same.); and
training the machine learning model based on the first augmented plurality of training digital objects to determine a respective relevance parameter of the given in-use digital object, the respective relevance parameter being indicative of how relevant the given in-use digital object is to the respective in-use search query ([C. Model Structure, pg. 216] 2) Student model: fully-connected: The training the machine learning model based on the first augmented plurality of training digital objects to determine a respective relevance parameter of the given in-use digital object, the respective relevance parameter being indicative of how relevant the given in-use digital object is to the respective in-use search query student model (Figure 4) takes aggregated token embeddings as sentence embedding input, and computes the score of a feed-forward Deep Neural Net (DNN). By the universal approximation theorem [26], DNN of just a single hidden layer can approximate any n-variate function under mild constraint. In addition, DNN has far less serving complexity and consumes much less computing resources than CNN, RNN, or Transformers. Therefore we focus on DNN as the student model of choice).
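Examiner's note: by way of illustration only, the two-stage distillation Jiang describes, in which teacher prediction scores over (query, item) pairs are averaged to produce the transfer set D' on which the student is trained, may be sketched as follows. The function and variable names are the examiner's hypothetical choices and form no part of Jiang, Ji, or the claims.

```python
# Illustrative sketch only; names are hypothetical.
def build_transfer_set(pairs, teachers):
    """Average several teacher relevance scores per (query, item) pair
    to produce a distilled transfer set D' in the manner Jiang describes."""
    transfer = []
    for query, item in pairs:
        scores = [teacher(query, item) for teacher in teachers]
        transfer.append((query, item, sum(scores) / len(scores)))
    return transfer

# Toy "teachers" standing in for fine-tuned BERT1 / BERT2 models.
bert1 = lambda q, i: 0.9 if q in i else 0.1
bert2 = lambda q, i: 0.8 if q in i else 0.2

pairs = [("phone", "smart phone case"), ("phone", "garden hose")]
transfer_set = build_transfer_set(pairs, [bert1, bert2])
# A student model would then be trained directly on the averaged scores.
```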
Jiang fails to teach receiving, by the processor, a first plurality of training digital objects, a given one of the first plurality of training digital objects being associated with a past user interaction parameter indicative of user interaction of past users with the given one of the first plurality of training digital objects; training, in a first training phase, based on the first plurality of training digital objects, the machine learning model for determining a respective predicted user interaction parameter of the given in-use digital object, the respective predicted user interaction parameter being indicative of a user interaction of future users with the given in-use digital object;
Ji teaches receiving, by the processor, a first plurality of training digital objects, a given one of the first plurality of training digital objects being associated with a past user interaction parameter indicative of user interaction of past users with the given one of the first plurality of training digital objects ([0030] In accordance with one or more embodiments, a receiving, by the processor, a first plurality of training digital objects training data set includes a plurality of queries, a plurality of feature vectors associated with each query and a label associated with each feature vector. By way of a non-limiting example, each query has a set of search results containing at least one item, or document. As is discussed below, all or a portion, e.g., the first ten, of the documents in a search result set can be considered, and each item considered has an associated feature vector and a label. Each label used in the training data set is provided by a human judge; each label comprises information of a human judge's assessment of the relevance assessment of an item, or document, to a query. Each feature vector comprises a plurality of features and a value for each of the plurality of features. In accordance with one or more embodiments, the feature vector comprises both global and local features. In accordance with one or more embodiments, a given one of the first plurality of training digital objects being associated with a past user interaction parameter indicative of user interaction of past users with the given one of the first plurality of training digital objects features for a query session comprise features extracted using click data for the query session. In accordance with one or more alternate embodiments, the feature vector comprises global features. In accordance with one or more embodiments, various types of click features can be used in the model and aggregated click features can be extracted from user click, or query, sessions.);
training, in a first training phase, based on the first plurality of training digital objects, the machine learning model for determining a respective predicted user interaction parameter of the given in-use digital object, the respective predicted user interaction parameter being indicative of a user interaction of future users with the given in-use digital object ([0024] In accordance with one or more embodiments, training, in a first training phase, based on the first plurality of training digital objects, the machine learning model for determining a respective predicted user interaction parameter of the given in-use digital object click data from a plurality of query sessions is used to train one or more relevance predictor models, and a the respective predicted user interaction parameter being indicative of a user interaction of future users with the given in-use digital object trained relevance predictor model is used to rank items in a search query according to relevance.; [0086] Referring again to FIG. 4, in accordance with one or more embodiments, a query and corresponding result set of documents can be used with one or more models trained during the training phase 402 to generate predictions, or estimates, of the relevance rankings of the documents in the result set.);
Jiang and Ji are considered to be analogous art to the claimed invention because they are in the same field of machine learning. In view of the teachings of Jiang, it would have been obvious for a person of ordinary skill in the art, before the effective filing date of the claimed invention, to apply the teachings of Ji to Jiang in order to explore supervised learning in click data modeling, such that a click model can reliably extract relevance information by calibrating with human relevance judgments (cf. Ji, [0022] In accordance with one or more embodiments disclosed herein, relevance information is extracted from user click data via a global ranking framework; relational information among the documents as manifested by an aggregation of user clicks is used. Experiments on the click data collected from a commercial search engine demonstrate the effectiveness of this approach, and its superior performance over a set of widely used unsupervised methods, such as the cascade model and the heuristic rule based methods. Since user click data is inherently noisy, a supervised approach, which uses human judgment information as part of the training data used to generate a relevance predictor model, provides a degree of reliability over an unsupervised approach. Advantageously, by exploring supervised learning in click data modeling, a click model such as that disclosed in accordance with one or more embodiments can reliably extract relevance information by calibrating with human relevance judgments.).
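Examiner's note: by way of illustration only, the claimed two-phase schedule, first fitting the model to past user-interaction signals and then refining it on assessor-generated labels, may be sketched with a hypothetical one-parameter model. The sketch is the examiner's and is not drawn from either reference.

```python
def train_phase(weight, examples, lr=0.1, epochs=100):
    """Least-squares fit of a one-weight model: predict label = weight * feature."""
    for _ in range(epochs):
        for feature, label in examples:
            weight -= lr * (weight * feature - label) * feature
    return weight

# Phase one: a past user-interaction signal (e.g., a click-through rate of 0.8).
click_examples = [(1.0, 0.8)]
# Phase two: a human assessor relevance label for the same feature.
assessor_examples = [(1.0, 1.0)]

w = train_phase(0.0, click_examples)    # first training phase (interaction data)
w = train_phase(w, assessor_examples)   # second training phase (assessor labels)
```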
Regarding claim 2, Jiang, as modified by Ji, teaches The method of claim 1.
Jiang teaches wherein: the given one of the first plurality of training digital objects includes an indication of a digital document, the digital document being associated with document metadata; and the training the machine learning model, based on the first plurality of training digital objects, further comprises, in the first training phase: converting the document metadata into a text representation thereof comprising tokens ([C. Model Structure, pg. 215] For a sequence converting the document metadata into a text representation thereof comprising tokens embedding of n tokens and embedding dimension d, X ∈ Rd×n, the forward pass of one Transformer block including a self-attention layer Att(.) and a feed-forward layer FF(.) can be formulated as: Att(X) = X + h i=1 Wi OWi V X · σ[(Wi KX) T Wi QX] (2) FF(X) = Att(X) +W2 · ReLU(W1 ·Att(X) +b11T n ) +b21T n (3) where Wi O ∈ Rd×m, Wi V , Wi K, Wi Q ∈ Rm×d, W2 ∈ Rd×r, W1 ∈ Rr×d, b2 ∈ Rd, b1 ∈ Rr, and FF(X) is the output of the Transformer block. The number of heads h and the head size m are two main parameters of the attention layer; and r denotes the hidden layer size of the feed-forward layer. BERT embeds a rich hierarchy of linguistic signals: surface information at the bottom, syntactic information in the middle, and semantic information at the top [24]. Large model capacity and expressive power make BERT an ideal choice for the teacher model. To better align with the e-commerce relevance classification task, we fine-tune BERT on close to 400k human labeled query, item pairs. Concretely, we combine the query and item title tokens into a single sentence separated by the special token [SEP], prefixed by the special token [CLS], and padded at the end by [PAD]. We choose token length of 128 which covers more than 95% of all our query, title pairs. The model proceeds by converting each token into 768 dimensional embedding, thus a total input size of 98304 per example (query, item pair).);
preprocessing the text representation to mask therein a number of masked tokens; and training the machine learning model, based on the first plurality of training digital objects, to determine a given one of the number of masked tokens based on a context provided by neighboring tokens ([C. Model Structure, pg. 216] Heterogeneous Base Teacher Stacking. Here we add prediction scores from ERNIE [5], a close competitor of BERT. The main difference is that ERNIE preprocessing the text representation to mask therein a number of masked tokens used whole phrase masking in its pre-training step. This helps widening the gap between the two base learners.; 2) Student model: fully-connected: The training the machine learning model, based on the first plurality of training digital objects, to determine a given one of the number of masked tokens based on a context provided by neighboring tokens student model (Figure 4) takes aggregated token embeddings as sentence embedding input, and computes the score of a feed-forward Deep Neural Net (DNN). By the universal approximation theorem [26], DNN of just a single hidden layer can approximate any n-variate function under mild constraint. In addition, DNN has far less serving complexity and consumes much less computing resources than CNN, RNN, or Transformers. Therefore we focus on DNN as the student model of choice.); and
wherein the respective relevance parameter of the given in-use digital object is further indicative of a semantic relevance parameter, the semantic relevance parameter being indicative of how semantically relevant the respective in-use search query is to a content of the given in-use digital object ([5) Online experiments:, pg. 219] Table III shows online A/B test metrics of BERT2DNN applied as a relevance filter to the baseline production search results, over a period of 2 weeks in JD.com. Our A/B test is conducted on three sort options, i.e. sale sort, price sort and default sort. As we stated, in such ranking options, relevance problem is much easier to reveal. Since relevance is often not well-aligned with user click/purchase preference (see section IV-B), we added user behavioral data from each mode. This had little impact on the offline relevance metrics, but ensured good recall on historically clicked/purchased items. When the user issues a query in either search mode, production search results with model scores below a pre-chosen threshold are filtered out. In all search modes the filtering model wherein the respective relevance parameter of the given in-use digital object is further indicative of a semantic relevance parameter, the semantic relevance parameter being indicative of how semantically relevant the respective in-use search query is to a content of the given in-use digital object produces generally positive behavioral trend, as seen by user view value (UVvalue), unique user conversion rate (UCVR) and unique user click through rate (UCTR).).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
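Examiner's note: by way of illustration only, preprocessing a text representation to mask a number of tokens, so that a model can be trained to recover each masked token from its neighboring context, may be sketched as follows. The names are the examiner's hypothetical choices, not part of Jiang's disclosure.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a fraction of non-special tokens with [MASK]; return the
    masked sequence plus a map from position to the original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if token not in ("[CLS]", "[SEP]", "[PAD]") and rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

tokens = "[CLS] red running shoes [SEP] nike shoe size 42 [PAD]".split()
masked, targets = mask_tokens(tokens, mask_rate=0.5)
# A model would then be trained to predict targets[i] from the unmasked context.
```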
Regarding claim 4, Jiang, as modified by Ji, teaches The method of claim 1.
Ji teaches further comprising determining the past user interaction parameter associated with the given one of the first plurality of training digital objects based on click data of the past users ([0031] The FrequencyRank feature, which identifies the rank of the document in a list of the documents sorted by the determining the past user interaction parameter associated with the given one of the first plurality of training digital objects based on click data of the past users number of clicks associated with each of the documents, is one non-limiting example of a global feature. Some of the features in the table shown in FIG. 2 are independent of temporal information of the clicks, such as and without limitation Position, Frequency and FrequencyRank, features, such as IsNextClicked, IsPreviousClicked, IsAboveClicked, and IsBelowClicked, rely on their surrounding documents and the click sequences, and features, such as and without limitation ClickRank and ClickDuration, have a temporal aspect.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
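Examiner's note: by way of illustration only, a click-count-based global feature in the spirit of Ji's FrequencyRank (the rank of a document in a list sorted by number of clicks) may be computed as follows. FrequencyRank is Ji's term; the implementation below is the examiner's hypothetical sketch.

```python
from collections import Counter

def frequency_rank(click_log):
    """Rank documents by total past clicks; rank 1 is the most-clicked."""
    counts = Counter(click_log)
    ordered = [doc for doc, _ in counts.most_common()]
    return {doc: rank + 1 for rank, doc in enumerate(ordered)}

# Toy click log: each entry is one past user's click on a document.
ranks = frequency_rank(["d1", "d2", "d1", "d3", "d1", "d2"])
```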
Regarding claim 5, Jiang, as modified by Ji, teaches The method of claim 4.
Ji teaches wherein the click data includes data of at least one click of at least one past user made in response to submitting the respective training search query associated with the given one of the first plurality of training digital objects ([0024] In accordance with one or more embodiments, click data includes data of at least one click of at least one past user made in response to submitting the respective training search query associated with the given one of the first plurality of training digital objects click data from a plurality of query sessions is used to train one or more relevance predictor models, and a trained relevance predictor model is used to rank items in a search query according to relevance. In accordance with one or more embodiments, global feature vectors extracted from the training data, which takes into account click data sequences between items in a query session, is used. In accordance with one or more embodiments, a feature vector includes values extracted from training data, and the training data comprises click data corresponding to search result items.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 6, Jiang, as modified by Ji, teaches The method of claim 1.
Jiang teaches further comprising, prior to the training the machine learning model to determine the respective relevance parameter of the given in-use digital object:
receiving, by the processor, a third plurality of training digital objects, a given one of the third plurality of training digital objects being associated with ([A. Basic Distillation Data, pg. 217] We prepare 4 sets of in-house data, all of which are divided into 90% training and 10% test set, based on queries, to avoid cross-leakage • Human labeled 380k (query, item) pairs with original relevance label in the range of 1 - 5. Editors are asked to judge the relevance of query and item and give a relevance score, with 5 as most relevant and 1 as most irrelevant. In our experiment, we binarize the original label, by regarding the relevance score as 1 for origin score of 4-5(positive label), while as 0 for 1-3(negative label). • 2 month of search log with query, item title pairs and user behaviors. The dataset is filtered by either ordered once or displayed 20 times without clicks, totaling 50m. • similar to the above but with more relaxed filtering criteria: clicked once or skipped at least 5 times, totaling 170m. And skipped means shown by not click. • 10 months of search log (query, item title) without any filtering, totally about 2.3b. The receiving, by the processor, a third plurality of training digital objects, a given one of the third plurality of training digital objects being associated with three sets of search log data provide additional data points to the data size comparison experiments in Fig. 6.):
(i) the respective training search query used for generating the given one of the third plurality of training digital objects; and (ii) a respective second assessor-generated label indicative of how relevant, to the respective training search query, the given one of the third plurality of training digital objects is as perceived by the respective human assessor that has assigned the respective second assessor-generated label ([C. Model Structure, pg. 215] BERT embeds a rich hierarchy of linguistic signals: surface information at the bottom, syntactic information in the middle, and semantic information at the top [24]. Large model capacity and expressive power make BERT an ideal choice for the teacher model. To better align with the e-commerce relevance classification task, we fine-tune BERT on close to 400k (ii) a respective second assessor-generated label indicative of how relevant, to the respective training search query, the given one of the third plurality of training digital objects is as perceived by the respective human assessor that has assigned the respective second assessor-generated label human labeled query, (i) the respective training search query used for generating the given one of the third plurality of training digital objects item pairs. Concretely, we combine the query and item title tokens into a single sentence separated by the special token [SEP], prefixed by the special token [CLS], and padded at the end by [PAD]. We choose token length of 128 which covers more than 95% of all our query, title pairs. The model proceeds by converting each token into 768 dimensional embedding, thus a total input size of 98304 per example (query, item pair). At the end, the model tries to minimize the cross entropy loss between the true label (relevant or not) and the output probability.);
training, in a third training phase following the second training phase, based on the third plurality of training digital objects, the machine learning model for determining a respective refined synthetic assessor-generated label of the given in-use digital object, the respective refined synthetic assessor-generated label being indicative of how relevant, to the respective in-use search query, the given in-use digital object is as perceived by the respective human assessor if the given in-use digital object is presented to the respective human assessor; applying, by the processor, the machine learning model to the first augmented plurality of training digital objects to augment a given one of the first augmented plurality of training digital objects with the respective refined synthetic assessor-generated label, thereby generating a second augmented plurality of training digital objects ([C. Model Structure, pg. 216] Single-BERT model often suffers from high variance on its test prediction scores. One way to mitigate that is through ensembling. By neutralizing individual model variance against one another, ensembling also postpones over-fitting on test data. We modify the stacking (ensemble learning) method described in [25] to fit the distillation framework: 1. instead of base learners, training, in a third training phase following the second training phase, based on the third plurality of training digital objects, the machine learning model for determining a respective refined synthetic assessor-generated label of the given in-use digital object first train multiple base teachers with the same training set D, 2. 
collect the respective refined synthetic assessor-generated label being indicative of how relevant, to the respective in-use search query, the given in-use digital object is as perceived by the respective human assessor if the given in-use digital object is presented to the respective human assessor prediction scores using the base teachers and applying, by the processor, the machine learning model to the first augmented plurality of training digital objects to augment a given one of the first augmented plurality of training digital objects with the respective refined synthetic assessor-generated label, thereby generating a second augmented plurality of training digital objects generate new meta data D’ (also called “distilled data” or “transfer set”), 3. train a single meta learner based on the simulated label in D’.; [C. Model Structure, pg. 216] Homogeneous Base Teacher Stacking. Here we first train two BERT-Base models on the same human labeled data, which we name BERT1 and BERT2. While the training methodologies are identical, the two models can differ by virtue of random initialization and data shuffling. In the 2nd stage, prediction scores of BERT1 and BERT2 on user log data are combined by simple averaging to obtain D’. Finally student model is trained directly on the averaged scores in D’. Heterogeneous Base Teacher Stacking. Here we add prediction scores from ERNIE [5], a close competitor of BERT. The main difference is that ERNIE used whole phrase masking in its pre-training step. This helps widening the gap between the two base learners. Now we simply average the prediction scores of all three BERT1, BERT2, and ERNIE together in stage 2, while the rest remains the same.); and
training the machine learning model to determine the respective relevance parameter of the given in-use digital object based on the second augmented plurality of training digital objects
([C. Model Structure, pg. 216] 2) Student model: fully-connected: The training the machine learning model to determine the respective relevance parameter of the given in-use digital object based on the second augmented plurality of training digital objects student model (Figure 4) takes aggregated token embeddings as sentence embedding input, and computes the score of a feed-forward Deep Neural Net (DNN). By the universal approximation theorem [26], DNN of just a single hidden layer can approximate any n-variate function under mild constraint. In addition, DNN has far less serving complexity and consumes much less computing resources than CNN, RNN, or Transformers. Therefore we focus on DNN as the student model of choice).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 7, Jiang, as modified by Ji, teaches The method of claim 6.
Jiang teaches wherein a given one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects is at least partially different from any other one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects ([A. Basic Distillation Data, pg. 217] We prepare 4 sets of in-house data, all of which are divided into 90% training and 10% test set, based on queries, to avoid cross-leakage • Human labeled 380k (query, item) pairs with original relevance label in the range of 1 - 5. Editors are asked to judge the relevance of query and item and give a relevance score, with 5 as most relevant and 1 as most irrelevant. In our experiment, we binarize the original label, by regarding the relevance score as 1 for origin score of 4-5(positive label), while as 0 for 1-3(negative label). • 2 month of search log with query, item title pairs and user behaviors. The dataset is filtered by either ordered once or displayed 20 times without clicks, totaling 50m. • similar to the above but with more relaxed filtering criteria: clicked once or skipped at least 5 times, totaling 170m. And skipped means shown by not click. • 10 months of search log (query, item title) without any filtering, totally about 2.3b. The wherein a given one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects is at least partially different from any other one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects three sets of search log data provide additional data points to the data size comparison experiments in Fig. 6.; [1) Ensemble model as teacher network:, pg. 218] To assess the impact of teacher ensembling on student model quality, we compare single teacher, as well as homogeneous and heterogeneous teacher ensembles. In both ensemble settings, we simply take the average of the various teacher prediction scores. In the homogeneous case, we fine-tuned two BERT-Base models on the human label data in an identical manner. The difference of the final teacher models comes mainly from the randomization of the training data. In the heterogeneous case, we added another teacher based on ERNIE [5], which performs whole phrase masking during the pre-training stage. By itself the ERNIE teacher model performs similarly with BERT teacher. It is not surprising that averaging teacher predictions provide better student quality, since the teacher predictions themselves perform better on test when averaged. The addition of a heterogeneous teacher source, ERNIE, however, dramatically improves the student quality, on top of existing homogeneous ensemble. This suggests that the diversity of teacher labels plays an important role in distillation quality.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 8, Jiang, as modified by Ji, teaches The method of claim 6.
Jiang teaches wherein a given one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects is of a greater size than a subsequent respective one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects ([A. Basic Distillation Data, pg. 217] We prepare 4 sets of in-house data, all of which are divided into 90% training and 10% test set, based on queries, to avoid cross-leakage • Human labeled 380k (query, item) pairs with original relevance label in the range of 1 - 5. Editors are asked to judge the relevance of query and item and give a relevance score, with 5 as most relevant and 1 as most irrelevant. In our experiment, we binarize the original label, by regarding the relevance score as 1 for origin score of 4-5(positive label), while as 0 for 1-3(negative label). • 2 month of search log with query, item title pairs and user behaviors. The dataset is filtered by either ordered once or displayed 20 times without clicks, totaling 50m. • similar to the above but with more relaxed filtering criteria: clicked once or skipped at least 5 times, totaling 170m. And skipped means shown by not click. • 10 months of search log (query, item title) without any filtering, totally about 2.3b. The wherein a given one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects is of a greater size than a subsequent respective one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects three sets of search log data provide additional data points to the data size comparison experiments in Fig. 6.; [1) Ensemble model as teacher network:, pg. 
218] To assess the impact of teacher ensembling on student model quality, we compare single teacher, as well as homogeneous and heterogeneous teacher ensembles. In both ensemble settings, we simply take the average of the various teacher prediction scores. In the homogeneous case, we fine-tuned two BERT-Base models on the human label data in an identical manner. The difference of the final teacher models comes mainly from the randomization of the training data. In the heterogeneous case, we added another teacher based on ERNIE [5], which performs whole phrase masking during the pre-training stage. By itself the ERNIE teacher model performs similarly with BERT teacher. It is not surprising that averaging teacher predictions provide better student quality, since the teacher predictions themselves perform better on test when averaged. The addition of a heterogeneous teacher source, ERNIE, however, dramatically improves the student quality, on top of existing homogeneous ensemble. This suggests that the diversity of teacher labels plays an important role in distillation quality.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 9, Jiang, as modified by Ji, teaches The method of claim 1.
Jiang teaches further comprising, after the training the machine learning model to determine the respective relevance parameter of the given in-use digital object ([A. Relevance Quality on In-house Dataset, pg. 217] We assess BERT2DNN framework with our in-house data and compare with some existing deep relevance models trained on CTR/CVR dataset. Here’s a brief description of benchmark models. DRFC [6]: Deep relevance model using fully-connected network. The model was trained in a 2 stages manner. after the training the machine learning model to determine the respective relevance parameter of the given in-use digital object First, a Siamese pairwise model was trained with shared parameters using click ratios as labels. The click model is then fine-tuned in a pointwise manner with 90% of the 380k editor labeled data.):
receiving, by the processor, a third plurality of training digital objects, a given one of the third plurality of training digital objects being associated with ([A. Basic Distillation Data, pg. 217] We prepare 4 sets of in-house data, all of which are divided into 90% training and 10% test set, based on queries, to avoid cross-leakage • Human labeled 380k (query, item) pairs with original relevance label in the range of 1 - 5. Editors are asked to judge the relevance of query and item and give a relevance score, with 5 as most relevant and 1 as most irrelevant. In our experiment, we binarize the original label, by regarding the relevance score as 1 for origin score of 4-5(positive label), while as 0 for 1-3(negative label). • 2 month of search log with query, item title pairs and user behaviors. The dataset is filtered by either ordered once or displayed 20 times without clicks, totaling 50m. • similar to the above but with more relaxed filtering criteria: clicked once or skipped at least 5 times, totaling 170m. And skipped means shown by not click. • 10 months of search log (query, item title) without any filtering, totally about 2.3b. The receiving, by the processor, a third plurality of training digital objects, a given one of the third plurality of training digital objects being associated with three sets of search log data provide additional data points to the data size comparison experiments in Fig. 6.):
(i) the respective training search query used for generating the given one of the third plurality of training digital objects; and (ii) a respective second assessor-generated label indicative of how relevant, to the respective training search query, the given one of the third plurality of training digital objects is as perceived by the respective human assessor that has assigned the respective second assessor-generated label ([C. Model Structure, pg. 215] BERT embeds a rich hierarchy of linguistic signals: surface information at the bottom, syntactic information in the middle, and semantic information at the top [24]. Large model capacity and expressive power make BERT an ideal choice for the teacher model. To better align with the e-commerce relevance classification task, we fine-tune BERT on close to 400k (ii) a respective second assessor-generated label indicative of how relevant, to the respective training search query, the given one of the third plurality of training digital objects is as perceived by the respective human assessor that has assigned the respective second assessor-generated label human labeled query, (i) the respective training search query used for generating the given one of the third plurality of training digital objects item pairs. Concretely, we combine the query and item title tokens into a single sentence separated by the special token [SEP], prefixed by the special token [CLS], and padded at the end by [PAD]. We choose token length of 128 which covers more than 95% of all our query, title pairs. The model proceeds by converting each token into 768 dimensional embedding, thus a total input size of 98304 per example (query, item pair). At the end, the model tries to minimize the cross entropy loss between the true label (relevant or not) and the output probability.);
training, based on the third plurality of training digital objects, the machine learning model to determine a respective refined relevance parameter of the given in-use digital object, the respective refined relevance parameter being indicative of how relevant the given in-use digital object is to the respective in-use search query ([C. Model Structure, pg. 216] 2) Student model: fully-connected: The training, based on the third plurality of training digital objects, the machine learning model to determine a respective refined relevance parameter of the given in-use digital object student model (Figure 4) takes aggregated token embeddings as sentence embedding input, and computes the score of a feed-forward Deep Neural Net (DNN). By the universal approximation theorem [26], DNN of just a single hidden layer can approximate any n-variate function under mild constraint. In addition, DNN has far less serving complexity and consumes much less computing resources than CNN, RNN, or Transformers. Therefore we focus on DNN as the student model of choice; [V. Experiments, pg. 218] From the results shown in Table I, the BERT/ERNIE Teacher model achieves the overall best results. DRFC and DRDD represent previous best performers under feed-forward network architecture, but they are significantly worse than the BERT/ERNIE. We present the result of BERT2DOT (using deep-dot model as student model) and three versions of BERT2DNN trained with different distillation data sizes (50m, 170m, 2.3b). We track five the respective refined relevance parameter being indicative of how relevant the given in-use digital object is to the respective in-use search query relevance metrics: Accuracy, Precision, Recall, F1-score and AUC. The first four metrics are denoted as Acc, Pr, Rc, F1, respectively. 
As shown in Table I, as the amount of distillation examples increases, the model also performs better on test and at 2.3b examples gets close to BERT teacher model(AUC: -0.99%, Acc: -0.67%, F1: -0.24%), but at a much lower inference cost.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 10, Jiang, as modified by Ji, teaches The method of claim 9.
Jiang teaches wherein a given one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects is at least partially different from any other one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects ([A. Basic Distillation Data, pg. 217] We prepare 4 sets of in-house data, all of which are divided into 90% training and 10% test set, based on queries, to avoid cross-leakage • Human labeled 380k (query, item) pairs with original relevance label in the range of 1 - 5. Editors are asked to judge the relevance of query and item and give a relevance score, with 5 as most relevant and 1 as most irrelevant. In our experiment, we binarize the original label, by regarding the relevance score as 1 for origin score of 4-5(positive label), while as 0 for 1-3(negative label). • 2 month of search log with query, item title pairs and user behaviors. The dataset is filtered by either ordered once or displayed 20 times without clicks, totaling 50m. • similar to the above but with more relaxed filtering criteria: clicked once or skipped at least 5 times, totaling 170m. And skipped means shown by not click. • 10 months of search log (query, item title) without any filtering, totally about 2.3b. The wherein a given one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects is at least partially different from any other one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects three sets of search log data provide additional data points to the data size comparison experiments in Fig. 6.; [1) Ensemble model as teacher network:, pg. 
218] To assess the impact of teacher ensembling on student model quality, we compare single teacher, as well as homogeneous and heterogeneous teacher ensembles. In both ensemble settings, we simply take the average of the various teacher prediction scores. In the homogeneous case, we fine-tuned two BERT-Base models on the human label data in an identical manner. The difference of the final teacher models comes mainly from the randomization of the training data. In the heterogeneous case, we added another teacher based on ERNIE [5], which performs whole phrase masking during the pre-training stage. By itself the ERNIE teacher model performs similarly with BERT teacher. It is not surprising that averaging teacher predictions provide better student quality, since the teacher predictions themselves perform better on test when averaged. The addition of a heterogeneous teacher source, ERNIE, however, dramatically improves the student quality, on top of existing homogeneous ensemble. This suggests that the diversity of teacher labels plays an important role in distillation quality.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 11, Jiang, as modified by Ji, teaches The method of claim 9.
Jiang teaches wherein a given one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects is of a greater size than a subsequent respective one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects ([A. Basic Distillation Data, pg. 217] We prepare 4 sets of in-house data, all of which are divided into 90% training and 10% test set, based on queries, to avoid cross-leakage • Human labeled 380k (query, item) pairs with original relevance label in the range of 1 - 5. Editors are asked to judge the relevance of query and item and give a relevance score, with 5 as most relevant and 1 as most irrelevant. In our experiment, we binarize the original label, by regarding the relevance score as 1 for origin score of 4-5(positive label), while as 0 for 1-3(negative label). • 2 month of search log with query, item title pairs and user behaviors. The dataset is filtered by either ordered once or displayed 20 times without clicks, totaling 50m. • similar to the above but with more relaxed filtering criteria: clicked once or skipped at least 5 times, totaling 170m. And skipped means shown by not click. • 10 months of search log (query, item title) without any filtering, totally about 2.3b. The wherein a given one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects is of a greater size than a subsequent respective one of the first plurality of training digital objects, the second plurality of training digital objects, and the third plurality of training digital objects three sets of search log data provide additional data points to the data size comparison experiments in Fig. 6.; [1) Ensemble model as teacher network:, pg. 
218] To assess the impact of teacher ensembling on student model quality, we compare single teacher, as well as homogeneous and heterogeneous teacher ensembles. In both ensemble settings, we simply take the average of the various teacher prediction scores. In the homogeneous case, we fine-tuned two BERT-Base models on the human label data in an identical manner. The difference of the final teacher models comes mainly from the randomization of the training data. In the heterogeneous case, we added another teacher based on ERNIE [5], which performs whole phrase masking during the pre-training stage. By itself the ERNIE teacher model performs similarly with BERT teacher. It is not surprising that averaging teacher predictions provide better student quality, since the teacher predictions themselves perform better on test when averaged. The addition of a heterogeneous teacher source, ERNIE, however, dramatically improves the student quality, on top of existing homogeneous ensemble. This suggests that the diversity of teacher labels plays an important role in distillation quality.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 12, Jiang, as modified by Ji, teaches The method of claim 9.
Jiang teaches wherein the third plurality of training objects and the second plurality of training digital objects are the same ([BERT/ERNIE Teacher:, pg. 218] BERT is fine-tuned from pre-trained Chinese base model1, using the 90% training portion of the in-house editor labeled data. ERNIE is wherein the third plurality of training objects and the second plurality of training digital objects are the same fine-tuned on the same data from pre-trained ERNIE 1.0 Chinese base model2.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 13, Jiang, as modified by Ji, teaches The method of claim 1.
Jiang teaches wherein: in the first training phase, the machine learning model is trained to determine a rough initial estimate of the respective relevance parameter of the given in-use digital object; and in each subsequent training phase, the machine learning model is trained to improve the rough initial estimate ([A. Relevance Quality on In-house Dataset, pg. 217] DRFC [6]: Deep relevance model using fully-connected network. The model was trained in a 2 stages manner. First, a Siamese pairwise model was in the first training phase, the machine learning model is trained to determine a rough initial estimate of the respective relevance parameter of the given in-use digital object trained with shared parameters using click ratios as labels. The click model is then in each subsequent training phase, the machine learning model is trained to improve the rough initial estimate fine-tuned in a pointwise manner with 90% of the 380k editor labeled data.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 15, Jiang, as modified by Ji, teaches The method of claim 1.
Jiang teaches wherein the at least one learning model is a transformer-based learning model ([A. Overview, pg. 214] The model setup is similar to the one described in [6], with one major difference. In [6], pairwise training format (query, itema, itemb) was essential to learn from relative user preference signals: e.g., higher CTR or more click counts across two different queries do not necessarily mean more relevant results. In this work, we only need pointwise data since the score produced by wherein the at least one learning model is a transformer-based learning model BERT teacher model has absolute relevance meaning. This allows more compact aggregation of the training data, and avoids query distribution skew introduced by the pairwise expansion. We show the whole knowledge transfer process of the proposed BERT2DNN model in Fig. 3 and illustrate it in detail in the following subsections. First we discuss how texts are converted to embeddings to be consumed by the models. We then describe our teacher and student model architectures in detail, comparing several candidates and techniques. Finally we describe how to transfer knowledge from teacher model to student model with temperature parameters.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 16, Jiang, as modified by Ji, teaches The method of claim 1.
Jiang teaches wherein the machine learning model comprises at least two learning models, and wherein: a first one of the two learning models is trained to determine the respective synthetic assessor-generated label for the given in-use digital object for generating the first augmented plurality of training digital objects; and a second one of the two learning models is trained to determine the respective relevance parameter of the given in-use digital object, based on the first augmented plurality of training digital objects ([A. Overview, pg. 214] The model setup is similar to the one described in [6], with one major difference. In [6], pairwise training format (query, itema, itemb) was essential to learn from relative user preference signals: e.g., higher CTR or more click counts across two different queries do not necessarily mean more relevant results. In this work, we only need a first one of the two learning models is trained to determine the respective synthetic assessor-generated label for the given in-use digital object for generating the first augmented plurality of training digital objects pointwise data since the score produced by BERT teacher model has absolute relevance meaning. This allows more compact aggregation of the training data, and avoids query distribution skew introduced by the pairwise expansion. We show the whole a second one of the two learning models is trained to determine the respective relevance parameter of the given in-use digital object, based on the first augmented plurality of training digital objects knowledge transfer process of the proposed machine learning model comprises at least two learning models BERT2DNN model in Fig. 3 and illustrate it in detail in the following subsections. First we discuss how texts are converted to embeddings to be consumed by the models. We then describe our teacher and student model architectures in detail, comparing several candidates and techniques. 
Finally we describe how to transfer knowledge from teacher model to student model with temperature parameters.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 17, Jiang, as modified by Ji, teaches The method of claim 16.
Jiang teaches wherein the first one of the two learning models is different from the second one ([A. Overview, pg. 214] The model setup is similar to the one described in [6], with one major difference. In [6], pairwise training format (query, itema, itemb) was essential to learn from relative user preference signals: e.g., higher CTR or more click counts across two different queries do not necessarily mean more relevant results. In this work, we only need pointwise data since the score produced by BERT teacher model has absolute relevance meaning. This allows more compact aggregation of the training data, and avoids query distribution skew introduced by the pairwise expansion. We show the whole knowledge transfer process of the wherein the first one of the two learning models is different from the second one proposed BERT2DNN model in Fig. 3 and illustrate it in detail in the following subsections. First we discuss how texts are converted to embeddings to be consumed by the models. We then describe our teacher and student model architectures in detail, comparing several candidates and techniques. Finally we describe how to transfer knowledge from teacher model to student model with temperature parameters.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 18, Jiang, as modified by Ji, teaches The method of claim 1.
Ji teaches further comprising ranking the in-use digital objects in accordance with respective relevance parameters associated therewith ([0024] In accordance with one or more embodiments, click data from a plurality of query sessions is used to train one or more relevance predictor models, and a trained relevance predictor model is used to ranking the in-use digital objects in accordance with respective relevance parameters associated therewith rank items in a search query according to relevance.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 19, Jiang, as modified by Ji, teaches The method of claim 1.
Ji teaches further comprising ranking the in-use digital objects based on respective relevance parameters associated therewith, the ranking comprising using another learning model having been trained to rank the in-use digital objects using the respective relevance parameters generated by the machine learning model as input features ([0024] In accordance with one or more embodiments, click data from a plurality of query sessions is used to train one or more relevance predictor models, and a ranking comprising using another learning model having been trained to rank the in-use digital objects trained relevance predictor model is used to ranking the in-use digital objects based on respective relevance parameters associated therewith rank items in a search query according to relevance.; [0028] Model generator 108 generates one or more relevance predictor models 110 using using the respective relevance parameters generated by the machine learning model as input features training data generated by training data generator 128.).
Jiang and Ji are combinable for the same rationale as set forth above with respect to claim 1.
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Jiang, in view of Ji, and further in view of Singh et al. (U.S. Pre-Grant Publication No. 20090210381, hereinafter 'Singh').
Regarding claim 3, Jiang, as modified by Ji, teaches The method of claim 2.
Jiang, as modified by Ji, fails to teach wherein the document metadata includes at least one of: the respective training search query associated with the given one of the first plurality of training digital objects, a title of the digital document, a content of the digital document, and a web address associated with the digital document.
Singh teaches wherein the document metadata includes at least one of: the respective training search query associated with the given one of the first plurality of training digital objects, a title of the digital document, a content of the digital document, and a web address associated with the digital document ([0097] The features that may be included in the training data instances and to which the function f(x) may be applied may include any of the features discussed above, such as tags or other a content of the digital document community metadata associated with a Web page, a title of the digital document a title of a Web page, a web address associated with the digital document a URL associated with a Web page, the respective training search query associated with the given one of the first plurality of training digital objects a query (or search terms) used to identify a Web page. The features may also include other information, including but not limited to editorial scores associated with a Web page, Web page anchor text, or DMOZ results.).
Jiang, Ji, and Singh are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Jiang and Ji, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Singh to Jiang before the effective filing date of the claimed invention in order to generate an abstract that more accurately represents relevant document content as compared to conventional abstract generation algorithms, operating in a fast and efficient manner that satisfies run time constraints associated with the search engine (cf. Singh, [0013] By using community-based metadata to help identify text fragments within a document that are most suitable for generating the abstract, an embodiment of the present invention generates an abstract that more accurately represents relevant document content as compared to conventional abstract generation algorithms. As will also be described herein, an embodiment of the present invention assembles the identified text fragments in a form that is easily understood by the user and that complies with size constraints imposed by a user interface. An embodiment of the present invention may also be programmed to operate in a fast and efficient manner that satisfies run time constraints associated with the search engine.).
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Jiang, in view of Ji, and further in view of Chevalier et al. (U.S. Pre-Grant Publication No. 20160070705, hereinafter 'Chevalier').
Regarding claim 14, Jiang, as modified by Ji, teaches The method of claim 13.
Jiang, as modified by Ji, fails to teach wherein improvement of the rough initial estimate is determined using a normalized discounted cumulative gain metric.
Chevalier teaches wherein improvement of the rough initial estimate is determined using a normalized discounted cumulative gain metric ([0047] Click rank is provided as one exemplary search metric, but various other metrics such as mean reciprocal rank, etc. may be determined in other embodiments in addition to and/or in place of click rank. “Reciprocal rank” refers to the multiplicative inverse of the rank of the first correct answer in a list of possible query results. The “correctness” of an answer may be determined in various ways. In some embodiments, for example, a result is correct when a user selects it, and mean reciprocal rank may be related to click rank in these embodiments. “Mean reciprocal rank” refers to an average of reciprocal ranks of results for a set of queries. In various embodiments, mean reciprocal rank may be determined using similar techniques to those disclosed for click rank (e.g., using past search data to determine mean reciprocal rank based on adjusted search parameters, without actually performing new searches). Other exemplary search metrics include, without limitation: a metric relating to whether a given number of top clicked results show up in a new tuning and a improvement of the rough initial estimate is determined using a normalized discounted cumulative gain metric normalized discounted cumulative gain metric.).
Jiang, Ji, and Chevalier are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Jiang and Ji, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Chevalier to Jiang before the effective filing date of the claimed invention in order to quickly determine various impacts of adjusted search parameters before deploying the adjusted search parameters on a search system (cf. Chevalier, [0018] This disclosure initially describes, with reference to FIG. 1, embodiments of a search profile feedback user interface that provides feedback, including, for example, real-time feedback regarding ordering of search result based on adjustment of relevancy parameters. It then describes, with reference to FIGS. 2-5C, embodiments of a user interface that provide feedback regarding various search metrics for past queries using updated search parameters. Additional exemplary queries, results, methods, and systems are discussed with reference to FIGS. 6-9. In various embodiments, the disclosed techniques may allow a search system administrator to quickly determine various impacts of adjusted search parameters before deploying the adjusted search parameters on a search system.).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Ross et al. (U.S. Pre-Grant Publication No. 20190179940) teaches implementations relating to providing, in response to a query, machine learning model output that is based on output from a trained machine learning model.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAGGIE MAIDO whose telephone number is (703) 756-1953. The examiner can normally be reached M-Th: 6am - 4pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley, can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MM/Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129