DETAILED ACTION
This action is responsive to the application filed on 12/08/2025. Claims 1, 3-5, and 7-11 are pending and have been examined. This action is Non-Final.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, 365(c), or 386(c) is acknowledged.
Response to Arguments
Argument 1 (101 rejection): The applicant argues that the rejection under 35 U.S.C. 101 is improper because the Office action allegedly evaluates the claims at an overly high level of generality and improperly equates machine learning with an unpatentable algorithm (see Remarks, page 10). The applicant further contends that the claim limitations directed to a first model comprising a bi-encoder, a second model comprising a cross-encoder, training a third model using second data, and limiting the number of combinations represented in the first data to fewer than all possible combinations, are not themselves mathematical concepts (see Remarks, page 11). The applicant further argues that, even assuming a judicial exception is recited, claim 1 integrates any alleged judicial exception into a practical application and provides significantly more than the alleged judicial exception by providing a technical solution that improves accuracy of data to be output while suppressing required computer resources and computing time (see Remarks, pages 12 and part of 13).
Examiner Response to Argument 1: The examiner has considered the applicant’s arguments but finds them unpersuasive. The rejection is not based on categorically excluding machine learning inventions. Rather, as set forth in the Office action, claim 1 recites limitations that are directed to mathematical concepts and relationships, including generating vector representations from product information, determining similarity between vector representations, selecting combinations based on computed similarities, and training a model using training data, which are mathematical operations performed on information. The recited use of a bi-encoder and a cross-encoder, and the comparative higher accuracy requirement, describe the application of these mathematical similarity operations using generic machine learning models, but do not change the fundamental character of the claim as directed to computing and using vector representations and similarity relationships. Further, the additional elements recited in claim 1 do not integrate the judicial exception into a practical application because the claim does not recite a specific technological improvement to computer operation itself, for example a particular data structure, indexing mechanism, memory-management technique, hardware configuration, or other claimed improvement to the functioning of the computer. Instead, the claim recites generic model execution to improve similarity scoring and training outcomes, which is an improvement to the abstract information processing result rather than a technological improvement to the computer. Accordingly, claim 1 does not integrate the judicial exception into a practical application and does not amount to significantly more than the judicial exception. Therefore, the rejection under 35 U.S.C. 101 is maintained.
Argument 2 (Art rejection): The applicant argues that the applied prior art fails to disclose the specific three-model training pipeline recited in claims 1, 10, and 11. In particular, the applicant contends that the claimed invention requires a sequence in which a first model comprising a bi-encoder generates candidate combinations of similar products, a second model comprising a cross-encoder evaluates those combinations to produce more accurate similarity determinations, and a third model is subsequently trained using data generated by the cross-encoder. The applicant asserts that the applied references do not teach or suggest this coordinated sequence of operations involving three distinct models operating in a training pipeline (see Remarks, pages 14-17).
Examiner Response to Argument 2: The examiner has considered the argument but finds it unpersuasive. The applied references teach, or at least render obvious, the claimed multi-stage training pipeline. In particular, Thakur teaches a bi-encoder used to encode items into vector representations and to retrieve or sample similar pairs, followed by use of a separate cross-encoder to evaluate those candidate pairs and label them, thereby producing a labeled dataset, and then training a bi-encoder using that cross-encoder-labeled dataset. As such, Thakur teaches the claimed sequence of generating candidate similar pairs using a first encoder model, evaluating the candidate pairs using a cross-encoder to obtain higher-accuracy pair judgments, and training an encoder model using the resulting labeled data generated by the cross-encoder. Therefore, the applicant’s position that the applied art fails to disclose the claimed three-model training pipeline is not persuasive.
Argument 3 (Art rejection): The applicant argues that the prior art combination of Zhang and Thakur fails to disclose the limitation requiring the second model to be configured to receive product information comprising product titles for two products and to output a vector corresponding to the input product information of those two products, and asserts that Thakur does not explicitly teach the specific input and output configuration of such a cross-encoder (see Remarks, pages 14-16).
Examiner Response to Argument 3: The examiner has considered the argument set forth above. The examiner asserts that the rejection of claims 1, 10, and 11 is proper and that the applicant’s argument is not persuasive because the combination of Thakur and Lu (Lu being a newly applied reference in light of the amendments) teaches or at least renders obvious the disputed limitation. As set forth in the rejection, Thakur teaches the claimed second model as a cross-encoder operating on paired inputs, for example “Cross-encoders, which perform full-attention over the input pair,” and “Given a pre-trained, well-performing cross-encoder, we sample sentence pairs and label these using the cross-encoder,” which demonstrates that the second model jointly processes two items drawn from the first data. Lu further teaches the specific input structure of such a cross-encoder, namely that the input is “the concatenation of q and p with a special token [SEP]” and that the “[CLS]” representation of the output is fed into a linear function to compute the relevance score. Thus, Lu expressly teaches a cross-encoder that receives two inputs together as a single paired input and produces a combined output representation used for similarity or relevance scoring. In view of Zhang’s teaching that the relevant item information includes “Item Title Tokens,” the combined teachings reasonably suggest using product-title information for the two paired inputs. Accordingly, the cited combination teaches or at least makes obvious a second model configured to receive product information comprising product titles for two products and output a representation corresponding to those paired inputs, and therefore applicant’s argument is not persuasive.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition
of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the
conditions and requirements of this title.
Claims 1, 3-5, and 7-11 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies Step 1.
Step 2A Prong 1:
(a) “generating, using a first model comprising a bi-encoder, first vector representations based on product information for the plurality of products, and generating first data indicating one or more combinations of products determined to be highly similar to each other from among the plurality of products based on similarities between the first vector representations” - The limitation is directed to generating vector representations and determining similarity between vector representations in order to identify combinations deemed highly similar. Generating vectors and computing similarity between vectors are mathematical concepts/calculations/operations, and thus the limitation is directed to math.
(b) “generating, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data and generating second data indicating one or more combinations of products determined to be highly similar to each other based on the second vector representations,” - The limitation is directed to producing a similarity value between products based on the model output for paired inputs. Computing a similarity value is a mathematical relationship and mathematical calculation. Therefore, this limitation is directed to a mathematical concept.
(c) “wherein the second model is configured to output a similarity value between products with a higher accuracy than the first model” - The limitation is directed to a comparative performance relationship between outputs of two models. This is an evaluation and comparison of results expressed in a quantified manner and is tied to the mathematical similarity output. Therefore, this limitation is directed to a mathematical concept.
Step 2A Prong 2 and Step 2B:
“A method of learning a model used for identifying a product in accordance with a predetermined search condition from among a plurality of products, performed by a learning device, the method comprising:” - The limitation recites a method of learning a model used for searching for a product with a predetermined search condition to be performed on a learning device. The limitation recites mere instructions to apply onto a computer, and thus the limitation does not integrate to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
“wherein the first model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information… wherein the second model is configured to receive product information comprising product titles for two products, and to output a vector corresponding to the input product information of the two products, wherein the second model is configured to output a similarity value between products with a higher accuracy than the first model…wherein the third model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information” - The limitation recites receiving product information as input and then outputting a vector that corresponds to the received product information for multiple models and corresponding information within each model. The limitation is directed to an insignificant, extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the act of receiving/sending data over a network is a well-understood, routine, and conventional activity (WURC) that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“with using a second model for generating vector representations from combinations of two product information;” - The limitation recites using a second model for generating vector representations from combinations of two product information. The limitation amounts to no more than mere instructions to apply onto a computer, and thus it does not integrate to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
“training third model to generate vector representations from the product information using the second data as training data.” - The limitation recites training a third model to generate the vector representations from product information using gathered data as the training data. The limitation amounts to no more than mere instructions to apply onto a computer, and thus it does not integrate to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
“and wherein the first model, the second model, and the third model are different models, and wherein a number of product combinations of the plurality of products represented in the first data is less than a number of all combinations of the plurality of products.” - The limitation recites that the models are different and that the number of product combinations of products represented in the first data is less than the combinations of products. The limitation amounts to no more than mere further limiting to a field of use/environment, and it does not integrate to a practical application, nor does it provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 1 is not patent eligible.
Regarding claim 3,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies Step 1.
Step 2A Prong 1:
“The method according to claim 1, wherein similarity between the first vector representations is computed based on distance between two of the first vector representations.” - The limitation is directed to computing similarity based on the distance between two vector representations. The limitation is directed to a mathematical calculation/concept, and it is considered math.
There are no elements to be evaluated under Step 2A Prong 2 and Step 2B.
Thus, claim 3 is not patent eligible.
Regarding claim 4,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies Step 1.
Step 2A Prong 1:
“The method according to claim 1, wherein in the step of generating the second data, similarity is computed based on a score computed based on the second vector representations.” - The limitation is directed to computing similarity based on a score value based on the second vector representations. The limitation is directed to mathematical calculation/concept, and it is considered math.
There are no elements to be evaluated under Step 2A Prong 2 and Step 2B.
Thus, claim 4 is not patent eligible.
Regarding claim 5,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies Step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The method according to claim 1, wherein the third model comprises a bi-encoder.” - The limitation recites that the third model comprises a bi-encoder. The limitation amounts to no more than merely limiting to a field of use/environment, and thus the limitation does not integrate to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 5 is not patent eligible.
Regarding claim 7,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies Step 1.
Step 2A Prong 1:
“compute vector representations lower in dimension than the vector representations.” - The limitation is directed to computing vector representations that are lower in dimension. The limitation is directed to the use of mathematical calculations/concept, and thus the limitation is directed to math.
Step 2A Prong 2 and Step 2B:
“The method according to claim 1, further comprising a step of using a dimensionality reduction encoder to” - The limitation recites a step of using a dimensionality reduction encoder to compute the vector representations that are lower in dimension. The limitation is directed to mere instructions to apply the encoder for executing the abstract idea, and thus it does not integrate to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
Thus, claim 7 is not patent eligible.
Regarding claim 8,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies Step 1.
Step 2A Prong 1:
“The method according to claim 1, wherein in the step of generating the second data, when there is data related to a combination of products manually determined to be highly similar to each other, the similarity is determined to be high based on the second vector representations, and second data including the combination of products manually determined to be highly similar to each other is generated.” - The limitation is directed to manually determining combinations of products to be highly similar to one another for the data. The limitation is directed to a process that can be performed in the human mind using evaluation, observation, and judgment, with the aid of pen and paper, and thus the limitation is directed to a mental process.
There are no elements to be evaluated under Step 2A Prong 2 and Step 2B.
Thus, claim 8 is not patent eligible.
Regarding claim 9,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies Step 1.
Step 2A Prong 1:
“generating third vector representations based on product information of the new product in order to generate third data including new combinations of products highly similar to the new product based on similarity between the third vector representations and the first vector representations with using the first model; generating fourth vector representations from combinations of product information of two products included in the new combinations included in the third data and generating fourth data including combinations of highly similar products based on the fourth vector representations with using the second model;” - The limitation is directed to generating vector representations based on product information of the new product and generating new combinations of products based on similarity. The limitation is directed to the use of mathematical calculations/concepts, and thus the limitation is directed to math.
Step 2A Prong 2 and Step 2B:
“The method according to claim 1, further comprising: receiving a new product put up for sale;” - The limitation recites a step to receive products that are put up for sale. The limitation is directed to an insignificant, extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the act of sending/receiving information and data over a network is a well-understood, routine, and conventional activity (WURC) and cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“executing learning of the third model using the second encoder annotation data as training data.” - The limitation is directed to executing learning of the third model using another encoder’s annotation data as the training data. The limitation amounts to no more than mere further limiting to a field of use/environment, and it does not integrate to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 9 is not patent eligible.
Regarding claim 10,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies Step 1.
Step 2A Prong 1:
“generating vector representations of the search condition using a third model, wherein the model is trained by: generating, using a first model comprising a bi-encoder, first vector representations based on product information data for the plurality of products, and generating first data indicating one or more combinations of products determined to be highly similar to each other from among the plurality of products based on similarities between the first vector representations,… generating, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data and generating second data indicating one or more combinations of products determined to be highly similar to each other based on the second vector representations,…” - The limitation is directed to generating vector representations of the search condition using a model, generating vector representations based on product information for products, and generating data indicating combinations of products that are determined to be similar to each other. The limitation is directed to the use of mathematical calculations/operations/relationships, and thus it is directed to math. Furthermore, the limitation element of comparing combinations of products determined to be similar to one another from a plurality (group) of products based on similarities between vector representations is directed to a process that can be performed in the human mind using evaluation, observation, and judgment to perform the task, and thus it is also directed to a mental process.
Step 2A Prong 2 and Step 2B:
“A method of outputting search results of related products, comprising: receiving input information indicating a search condition from a user; receiving input of a search condition from a user;…wherein the first model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, wherein the second model is configured to output a similarity value between products with a higher accuracy than the first model:…wherein the second model is configured to receive product information comprising product titles for two products, and to output a vector corresponding to the input product information of the two products;… wherein the third model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information;” - The limitation recites a method of outputting/inputting search results which involves receiving input from a user. The limitation is directed to an insignificant, extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under step 2B, sending/receiving data over network is also considered an insignificant, extra-solution activity that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“acquiring product information of products corresponding to the search condition based on similarity between vector representations of product information of the plurality of products generated by the third model and vector representations of the search condition;” - The limitation is directed to acquiring product information based on similarity between the vector representations. The limitation is directed to obtaining data based on gathered data/information, and it is considered an insignificant, extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the limitation is also considered an insignificant, extra-solution activity that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“displaying the acquired product information of the product corresponding to the search condition,” - The limitation is directed to displaying the acquired product information that corresponds to the search condition. The limitation of merely displaying information corresponding to a condition is considered instructions to apply, and it does not integrate to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
“and training the third model to generate vector representations from the product information using the second data as training data,” - The limitation recites mere instructions of training the third model to generate vector representations from product information using the second data as training data. The limitation is directed to mere instructions to apply onto a computer, and it cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
“wherein the first model, the second model, and the third model are different models, and wherein a number of product combinations of the plurality of products represented in the first data is less than a number of all combinations of the plurality of products.” - The limitation recites that the models are different and that the number of product combinations of products represented in the first data is less than the combinations of products. The limitation amounts to no more than mere further limiting to a field of use/environment, and it does not integrate to a practical application, nor does it provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 10 is not patent eligible.
Regarding claim 11,
Step 1: The claim is directed to a learning device comprising at least one memory and at least one processor, which falls under the category of a machine. The claim satisfies Step 1.
Step 2A Prong 1:
“wherein a number of product combinations of the plurality of products represented in the first data is less than a number of all combinations of the plurality of products.” - The limitation is directed to the number of product combinations of the plurality of products represented in the first data being less than the number of all combinations of the plurality of products. The limitation is directed to a process that can be performed in the human mind using evaluation, observation, and judgment, and thus the limitation is directed to a mental process.
“first generating code configured to cause at least one of the at least one processor to generate, using a first model comprising a bi-encoder, first vector representations based on product information data for the plurality of products, and generate first data indicating one or more combinations of products determined to be highly similar to each other from among the plurality of products based on similarities between the first vector representations,… generate, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data and generate second data indicating one or more combinations of products determined to be highly similar to each other based on the second vector representations,… generate vector representations from the product information using the second data as training data,” - The limitation is directed to generating vector representations and determining similarity between vector representations in order to identify combinations deemed highly similar. Generating vectors and computing similarity between vectors are mathematical concepts/calculations/operations, and thus the limitation is directed to math.
Step 2A Prong 2 and Step 2B:
“A learning device for training a model for identifying a product in accordance with a predetermined search condition from among a plurality of products, the device comprising:…at least one processor configured to read the program code and operate as instructed by the program code, the program code comprising:… first generating code configured to cause at least one of the at least one processor to… second generating code configured to cause at least one of the at least one processor to… training code configured to cause at least one of the at least one processor to train a third model to” - The limitation recites a learning device that trains a model for identifying a product according to a predetermined search condition from among a plurality of products, with the device executing the recited tasks, and further recites merely using a processor and program code to apply the abstract idea on a generic computer. The limitation amounts to no more than mere instructions to apply onto a computer, and it cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
“at least one memory configured to store computer program code;” - The limitation recites memory configured to store computer code. The limitation is directed to an insignificant, extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the act of storing data in memory is a well-understood, routine, and conventional activity (WURC) that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“wherein the first model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information;… wherein the second model is configured to receive product information comprising product titles for two products, and to output a vector corresponding to the input product information of the two products;… wherein the third model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information,” - The limitation recites a model configured to receive product information that comprises a product title as an input for a product, then to output a vector corresponding to the input of the product. The limitation is directed to an insignificant, extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the act of receiving/sending data over a network is a well-understood, routine, and conventional activity (WURC) that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“wherein the first model, the second model, and the third model are different models,” - The limitation recites that the first, second, and third models are different models. The limitation amounts to no more than mere further limiting to a field of use/environment, and thus it does not integrate to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 11 is not patent eligible.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not
identically disclosed as set forth in section 102, if the differences between the claimed invention and the
prior art are such that the claimed invention as a whole would have been obvious before the effective filing
date of the claimed invention to a person having ordinary skill in the art to which the claimed invention
pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are
summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 3-5, and 8-11 are rejected under 35 U.S.C. 103 as being unpatentable over NPL reference “Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning” by Zhang et al. (referred to herein as Zhang), in view of NPL reference “Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks” by Thakur et al. (referred to herein as Thakur), and further in view of NPL reference “ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval” by Lu et al. (referred to herein as Lu).
Regarding claim 1, Zhang teaches:
A method for training a model for identifying a product in accordance with a predetermined search condition from among a plurality of products, performed by a learning device, the method comprising: ([Zhang, page 2409, sec 3] “Offline Model Training module trains a two tower model…for the uses in online serving and offline indexing…enable fast online embedding retrieval…transform any user input query text to query embedding, which is then fed to the item embedding index to retrieve K similar items.”, wherein the examiner interprets “training a two tower model for online serving and transforming a user's input query to retrieve K similar items” to be the same as “training a model used for identifying a product corresponding to a predetermined search condition from among a plurality of products”, because they are both describing a trained model deployed in a search system that retrieves products matching a given user condition.)
wherein the first model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, ([Zhang, Figure 3, page 2409] showing “Item Title Tokens” as an input feature to the item tower, and [Zhang, page 2409-2410, sec 4.1] “As shown in the right side of offline model training panel in Figure 3. [The item tower] concatenates all item features as input layer, then goes through multi-layer perceptron (MLP)…to output a single item embedding” and [Zhang, page 2408, sec 1.2] “especially for e-commerce search, where item titles are often short.”, wherein the examiner interprets Zhang’s item tower receiving item title tokens (product title information) for a single item and outputting a single item embedding vector to be the same as “the first model being configured to receive product information comprising a product title for one product as input and to output a vector corresponding to the input product information,” because they are both describing a model that ingests a product title for a product and produces a corresponding vector representation (embedding) for that product.)
Zhang does not teach generating, using a first model comprising a bi-encoder, first vector representations based on product information data for the plurality of products, and generating first data indicating one or more combinations of products determined to be highly similar to each other from among the plurality of products based on similarities between the first vector representations, … generating, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data and generating second data indicating one or more combinations of products determined to be highly similar to each other based on the second vector representations, wherein the second model is configured to output a similarity value between products with a higher accuracy than the first model; …and training a third model to generate vector representations from the product information using the second data as training data, wherein the third model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, and wherein the first model, the second model, and the third model are different models, and wherein a number of product combinations of the plurality of products represented in the first data is less than a number of all combinations of the plurality of products.
Thakur teaches:
generating, using a first model comprising a bi-encoder, first vector representations based on product information data for the plurality of products, ([Thakur, page 1, sec 1] “bi-encoders such as Sentence BERT (SBERT) encode each sentence independently and map them to a dense vector space.” and [Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set…and use it to sample further, similar sentence pairs. We use cosine-similarity and retrieve for every sentence the top k most similar sentences in our collection.”, wherein the examiner interprets training an initial SBERT bi-encoder and using it to independently encode each input into a dense vector to be the same as “using a first model comprising a bi-encoder to generate first vector representations based on product information for the plurality of products”, because they are both describing an independent-encoding architecture that maps each item into a dense vector representation from which similarity can be computed.)
and generating first data indicating one or more combinations of products determined to be highly similar to each other from among the plurality of products based on similarities between the first vector representations, ([Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set…and use it to sample further, similar sentence pairs. We use cosine-similarity and retrieve for every sentence the top k most similar sentences in our collection.”, wherein the examiner interprets using the bi-encoder's cosine similarity over encoded vectors to retrieve the top-k most similar sentence pairs to be the same as “generating first data indicating combinations of products determined to be highly similar based on similarities between the first vector representations”, because they are both using a bi-encoder's vector similarity to select a subset of similar pairs from the full collection.)
generating, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data ([Thakur, page 3, sec 3.1] “Given a pre-trained, well-performing cross-encoder, we sample sentence pairs according to a certain sampling strategy (discussed later) and label these using the cross-encoder.” AND [Thakur, page 1, Abstract] “Cross-encoders, which perform full-attention over the input pair”, wherein the examiner interprets the cross-encoder performing full-attention over a pair of inputs drawn from the set of similar pairs identified by the bi-encoder to be the same as using a second model comprising a cross-encoder to generate second vector representations from the product information of two products included in the first data, because they are both describing a cross-encoder that takes two items jointly as input, specifically pairs that were first identified as candidates by the bi-encoder, and produces a joint representation from those paired inputs.)
and generating second data indicating one or more combinations of products determined to be highly similar to each other based on the second vector representations, ([Thakur, page 1, Abstract] “we use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder.” AND [Thakur, page 3, sec 3.1] “We call these weakly labeled examples the silver dataset and they will be merged with the gold training dataset.”, wherein the examiner interprets the cross-encoder labeling input pairs and producing a silver dataset of labeled similar pairs to be the same as generating second data indicating combinations of products determined to be highly similar based on the second vector representations, because they are both using a cross-encoder to produce a labeled dataset of highly similar pairs that will serve as downstream training data.)
wherein the second model is configured to output a similarity value between products with a higher accuracy than the first model; ([Thakur, page 1, Abstract] “While cross-encoders often achieve higher performance, they are too slow for many practical use cases.” and [Thakur, page 1, sec 1] “A drawback of the SBERT bi-encoder is usually a lower performance in comparison with the BERT cross-encoder.”, wherein the examiner interprets the cross-encoder achieving higher performance/accuracy than the bi-encoder to be the same as the second model outputting a similarity value with higher accuracy than the first model, because they are both describing the well-established relative superiority of cross-encoders over bi-encoders in terms of accuracy of similarity scoring between two items.)
and training a third model to generate vector representations from the product information using the second data as training data, ([Thakur, page 3, sec 3.1] “We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT).”, wherein the examiner interprets training a new bi-encoder on the silver dataset labeled by the cross-encoder to be the same as training a third model to generate vector representations from product information using the second data as training data, because they are both describing training a new encoder model using a training dataset that was generated and labeled by the cross-encoder.)
wherein the third model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, ([Thakur, page 1, sec 1] “bi-encoders…encode each sentence independently and map them to a dense vector space.” and [Thakur, page 3, sec 3.1] “We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT).”, wherein the examiner interprets the newly trained AugSBERT bi-encoder independently encoding each single input into a dense vector to be the same as the third model being configured to receive product information comprising a product title for one product as input and to output a vector corresponding to that product, because they are both describing a model that independently takes a single input item and maps it to a vector representation.)
and wherein the first model, the second model, and the third model are different models, ([Thakur, page 3, sec 3.1] “Given a pre-trained, well-performing cross-encoder, we sample sentence pairs according to a certain sampling strategy (discussed later) and label these using the cross-encoder. We call these weakly labeled examples the silver dataset and they will be merged with the gold training dataset. We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT). The process is illustrated in Figure 2.” and [Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set as described in section 5 and use it to sample further, similar sentence pairs.”, wherein the examiner interprets Thakur's explicit use of three structurally and parametrically distinct models, namely (1) an initial SBERT bi-encoder used for semantic-search sampling to identify candidate similar pairs, (2) a separate BERT cross-encoder used to label those pairs, and (3) a newly trained AugSBERT bi-encoder trained on the cross-encoder-labeled silver data, to be the same as the first, second, and third models being different models, because they are both describing a pipeline with three distinct models each serving a different role, where no two models are the same model being reused.)
and wherein a number of product combinations of the plurality of products represented in the first data is less than a number of all combinations of the plurality of products. ([Thakur, page 3, sec 3.1] “there are n × (n - 1)/2 possible combinations for n sentences. Weakly labeling all possible combinations would create an extreme computational overhead, and, as our experiments show, would likely not lead to a performance improvement.”, wherein the examiner interprets the teaching that it is computationally infeasible and undesirable to label all possible pair combinations and that only a sampled subset is used to be the same as the number of product combinations represented in the first data being less than the number of all combinations of the plurality of products, because they are both recognizing that processing every possible pair from the full item set is impractical and that only a selected subset of combinations, retrieved by the bi-encoder, is carried forward.)
Zhang and Thakur do not teach wherein the second model is configured to receive product information comprising product titles for two products, and to output a vector corresponding to the input product information of the two products.
Lu teaches wherein the second model is configured to receive product information comprising product titles for two products, and to output a vector corresponding to the input product information of the two products, ([Lu, page 3, sec 3.1] “cross-encoder computes the relevance score sce(q, p), where the input is the concatenation of q and p with a special token [SEP]. Subsequently, the [CLS] representation of the output is fed into a linear function to compute the relevance score.”, wherein the examiner interprets Lu’s cross-encoder receiving a paired input (the concatenation of two inputs separated by a special token [SEP]) and producing a [CLS] representation corresponding to that paired input to be the same as “the second model being configured to receive product information comprising product titles for two products and to output a vector corresponding to the input product information of the two products,” because they are both describing a model that jointly ingests two inputs and produces a vector representation corresponding to the combined paired input.)
Zhang, Thakur, Lu, and the instant application are analogous art because they are all directed to neural retrieval systems that use encoder-based models.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the offline training module disclosed by Zhang to include the encoding technique disclosed by Thakur. One would be motivated to do so to effectively generate dense vector representations for large collections of textual items that enable efficient similarity computation and candidate retrieval, as suggested by Thakur ([Thakur, page 1, sec 1] “bi-encoders such as Sentence BERT (SBERT) encode each sentence independently and map them to a dense vector space.”). It would have further been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the offline training module disclosed by Zhang to include the cross-encoding technique disclosed by Lu. One would be motivated to do so to effectively compute more accurate similarity or relevance scores between paired inputs by jointly encoding both items in a single model, as suggested by Lu ([Lu, page 3, sec 3.1] “cross-encoder computes the relevance score sce(q, p), where the input is the concatenation of q and p with a special token [SEP].”).
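For context only, the two-stage candidate-generation-and-relabeling flow mapped above from Thakur can be sketched in Python. The `bi_encode` and `cross_score` functions below are made-up stand-ins (normalized letter counts and word overlap), not the SBERT/BERT models of the cited references; the sketch merely illustrates the recited data flow: a first model independently embeds each product, only a top-k subset of pairs (fewer than all n×(n-1)/2 combinations) becomes the first data, and a second, pairwise model re-scores those pairs to produce the second (silver) data.

```python
import math
from itertools import combinations

def bi_encode(title):
    # Hypothetical stand-in for a bi-encoder: maps ONE title to a unit vector.
    # A real bi-encoder (e.g., SBERT) would produce a dense learned embedding.
    vocab = "abcdefghijklmnopqrstuvwxyz"
    vec = [title.lower().count(c) for c in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(u, v):
    # Cosine similarity; vectors are already normalized, so a dot product suffices.
    return sum(a * b for a, b in zip(u, v))

def cross_score(title_a, title_b):
    # Hypothetical stand-in for a cross-encoder: scores the PAIR jointly
    # (here, word-overlap Jaccard) rather than comparing fixed vectors.
    a, b = set(title_a.lower().split()), set(title_b.lower().split())
    return len(a & b) / len(a | b)

def build_silver_data(titles, k=1, threshold=0.3):
    # Stage 1 (first model): embed every product independently, then keep only
    # each product's top-k nearest neighbors ("first data") -- a subset smaller
    # than all n*(n-1)/2 combinations.
    vecs = [bi_encode(t) for t in titles]
    candidates = set()
    for i, vi in enumerate(vecs):
        sims = sorted(((cosine(vi, vj), j) for j, vj in enumerate(vecs) if j != i),
                      reverse=True)
        for _, j in sims[:k]:
            candidates.add(tuple(sorted((i, j))))
    # Stage 2 (second model): re-score only the candidate pairs jointly and keep
    # the high-scoring ones ("second data" / silver training set).
    return [(i, j) for i, j in sorted(candidates)
            if cross_score(titles[i], titles[j]) >= threshold]

titles = ["red running shoes", "running shoes red",
          "blue ceramic mug", "ceramic coffee mug"]
silver = build_silver_data(titles)
# Far fewer pairs are carried forward than the full set of combinations.
assert len(silver) < len(list(combinations(range(len(titles)), 2)))
```

The retained pairs would then serve as training data for a third, separately trained bi-encoder, consistent with the AugSBERT pipeline mapped above.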
Regarding Claim 3, Zhang, Thakur, and Lu teach The method according to claim 1, (see rejection of claim 1).
Zhang further teaches wherein similarity between the first vector representations is computed based on distance between two of the first vector representations; ([Zhang, page 2409, sec 3] “we employ one of state-of-the-art algorithms [15] for efficient nearest-neighbor search of dense vectors.” and [Zhang, page 2410, sec 4.1] “simple dot product interaction between query and item towers, the query and item embeddings are still theoretically in the same geometric space. Thus finding K nearest items for a given query embedding is equivalent to minimizing the loss for K query item pairs where the query is given.…G(Q(q), S(s)) = ∑ wᵢ eᵢᵀ g”, wherein the examiner interprets nearest-neighbor search of dense vectors and finding K nearest items to be the same as computing similarity based on distance between two vector representations, because they are both procedures that compare embeddings in a shared space and identify the pairs with the smallest distance (i.e., highest similarity).)
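The equivalence relied on in the mapping above (nearest-neighbor search over dense vectors as a proxy for similarity) can be illustrated with a toy example. The vectors below are arbitrary illustrative values, not data from Zhang; the point is that for unit-length embeddings, ||u - v||² = 2 - 2(u·v), so minimizing distance and maximizing the dot-product similarity identify the same nearest neighbor.

```python
import math

def euclidean(u, v):
    # Straight-line distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot(u, v):
    # Dot-product similarity; equals cosine similarity for unit vectors.
    return sum(a * b for a, b in zip(u, v))

# Three made-up unit vectors standing in for product embeddings.
u = [0.6, 0.8]
v = [0.8, 0.6]
w = [0.0, 1.0]

assert euclidean(u, v) < euclidean(u, w)   # u is closer to v ...
assert dot(u, v) > dot(u, w)               # ... and also more similar to v
```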
Regarding Claim 4, Zhang, Thakur, and Lu teach The method according to claim 1, (see rejection of claim 1).
Zhang further teaches wherein in the step of generating the second data, similarity is computed based on a score computed based on the second vector representations. ([Zhang, page 2410, sec 4.3] “the soft dot product interaction between query and item can be defined as follows, G(Q(q), S(s)) = ∑ wᵢ eᵢᵀ g” and [Zhang, page 2410, sec 4.3] “This scoring function is basically a weighted sum of all inner products between m query embeddings and one item embedding.”, wherein the examiner interprets Zhang’s discussion of a scoring function formed by weighted inner products of the query and item embeddings to be the same as computing similarity based on a score derived from the second vector representations because they are both describing how a similarity measure is produced by applying a mathematical function (dot-product weighting) to the vector outputs of the second model.)
Zhang, Thakur, Lu and the instant application are analogous art because they are all directed to neural retrieval methods that compute similarity between encoded item representations using scoring functions over vector outputs.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Zhang, Thakur, and Lu to include the scoring function disclosed by Zhang. One would be motivated to do so to effectively compute similarity scores from encoded vector representations so as to improve ranking and selection of highly similar item pairs, as suggested by Zhang ([Zhang, page 2410] “This scoring function is basically a weighted sum of all inner products between m query embeddings and one item embedding.”).
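Zhang's quoted scoring function, G(Q(q), S(s)) = ∑ wᵢ eᵢᵀ g, a weighted sum of inner products between m query embeddings and one item embedding, can be sketched numerically. The weights and embeddings below are made-up illustrative values, not parameters from the reference:

```python
def score(query_embs, weights, item_emb):
    # G(Q(q), S(s)) = sum_i w_i * (e_i . g): a weighted sum of the inner
    # products between each of the m query embeddings e_i and the single
    # item embedding g, as in Zhang's soft dot-product interaction.
    return sum(w * sum(a * b for a, b in zip(e, item_emb))
               for w, e in zip(weights, query_embs))

query_embs = [[1.0, 0.0], [0.0, 1.0]]   # m = 2 query embeddings (made-up)
weights = [0.7, 0.3]                    # made-up mixture weights
item_emb = [0.5, 0.5]                   # made-up item embedding g

assert abs(score(query_embs, weights, item_emb) - 0.5) < 1e-9
```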
Regarding claim 5, Zhang, Thakur, and Lu teach The method according to claim 1, (see rejection of claim 1).
Thakur further teaches wherein the third model comprises a bi-encoder. ([Thakur, page 3, sec 3.1] “We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT).”, wherein the examiner interprets the newly trained AugSBERT model (which is a bi-encoder trained on the cross-encoder-labeled silver dataset) to be the same as the “third model comprising a bi-encoder”, because they are both describing a model that is trained last in the pipeline, using data generated by the cross-encoder as training data, and that independently encodes each single input into a dense vector representation, which is the defining characteristic of a bi-encoder.)
Zhang, Thakur, Lu and the instant application are analogous art because they are all directed to neural retrieval systems that train encoder-based models.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Zhang, Thakur, and Lu to include the bi-encoder process disclosed by Thakur. One would be motivated to do so to effectively improve the quality of vector representations used for retrieval by retraining a bi-encoder model on additional labeled similarity data generated during the training pipeline, as suggested by Thakur ([Thakur, page 3, sec 3.1] “We then train the bi-encoder on this extended training dataset.”).
Regarding Claim 8, Zhang, Thakur, and Lu teach The method according to claim 1, (see rejection of claim 1).
Thakur further teaches wherein in the step of generating the second data, when there is data related to a combination of products manually determined to be highly similar to each other, the similarity is determined to be high based on the second vector representations, and second data including the combination of products manually determined to be highly similar to each other is generated. ([Thakur, page 3, sec 3.1] “Given a pre-trained, well-performing cross encoder, we sample sentence pairs according to a certain sampling strategy (discussed later) and label these using the cross-encoder. We call these weakly labeled examples the silver dataset and they will be merged with the gold training dataset. We then train the bi-encoder on this extended training dataset.” and “we can re-use the sentences from the gold training set [human-annotated]”, wherein the examiner interprets “merged with the gold training dataset” (gold = human-labeled pairs already judged highly similar) and “weakly labeled…silver dataset” (pairs that the cross-encoder deems highly similar using second-stage vector representations) to be the same as “data related to a combination of products manually determined to be highly similar to each other” and “similarity is determined to be high based on the second vector representations…second data including the combination of products manually determined to be highly similar,” because they are both describing a process in which previously human-verified similar pairs (gold) are carried forward into a new dataset only after the model’s second-stage vectors (cross-encoder) confirm high similarity, thereby forming the updated second data.)
Zhang, Thakur, Lu, and the instant application are analogous art because they are all directed to enhancing the quality of product-pair training data.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method according to claim 1 disclosed by Zhang, Thakur, and Lu, to include the gold training dataset disclosed by Thakur. One would be motivated to do so to efficiently train the bi-encoder as suggested by Thakur ([Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set as described in section 5 and use it to sample further, similar sentence pairs.”)
Regarding Claim 9, Zhang, Thakur, and Lu teach The method according to claim 1, (see rejection of claim 1).
Zhang further teaches further comprising receiving a new product put up for sale; generating third vector representations based on product information of the new product in order to generate third data including new combinations of products highly similar to the new product based on similarity between the third vector representations and the first vector representations with using the first model; ([Zhang, page 2409, sec 3] “Offline Indexing module loads the item embedding model (i.e., the item tower) to compute all the item embeddings from the item”, and [Zhang, page 2409, sec 3] “transform any user input query text to query embedding…retrieve K similar items”, wherein the examiner interprets computing an embedding for a query item and retrieving K similar items by nearest-neighbor search to be the same as generating a vector for the new product with the first model and forming new combinations of highly similar products, because they are both embedding the new item and selecting its closest neighbors in the existing product-vector space.)
generating fourth vector representations from combinations of product information of two products included in the new combinations included in the third data and generating fourth data including combinations of highly similar products based on the fourth vector representations with using the second model; ([Thakur, page 3, sec 3.1] “we sample sentence pairs according to a certain sampling strategy (discussed later) and label these using the cross-encoder”, wherein the examiner interprets label these using the cross-encoder (which jointly encodes each item pair) to be the same as generating fourth vector representations with the second model to score the new pairs, because they are both re-embedding each candidate pair with a stronger cross-encoder to assess similarity.)
generating second encoder annotation data by annotating each of the combinations of highly similar products included in the fourth data to be positive; and executing learning of the third model using the second encoder annotation data as training data. ([Thakur, page 3, sec 3.1] “We call these weakly labeled examples the silver dataset and they will be merged with the gold training dataset. We then train the bi-encoder on this extended training dataset”, wherein the examiner interprets the silver dataset of cross-encoder-approved pairs to be the same as second encoder annotation data marked positive, because they are both collections of pairs that the second model has confirmed as highly similar. The examiner further interprets “train the bi-encoder on this extended training dataset” to be the same as executing learning of the third model with the second encoder annotation data, because they are both retraining the serving bi-encoder using the positives produced by the cross-encoder.)
Zhang, Thakur, Lu, and the instant application are analogous art because they are all directed to automated pipelines that ingest newly-arriving items.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method according to claim 1 disclosed by Zhang, Thakur, and Lu, to include the training data set fine-tuning process disclosed by Thakur. One would be motivated to do so to effectively improve the accuracy of the serving bi-encoder without costly manual labeling, as suggested by Thakur ([Thakur, page 1] “We use the cross-encoder to label new input pairs, which are added to the training set for the bi-encoder. The SBERT bi-encoder is then fine-tuned on this larger augmented training set, which yields a significant performance increase”).
Regarding claim 10, Zhang teaches:
A method of outputting search results of related products, comprising receiving input information indicating a search condition from a user, generating vector representations of the search condition using a third model, wherein the third model is trained by: ([Zhang, p. 2409, sec 3] “Online Serving module loads the query embedding model (i.e., the query tower) to transform any user input query text to query embedding, which is then fed to the item embedding index to retrieve 𝐾 similar items. Note that this online serving system has to be built with low latency of tens of milliseconds.”, wherein the examiner interprets “transform any user input query text to query embedding, which is then fed to the item embedding index to retrieve K similar items” to be the same as “receiving input information indicating a search condition from a user and generating vector representations of the search condition using a third model”, because they are both describing a system that accepts a user's query as input, converts that query into a vector representation using a trained model, and uses that vector to retrieve matching products from an indexed collection.)
wherein the first model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, ([Zhang, page 2409, Figure 3] showing “Item Title Tokens” as an input feature to the item tower, and [Zhang, page 2409-2410, sec 4.1] “As shown in the right side of offline model training panel in Figure 3. [The item tower] concatenates all item features as input layer, then goes through multi-layer perceptron (MLP)…to output a single item embedding” and [Zhang, page 2408, sec 1.2] “especially for e-commerce search, where item titles are often short.”, wherein the examiner interprets Zhang’s item tower receiving item title tokens (product title information) for a single item and outputting a single item embedding vector to be the same as “the first model being configured to receive product information comprising a product title for one product as input and to output a vector corresponding to the input product information,” because they are both describing a model that ingests a product title for a product and produces a corresponding vector representation (embedding) for that product.)
acquiring product information of products corresponding to the search condition based on similarity between vector representations of product information of the plurality of products generated by the third model and vector representations of the search condition; ([Zhang, page 2409] “transform any user input query text to query embedding, which is then fed to the item embedding index to retrieve K similar items.” and [Zhang, page 2410] “due to the simple dot product interaction...finding K nearest items for a given query embedding”, wherein the examiner interprets “fed to the item embedding index to retrieve K similar items” to be the same as acquiring product information of products corresponding to the search condition based on similarity between vector representations of the plurality of products and vector representations of the search condition, because they are both using similarity between the query vector and stored product vectors to identify and retrieve the most relevant matching products.)
and displaying the acquired product information of the products corresponding to the search condition, ([Zhang, page 2413, sec 6.1.3] “Semantic Matching. For better understanding of how our proposed model performs, we show a few good cases from our retrieval production in Table 1. We can observe that DPSR is surprisingly capable of bridging queries and relevant items by learning the semantic meaning of some words”, wherein the examiner interprets “we show a few good cases from our retrieval production” to be the same as displaying the acquired product information of the products corresponding to the search condition, because they are both referring to displaying the retrieved product results to illustrate the system's output to the user.)
Zhang does not teach generating, using a first model comprising a bi-encoder, first vector representations based on product information data for the plurality of products, and generating first data indicating one or more combinations of products determined to be highly similar to each other from among the plurality of products based on similarities between the first vector representations, … wherein the second model is configured to output a similarity value between products with a higher accuracy than the first model; … generating, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data and generating second data indicating one or more combinations of products determined to be highly similar to each other based on the second vector representations, … and training the third model to generate vector representations from the product information using the second data as training data, wherein the third model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, wherein the first model, the second model, and the third model are different models, and wherein a number of product combinations of the plurality of products represented in the first data is less than a number of all combinations of the plurality of products.
Thakur teaches:
generating, using a first model comprising a bi-encoder, first vector representations based on product information data for the plurality of products, and generating first data indicating one or more combinations of products determined to be highly similar to each other from among the plurality of products based on similarities between the first vector representations, ([Thakur, page 1, sec 1] “bi-encoders such as Sentence BERT (SBERT) encode each sentence independently and map them to a dense vector space.” and [Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set…and use it to sample further, similar sentence pairs. We use cosine-similarity and retrieve for every sentence the top k most similar sentences in our collection.”, wherein the examiner interprets training an initial SBERT bi-encoder and using it to independently encode each input into a dense vector and retrieve the top-k most similar pairs via cosine similarity to be the same as using a first model comprising a bi-encoder to generate first vector representations and generating first data indicating combinations of highly similar products, because they are both describing an independent encoding architecture that maps each item into a dense vector and uses vector similarity to select a subset of similar pairs from the full collection.)
wherein the second model is configured to output a similarity value between products with a higher accuracy than the first model; ([Thakur, page 1, Abstract] “While cross-encoders often achieve higher performance, they are too slow for many practical use cases … A drawback of the SBERT bi-encoder is usually a lower performance in comparison with the BERT cross-encoder.”, wherein the examiner interprets the cross-encoder achieving higher performance/accuracy than the bi-encoder to be the same as the second model outputting a similarity value with higher accuracy than the first model, because they are both describing the well-established relative superiority of cross-encoders over bi-encoders in terms of accuracy of similarity scoring between two items.)
generating, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data and generating second data indicating one or more combinations of products determined to be highly similar to each other based on the second vector representations, ([Thakur, page 3, sec 3.1] “Given a pre-trained, well-performing crossencoder, we sample sentence pairs according to a certain sampling strategy (discussed later) and label these using the cross-encoder. We call these weakly labeled examples the silver dataset and they will be merged with the gold training dataset.” and [Thakur, page 1] “Cross-encoders, which perform full-attention over the input pair … we use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder.”, wherein the examiner interprets the cross-encoder performing full-attention over pairs drawn from the set identified by the bi-encoder and producing a labeled silver dataset to be the same as “using a second model comprising a cross-encoder to generate second vector representations from two products included in the first data and generating second data indicating combinations of highly similar products”, because they are both describing a cross-encoder that takes two items jointly as input from a previously identified candidate set and produces a labeled dataset of highly similar pairs for downstream training.)
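As a minimal sketch of the mapped cross-encoder labeling stage (again not Thakur's actual model; the `toy_cross_score` function, titles, and threshold are assumed for illustration), the candidate pairs from the first stage are scored jointly and the retained, labeled pairs form the silver ("second") data:

```python
def toy_cross_score(a, b):
    # Hypothetical stand-in for a cross-encoder: the two titles are
    # scored jointly as one pair (here via token-set overlap), rather
    # than from two independently produced vectors.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

titles = ["red cotton shirt", "cotton shirt red", "steel water bottle", "usb cable"]
first_data = [(0, 1), (2, 3)]  # candidate pairs from the bi-encoder stage

# "Second data" / silver dataset: only candidate pairs the joint scorer
# rates as highly similar are kept, with their labels, for training.
silver = [((i, j), toy_cross_score(titles[i], titles[j])) for i, j in first_data]
second_data = [(pair, s) for pair, s in silver if s >= 0.5]
```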
and training the third model to generate vector representations from the product information using the second data as training data, wherein the third model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, ([Thakur, page 3, sec 3.1] “We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT).” and [Thakur, page 1, sec 1] “bi-encoders…encode each sentence independently and map them to a dense vector space.”, wherein the examiner interprets training the new AugSBERT bi-encoder on the cross-encoder-labeled silver dataset, where that bi-encoder independently encodes each single input into a dense vector, to be the same as training the third model to generate vector representations from product information using the second data as training data, wherein the third model receives a product title for one product and outputs a corresponding vector, because they are both describing a model trained last in the pipeline on cross-encoder-generated data that independently maps a single input item to a vector representation.)
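The final stage of the mapping can be sketched, under stated assumptions, as fitting fresh per-item embeddings so that their dot products reproduce the cross-encoder's silver labels; this toy gradient-descent loop stands in for training a new bi-encoder (Thakur's AugSBERT) on the silver data and is not the reference's implementation:

```python
import numpy as np

# Hypothetical final-stage sketch: learn one embedding per item so that
# dot products reproduce the cross-encoder's silver labels.
rng = np.random.default_rng(0)
n, dim = 4, 8
E = rng.standard_normal((n, dim)) * 0.1   # trainable item embeddings
silver = [((0, 1), 1.0), ((2, 3), 0.1)]   # (pair, cross-encoder label)

lr = 0.1
for _ in range(500):
    for (i, j), label in silver:
        grad = 2.0 * (E[i] @ E[j] - label)            # d(squared error)/d(score)
        E[i], E[j] = E[i] - lr * grad * E[j], E[j] - lr * grad * E[i]
```

After training, each row of `E` maps a single item to a vector, mirroring the third model's single-input, single-vector configuration.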
wherein the first model, the second model, and the third model are different models, ([Thakur, page 3, sec 3.1] “Given a pre-trained, well-performing cross-encoder, we sample sentence pairs according to a certain sampling strategy (discussed later) and label these using the cross-encoder…We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT).” and [Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set…and use it to sample further, similar sentence pairs.”, wherein the examiner interprets Thakur's explicit use of three structurally and parametrically distinct models, namely (1) an initial SBERT bi-encoder used for semantic search sampling to identify candidate similar pairs, (2) a separate BERT cross-encoder used to label those pairs, and (3) a newly trained AugSBERT bi-encoder trained on the cross-encoder-labeled silver data, to be the same as the first, second, and third models being different models, because they are both describing a pipeline with three distinct models, each serving a different role, in which no two models are the same model being reused.)
and wherein a number of product combinations of the plurality of products represented in the first data is less than a number of all combinations of the plurality of products. ([Thakur, page 3, sec 3.1] “there are n × (n - 1)/2 possible combinations for n sentences. Weakly labeling all possible combinations would create an extreme computational overhead, and, as our experiments show, would likely not lead to a performance improvement.”, wherein the examiner interprets the teaching that it is computationally infeasible and undesirable to label all possible pair combinations and that only a sampled subset is used to be the same as the number of product combinations represented in the first data being less than the number of all combinations of the plurality of products, because they are both recognizing that processing every possible pair from the full item set is impractical and that only a selected subset of combinations, retrieved by the bi-encoder, is carried forward.)
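The computational point Thakur makes can be checked arithmetically; the values of n and k below are illustrative, not from the reference:

```python
from math import comb

# The full pairwise space grows quadratically, while top-k retrieval
# carries forward at most k pairs per item.
n, k = 100_000, 3
all_pairs = comb(n, 2)   # n * (n - 1) / 2 possible combinations
sampled = n * k          # upper bound on pairs kept in the "first data"
```

Here `sampled` (300,000) is several orders of magnitude smaller than `all_pairs` (roughly 5 billion), which is the "fewer than all combinations" property the claim recites.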
Thakur does not teach wherein the second model is configured to receive product information comprising product titles for two products, and to output a vector corresponding to the input product information of the two products.
Lu teaches wherein the second model is configured to receive product information comprising product titles for two products, and to output a vector corresponding to the input product information of the two products ([Lu, page 3, sec 3.1] “cross-encoder computes the relevance score sce(q, p), where the input is the concatenation of q and p with a special token [SEP]. Subsequently, the [CLS] representation of the output is fed into a linear function to compute the relevance score.”, wherein the examiner interprets Lu’s cross-encoder receiving a paired input (the concatenation of two inputs separated by a special token [SEP]) and producing a [CLS] representation corresponding to that paired input to be the same as “the second model being configured to receive product information comprising product titles for two products and to output a vector corresponding to the input product information of the two products,” because they are both describing a model that jointly ingests two inputs and produces a vector representation corresponding to the combined paired input.)
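The paired-input structure Lu describes can be sketched as follows; the pooling, the scoring weights, and the example titles are assumptions chosen for brevity, standing in for the [CLS] representation and linear head of Lu's cross-encoder:

```python
import numpy as np

def toy_pair_vector(a, b, dim=8):
    # Hypothetical sketch: the two titles are concatenated around a
    # "[SEP]" marker and the joint token sequence is pooled into a
    # single vector, standing in for a cross-encoder's [CLS] output.
    tokens = a.lower().split() + ["[SEP]"] + b.lower().split()
    rows = []
    for tok in tokens:
        rng = np.random.default_rng(sum(ord(c) for c in tok))
        rows.append(rng.standard_normal(dim))
    return np.mean(rows, axis=0)

w = np.full(8, 0.125)                        # stand-in linear scoring head
v = toy_pair_vector("red cotton shirt", "crimson cotton shirt")
score = float(w @ v)                         # one scalar relevance score
```

The key structural point is that one vector `v` corresponds to the two-product input jointly, rather than one vector per product.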
Zhang, Thakur, Lu, and the instant application are analogous art because they are all directed to neural retrieval systems that generate vector representations of textual inputs.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the search retrieval system disclosed by Zhang to include the training pipeline disclosed by Thakur. One would be motivated to do so to effectively generate high-quality dense vector representations for textual inputs using a teacher–student training framework in which a cross-encoder produces labeled similarity data used to train an efficient bi-encoder model for retrieval, thereby improving the semantic accuracy of retrieved search results while maintaining efficient retrieval performance, as suggested by Thakur ([Thakur, page 1, Abstract] “While cross-encoders often achieve higher performance, they are too slow for many practical use cases … A drawback of the SBERT bi-encoder is usually a lower performance in comparison with the BERT cross-encoder.”). It would have further been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the retrieval and training pipeline disclosed by Zhang and Thakur to include the paired-input cross-encoding technique disclosed by Lu. One would be motivated to do so to effectively compute more accurate similarity or relevance scores between pairs of textual inputs by jointly encoding both items within a single model representation, thereby producing more accurate similarity labels for training downstream encoder models, as suggested by Lu ([Lu, page 3, sec 3.1] “cross-encoder computes the relevance score sce(q, p), where the input is the concatenation of q and p with a special token [SEP].”).
Regarding claim 11, Zhang teaches:
A learning device for training a model for identifying a product in accordance with a predetermined search condition from among a plurality of products, the device comprising: ([Zhang, page 2409, sec 3] “Offline Model Training module trains a two tower model…for the uses in online serving and offline indexing…enable fast online embedding retrieval.” and [Zhang, page 2409] “transform any user input query text to query embedding, which is then fed to the item embedding index to retrieve K similar items.”, wherein the examiner interprets “training a two tower model for online serving and transforming a user's input query to retrieve K similar items” to be the same as “training a model used for identifying a product corresponding to a predetermined search condition from among a plurality of products”, because they are both describing a trained model deployed in a search system that retrieves products matching a given user condition.)
at least one memory configured to store computer program code, and at least one processor configured to read the program code and operate as instructed by the program code, the program code ([Zhang, page 2409, sec 3] “Offline Model Training module trains a two tower model…for the uses in online serving and offline indexing.”, wherein the examiner interprets the Offline Model Training module as a computing system executing programmatic training instructions to be the same as a device comprising at least one memory configured to store computer program code and at least one processor configured to read and execute that program code, because they are both describing a hardware-implemented computing system that stores and executes instructions to carry out the model training pipeline.)
wherein the first model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, ([Zhang, Figure 3, page 2409] showing “Item Title Tokens” as an input feature to the item tower, and [Zhang, page 2409-2410, sec 4.1] “As shown in the right side of offline model training panel in Figure 3. [The item tower] concatenates all item features as input layer, then goes through multi-layer perceptron (MLP)…to output a single item embedding” and [Zhang, page 2408, sec 1.2] “especially for e-commerce search, where item titles are often short.”, wherein the examiner interprets Zhang’s item tower receiving item title tokens (product title information) for a single item and outputting a single item embedding vector to be the same as “the first model being configured to receive product information comprising a product title for one product as input and to output a vector corresponding to the input product information,” because they are both describing a model that ingests a product title for a product and produces a corresponding vector representation (embedding) for that product.)
Zhang does not teach comprising first generating code configured to cause at least one of the at least one processor to generate, using a first model comprising a bi-encoder, first vector representations based on product information data for the plurality of products, ... and generate first data indicating one or more combinations of products determined to be highly similar to each other from among the plurality of products based on similarities between the first vector representations, ... second generating code configured to cause at least one of the at least one processor to generate, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data and generate second data indicating one or more combinations of products determined to be highly similar to each other based on the second vector representations, ... and training code configured to cause at least one of the at least one processor to train a third model to generate vector representations from the product information using the second data as training data, wherein the third model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, wherein the first model, the second model, and the third model are different models, and wherein a number of product combinations of the plurality of products represented in the first data is less than a number of all combinations of the plurality of products.
Thakur teaches comprising first generating code configured to cause at least one of the at least one processor to generate, using a first model comprising a bi-encoder, first vector representations based on product information data for the plurality of products, ([Thakur, page 1, sec 1] “bi-encoders such as Sentence BERT (SBERT) encode each sentence independently and map them to a dense vector space.” and [Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set…and use it to sample further, similar sentence pairs. We use cosine-similarity and retrieve for every sentence the top k most similar sentences in our collection.”, wherein the examiner interprets executable instructions that train and run an initial SBERT bi-encoder to independently encode each input item into a dense vector representation to be the same as first generating code configured to cause a processor to generate, using a first model comprising a bi-encoder, first vector representations based on product information data for the plurality of products, because they are both describing programmatic instructions that invoke a bi-encoder to independently map each product's information into a dense vector representation across the full product collection.)
and generate first data indicating one or more combinations of products determined to be highly similar to each other from among the plurality of products based on similarities between the first vector representations, ([Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set…and use it to sample further, similar sentence pairs. We use cosine-similarity and retrieve for every sentence the top k most similar sentences in our collection.”, wherein the examiner interprets executable instructions that use the bi-encoder's cosine similarity over encoded vectors to retrieve the top-k most similar pairs from the collection to be the same as first generating code configured to generate first data indicating combinations of products determined to be highly similar based on similarities between the first vector representations, because they are both describing programmatic instructions that apply vector similarity over the bi-encoder's output to produce a set of highly similar product pair combinations from the full product collection.)
second generating code configured to cause at least one of the at least one processor to generate, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data ([Thakur, page 3, sec 3.1] “Given a pre-trained, well-performing cross-encoder, we sample sentence pairs according to a certain sampling strategy (discussed later) and label these using the cross-encoder.” and [Thakur, page 1] “Cross-encoders, which perform full-attention over the input pair”, wherein the examiner interprets executable instructions that invoke a cross-encoder to perform full-attention over pairs drawn from the set identified by the bi-encoder to be the same as second generating code configured to cause a processor to generate, using a second model comprising a cross-encoder, second vector representations from product information of two products included in the first data, because they are both describing programmatic instructions that pass previously identified candidate pairs jointly into a cross-encoder to produce a joint representation from those paired inputs.)
and generate second data indicating one or more combinations of products determined to be highly similar to each other based on the second vector representations, ([Thakur, page 1, Abstract] “we use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder. We call these weakly labeled examples the silver dataset and they will be merged with the gold training dataset.”, wherein the examiner interprets executable instructions that use the cross-encoder to label input pairs and produce a silver dataset of labeled similar pairs to be the same as second generating code configured to generate second data indicating combinations of products determined to be highly similar based on the second vector representations, because they are both describing programmatic instructions that invoke a cross-encoder to produce a labeled dataset of highly similar pairs that will serve as downstream training data.)
and training code configured to cause at least one of the at least one processor to train a third model to generate vector representations from the product information using the second data as training data, ([Thakur, page 3, sec 3.1] “We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT).”, wherein the examiner interprets executable instructions that train a new AugSBERT bi-encoder on the cross-encoder-labeled silver dataset to be the same as training code configured to cause a processor to train a third model to generate vector representations from product information using the second data as training data, because they are both describing programmatic instructions that invoke a training routine using the cross-encoder-generated labeled dataset to produce a new, independently trained encoder model.)
wherein the third model is configured to receive product information comprising a product title for one product as input, and to output a vector corresponding to the input product information, ([Thakur, page 1, sec 1] “bi-encoders…encode each sentence independently and map them to a dense vector space.” and [Thakur, page 3, sec 3.1] “We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT).”, wherein the examiner interprets the newly trained AugSBERT bi-encoder independently encoding each single input into a dense vector to be the same as the third model being configured to receive product information comprising a product title for one product as input and to output a vector corresponding to that product, because they are both describing a model that independently takes a single input item and maps it to a vector representation.)
wherein the first model, the second model, and the third model are different models, ([Thakur, page 3, sec 3.1] “Given a pre-trained, well-performing cross-encoder, we sample sentence pairs according to a certain sampling strategy (discussed later) and label these using the cross-encoder…We then train the bi-encoder on this extended training dataset. We refer to this model as Augmented SBERT (AugSBERT).” and [Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set…and use it to sample further, similar sentence pairs.”, wherein the examiner interprets Thakur's explicit use of three structurally and parametrically distinct models, namely (1) an initial SBERT bi-encoder used for semantic search sampling to identify candidate similar pairs, (2) a separate BERT cross-encoder used to label those pairs, and (3) a newly trained AugSBERT bi-encoder trained on the cross-encoder-labeled silver data, to be the same as the first, second, and third models being different models, because they are both describing a pipeline with three distinct models, each serving a different role, in which no two models are the same model being reused.)
and wherein a number of product combinations of the plurality of products represented in the first data is less than a number of all combinations of the plurality of products. ([Thakur, page 3, sec 3.1] “there are n × (n - 1)/2 possible combinations for n sentences. Weakly labeling all possible combinations would create an extreme computational overhead, and, as our experiments show, would likely not lead to a performance improvement.”, wherein the examiner interprets the teaching that it is computationally infeasible and undesirable to label all possible pair combinations and that only a sampled subset is used to be the same as the number of product combinations represented in the first data being less than the number of all combinations of the plurality of products, because they are both recognizing that processing every possible pair from the full item set is impractical and that only a selected subset of combinations, retrieved by the bi-encoder, is carried forward.)
Zhang and Thakur do not teach wherein the second model is configured to receive product information comprising product titles for two products, and to output a vector corresponding to the input product information of the two products.
Lu teaches wherein the second model is configured to receive product information comprising product titles for two products, and to output a vector corresponding to the input product information of the two products, ([Lu, page 3, sec 3.1] “cross-encoder computes the relevance score sce(q, p), where the input is the concatenation of q and p with a special token [SEP]. Subsequently, the [CLS] representation of the output is fed into a linear function to compute the relevance score.”, wherein the examiner interprets Lu’s cross-encoder receiving a paired input (the concatenation of two inputs separated by a special token [SEP]) and producing a [CLS] representation corresponding to that paired input to be the same as “the second model being configured to receive product information comprising product titles for two products and to output a vector corresponding to the input product information of the two products,” because they are both describing a model that jointly ingests two inputs and produces a vector representation corresponding to the combined paired input.)
Zhang, Thakur, Lu, and the instant application are analogous art because they are all directed to machine learning systems for semantic similarity search.
It would have further been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the product identification system disclosed by Zhang to include the bi-encoder generating code disclosed by Thakur. One would be motivated to do so to efficiently identify highly similar product pairs from the product collection using independently generated dense vector representations and cosine similarity, as suggested by Thakur ([Thakur, page 4, sec 3.1] “We train a bi-encoder (SBERT) on the gold training set…and use it to sample further, similar sentence pairs. We use cosine-similarity and retrieve for every sentence the top k most similar sentences in our collection.”). It would have further been obvious to one of ordinary skill in the art before the effective filing date to include the cross-encoder technique disclosed by Lu. One would be motivated to do so to effectively improve the accuracy of similarity scoring between paired textual items by jointly encoding both inputs and computing a relevance score from their combined representation, as suggested by Lu ([Lu, page 3, sec 3.1] “cross-encoder computes the relevance score sce(q, p), where the input is the concatenation of q and p with a special token [SEP].”).
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Thakur and Lu, and further in view of NPL reference “Billion-Scale Similarity Search with GPUs” by Johnson et al. (referred to herein as Johnson).
Regarding claim 7, Zhang, Thakur, and Lu teach The method according to claim 1 (see rejection of claim 1).
Zhang, Thakur, and Lu do not teach further comprising a step of using a dimensionality reduction encoder to compute vector representations lower in dimension than the vector representations.
Johnson teaches further comprising a step of using a dimensionality reduction encoder to compute vector representations lower in dimension than the vector representations. ([Johnson, page 535, sec 1] “several approaches employ compressed representations of the vectors using an encoding. This is especially convenient for memory-limited devices like GPUs. It turns out that accepting a minimal accuracy loss can result in orders of magnitude of compression” and [Johnson, page 535, sec 1] “the optimized product quantization or OPQ is a linear transformation on the input vectors that improves the accuracy of the product quantization; it can be applied as a pre-processing.”, wherein the examiner interprets “compressed representations of the vectors using an encoding” and “a linear transformation…applied as a pre-processing” to be the same as employing a dimensionality-reduction encoder that outputs lower-dimensional vectors, because they are both transforming higher-dimensional embeddings into more compact representations to reduce storage and accelerate subsequent similarity search.)
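A dimensionality-reduction encoder of the kind mapped here can be sketched with a PCA projection; this is an assumption chosen for brevity (Johnson's OPQ similarly applies a learned linear transform before quantization, but the code below is not Johnson's implementation):

```python
import numpy as np

# Illustrative dimensionality-reduction encoder: project embeddings
# onto their top principal axes, yielding lower-dimensional vectors.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 128))    # high-dimensional item embeddings

Xc = X - X.mean(axis=0)                 # center before projecting
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Vt[:32].T                        # keep the top 32 principal axes
Z = Xc @ proj                           # 128-dim -> 32-dim representations
```

The 32-dimensional vectors in `Z` are lower in dimension than the original 128-dimensional embeddings, reducing storage and accelerating subsequent similarity search.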
Zhang, Thakur, Lu, Johnson, and the instant application are analogous art because they are all directed to methods of product search that employ vector embeddings and similarity search.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method according to claim 1 disclosed by Zhang, Thakur, and Lu to include the compressed representation technique disclosed by Johnson. One would be motivated to do so to efficiently reduce the storage and computation required for similarity search over the vector representations, as suggested by Johnson ([Johnson, page 535, sec 1] “several approaches employ compressed representations of the vectors using an encoding…accepting a minimal accuracy loss can result in orders of magnitude of compression”).
Conclusion
THIS ACTION IS NON-FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this Office action is set to expire THREE MONTHS from the mailing date of this action. Extensions of time may be obtained under the provisions of 37 CFR 1.136(a). In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DEVAN KAPOOR whose telephone number is (703)756-1434. The examiner can normally be reached Monday - Friday: 9:00AM - 5:00 PM EST (times may vary).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DEVAN KAPOOR/Examiner, Art Unit 2126
/DAVID YI/Supervisory Patent Examiner, Art Unit 2126