DETAILED ACTION
Receipt of Applicant’s Amendment, filed December 24, 2025, is acknowledged.
Claims 1 and 8 were amended.
Claim 3 was canceled.
Claims 1-2 and 4-12 are pending in this Office action.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings are objected to because Figure 1 is illegible. One of ordinary skill in the art would not be able to discern what Figure 1 is intended to portray. Page 14, lines 4-5 of the specification state that Figure 1 “shows an example of clusters of similar terms in a set of around 73500 documents”. The “clusters” described in the specification are not visible in Figure 1, nor are the terms. Any term depicted in a drawing is expected to be readable.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Claim Objections
Claims 1-2 and 4-12 are objected to because of the following informalities. Appropriate correction is required.
With regard to claims 1 and 8, claim 1 recites “calculating SimSet groups using the previously calculated word embeddings of the tokenized character strings, in each document of the at least one subset of the set of documents, which are sets of tokenized character strings computed by clustering the previously calculated word embeddings that represent the respective tokenized character string based on vector similarity of those word embeddings…”. The punctuation used renders the meaning of the claim unclear. What is “in each document”? What is the “set of tokenized character strings computed by clustering”? Are these limitations intended to refer to the SimSet groups, the previously calculated word embeddings, or the tokenized character strings?
It should be noted that step “a)” of the claim recites “at least one subset of the set of documents that have the tokenized character string,” meaning that the claim has already established that the tokenized character strings are in the documents of the at least one subset of the set of documents. It is unnecessary to repeat claim limitations, and in this instance the repetition serves to add confusion to the scope of the claim. It is suggested that duplicate limitations be removed to improve the readability of the claim.
It is noted that step “b)” recites “calculating word embeddings that represent the tokenized character string”. This claim limitation appears to be repeated in the instant claim, adding confusion to the scope of the claim. It is unnecessary to repeat limitations that the claim has already established.
For examination purposes this claim limitation has been construed to mean -- calculating SimSet groups using the previously calculated word embeddings of the tokenized character strings, wherein the SimSet groups are sets of tokenized character strings computed by clustering the previously calculated word embeddings based on vector similarity of those word embeddings--.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 4, 7-10 are rejected under 35 U.S.C. 103 as being unpatentable over Roitblat [6189002] in view of Sommer [6847966], Schuetze [EP0687987], Vailaya [2008/0263023] and Bellegarda [WO2018/208514].
With regard to claim 1, Roitblat teaches A method for pre-selecting (Roitblat, Column 10, lines 20-27 “The resulting profile is then compared with the centroids of each of the categories learned by the self-organizing map. One of these centroids will provide the best match to the query vector, so the patterns represented by that neuron are likely to be the best match for the query”) and determining similar documents as the documents in this cluster (Roitblat, Column 10, lines 27-28 “The semantic profiles of the documents in this cluster are compared with the query vector and ranked in order of decreasing dot product”) from a set of documents, wherein the set of documents as the documents retrieved (Roitblat, Column 5, line 63 “Documents are retrieved (FIG. 1, 104)”) have tokenized character strings (Roitblat, Column 5, line 67 “Words in the documents”), the method comprising the steps of:
a) calculating, with an indexing method (Roitblat, Column 12, line 24 “the word-to-index dictionary, and the word-by-document matrix”; Column 4, Table 1), an [[ as by document (Roitblat, Column 12, line 24 “the word-to-index dictionary, and the word-by-document matrix”; Column 4, Table 1) of the set of documents (Roitblat, Column 12, line 24 “the word-to-index dictionary, and the word-by-document matrix”; Column 4, Table 1) that have the tokenized as extracting principal components (Roitblat, Column 7, lines 30-36 “The first embodiment to be described is based on a neural network that extracts the principal components from the co-occurrence matrix. As shown in FIG. 1, the base text (the text corpus) is pre-processed to remove all formatting and all hard return characters except those between paragraphs (step 101). Very short paragraphs, such as titles are combined with the subsequent paragraphs.”; Figure 1, 101 “Process Base Text”; Please note that this claim limitation has been read in light of the definition given for “Tokenizing” provided in Page 15, lines 14-16: “coding. Tokenizing means breaking down a text into individually processable components (words, terms and punctuation marks).”) character strings (Roitblat, Column 5, line 67 “Words in the documents”; Figure 1, 102 “Built Text Term Vectors”), wherein the tokenized character strings are formed by tokenizing as extracting the principal components (Roitblat, Column 7, lines 30-36) each document as of each document (Roitblat, Column 7, lines 30-36) in the set of documents into respective character strings (Roitblat, Column 7, lines 37-39 “A dictionary (vocabulary) is created that maps word forms and their uninflected stems to elements in a text vector (step 102).”) by breaking down text as extracting (Roitblat, Column 7, lines 30-36) in each document into one or more of (Please note that only one of the following is required for the prior art to read on the claimed device) words as word forms (Roitblat, Column 7, line 37 “word forms and their uninflected stems”; Please note the claim limitation ‘word’ has been construed in light of Page 6, lines 8-9 of the original specification: “Coherent character strings (alphanumeric characters, hyphen) can be understood as words of a language”), terms as their uninflected stems (Roitblat, Column 7, lines 37-39; Please note the claim limitation ‘term’ has been construed in light of Page 6, lines 9-12 of the original specification: “A term can be regarded as a superset of words that can comprise still further punctuation marks or printable special characters or can consist of multiple related words and terms.”), or [[
b) calculating word embeddings as sparse vectors when PCA is used (Roitblat, Column 7, lines 45-52 “the network implements a principal components analysis of the collection of text vectors. This analysis reduces the data representation from a set of sparse vectors with length K to a collection of reduced vectors with length N. It projects the original data vectors onto another set of vectors eliminating the redundancy, i.e., the correlation, among the elements of the original vectors.”) or low dimensional representations when decomposition techniques are used (Roitblat, Column 8, lines 48-52 “There are other techniques that can be used in place of principal components to project the high dimensional text vectors onto lower dimensional representations. These techniques are known in the statistical literature as matrix decomposition techniques.”) that represent the tokenized character strings as the text vectors when using PCA (Roitblat, Column 7, lines 46-50) or the high dimensional text vectors when using decomposition (Roitblat, Column 8, lines 48-52) of the at least one subset of the set of documents as the documents being analyzed (Roitblat, Column 7, lines 30-44) in a [[ as the covariance matrix when PCA is used (Roitblat, Column 7, line 67 - Column 8, line 1 “The matrix C is the KxK covariance matrix defined by C=E{xxT}”) or the lower dimensional subspace when decomposition techniques are used (Roitblat, Column 8, lines 48-52), wherein each word embedding comprises a respective vector as the vector (Roitblat, Column 5, lines 49-54 “an input vector are taken to be a semantic profile of the text unit that produced the input vector. We can think of the elements of the hidden layer as representing a set of unnamed semantic primitives representing the words (in context) on which it was trained.”) corresponding to a respective tokenized character string as the text unit that is used to produce the input vector (Id), wherein contextual information (Roitblat, Column 3, lines 5-7 “The present invention uses the context in which words appear as part of the representation of what those words mean.”) of the surrounding (Roitblat, Column 3, lines 20-24 “The context for the system comes from the analysis of a body of text, called a base text or a text corpus, that is representative of the interests of a particular community, as a guide to interpreting the words in their vocabulary, i.e., the aggregate of words in the text corpus”) tokenized character strings as the text (Id) is encoded in each component of their corresponding vector as projecting the text vector into the matrix (Roitblat, Column 7, lines 45-50; Column 8, lines 48-52), and wherein the resulting vectors for similar tokenized character strings have (Please note this claim limitation reads as a description of an inherent property of the vectors themselves, and does not recite a step of calculating similarity. One of ordinary skill in the art would recognize that the covariance matrix/dimensional subspace [Roitblat, Column 7, lines 45-50; Column 8, lines 48-52] would inherently have this property.)
less spatial distance to each other than resulting vectors for non-similar tokenized character strings (Roitblat, Column 7, lines 1-3 “The proximity of one vector to another corresponds to the similarity between the two vectors, obtained using the dot product of the two vectors.”; One of ordinary skill in the art would recognize that while Roitblat discusses the use of this property within the high-dimensional space generated in step 109, this is built upon the covariance matrix/dimensional subspace generated during step 102, and that it is the use of the covariance matrix/dimensional subspace which provides the ability to use proximity to calculate this similarity. Roitblat specifically uses this property to identify duplicate documents: Column 6, lines 10-14 “If the cosine (the normalized dot product) between the document's original text vector”);
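Please note, as an illustrative aside only: the proximity property relied upon above (the normalized dot product, i.e. the cosine, of two vectors measuring the similarity of the corresponding terms) can be sketched minimally as follows. The vectors and names below are hypothetical values chosen for illustration and are not drawn from Roitblat.

    import numpy as np

    def cosine(u, v):
        # Normalized dot product, the relatedness measure Roitblat describes
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Hypothetical reduced word embeddings (e.g., after PCA/decomposition)
    v_car = np.array([0.90, 0.10, 0.00])
    v_auto = np.array([0.85, 0.15, 0.05])
    v_banana = np.array([0.00, 0.20, 0.95])

    # Similar terms have less spatial distance (higher cosine) than non-similar terms
    assert cosine(v_car, v_auto) > cosine(v_car, v_banana)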
c) calculating a respective document embedding as Document Term vectors (Figure 1, 105) for each document in the at least one subset of the set of documents as one vector for each document (Roitblat, Column 5, lines 63-65 “Documents are retrieved (FIG. 1, 104) and converted to text vectors - one vector for each document (FIG. 1, 105).”) by [[ as combining each paragraph's text vector with the subsequent text paragraph, e.g. counting the frequency of occurrence of words in the document (Roitblat, Column 5, lines 15-19 “One text vector is produced for each paragraph. The paragraphs that are used are those that occur naturally in the text, except that very short paragraphs-titles, for example-are combined with the subsequent text paragraphs.”; Figure 2, 203; Column 12, lines 3-4 “Entries in the matrix are the frequencies with which each word occurred in each document”), for each document of the at least one subset of the set of documents (Roitblat, Column 5, lines 63-65 “Documents are retrieved (FIG. 1, 104) and converted to text vectors -- one vector for each document (FIG. 1, 105).”), the word embeddings of all of the tokenized character strings within that document (Roitblat, Column 5, lines 15-19 “One text vector is produced for each paragraph. The paragraphs that are used are those that occur naturally in the text, except that very short paragraphs-titles, for example-are combined with the subsequent text paragraphs.”; Figure 2, 203; Column 12, lines 3-4 “Entries in the matrix are the frequencies with which each word occurred in each document”) and normalizing as normalizing (Roitblat, Column 12, lines 8-13) the sum as the matrix as a whole (Roitblat, Column 11, line 66 - Column 12, line 4 “The word-by-document matrix has as many rows as there are elements in the text vectors, i.e., K. Each row corresponds to one word or word stem. The columns of the matrix, stored in sparse form, are the documents remaining in the database. Entries in the matrix are the frequencies with which each word occurred in each document”) of said word embeddings as each row of the matrix, e.g. representing the text vectors (Id) with the number of the tokenized character strings in that document as the number of words in the document (Roitblat, Column 12, lines 8-13 “These entries are then further transformed by dividing them by the log of the number of words in the document. Normalizing by document length in this way reduces the inherent bias for longer documents, which typically contain more examples of any particular word in the dictionary.”);
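As an illustrative aside, the operation mapped in step c) under the claim's construction (summing the word embeddings of all tokenized character strings in a document and normalizing the sum by the token count) reduces to the following minimal sketch; the names and values are hypothetical.

    import numpy as np

    def document_embedding(token_vectors):
        # Sum the word embeddings of every tokenized character string in the
        # document, then normalize the sum by the number of tokens.
        vectors = np.asarray(token_vectors, dtype=float)
        return vectors.sum(axis=0) / len(vectors)

    doc = [np.array([0.9, 0.1]), np.array([0.8, 0.2]), np.array([0.1, 0.9])]
    print(document_embedding(doc))  # mean of the three token vectors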
d) calculating SimSet groups (Roitblat, Column 6, lines 52-54 “the profiles for each of the cached documents can be organized into clusters”) using the previously calculated word embeddings as the semantic profile is the vector representing the document (Roitblat, Column 5, lines 49-54 “an input vector are taken to be a semantic profile of the text unit that produced the input vector. We can think of the elements of the hidden layer as representing a set of unnamed semantic primitives representing the words (in context) on which it was trained.”) of the tokenized character strings as the text unit (Id), in each document of the at least one subset of the set of documents (Roitblat, Column 6, lines 31-34 “The text vectors of those documents that pass the relevance tests are submitted to the neural network as input and the hidden unit activation pattern is used as a semantic profile of the page (FIG. 1, 109).”), which are sets of tokenized character strings computed by clustering (Roitblat, Column 6, lines 52-54 “the profiles for each of the cached documents can be organized into clusters”; Column 9, lines 63-64 “the documents are organized into clusters”) the previously calculated word embeddings as the semantic profile is the vector representing the document (Roitblat, Column 5, lines 49-54 “an input vector are taken to be a semantic profile of the text unit that produced the input vector. We can think of the elements of the hidden layer as representing a set of unnamed semantic primitives representing the words (in context) on which it was trained.”) that represent the respective tokenized character string as the text unit (Id) based on vector similarity (Roitblat, Column 7, lines 1-3 “The proximity of one vector to another corresponds to the similarity between the two vectors”; Roitblat, Column 9, lines 54-64) of those word embeddings as the two vectors (Id), via an unsupervised, non-parameterized clustering algorithm as k-means clustering is inherently an unsupervised, non-parameterized clustering algorithm (Roitblat, Column 6, lines 52-56 “the profiles for each of the cached documents can be organized into clusters using either self-organizing feature map neural networks or the equivalent K-means clustering statistical procedure.”; Claim 12, “using k-means clustering”), wherein for the calculation of the SimSet groups (Roitblat, Column 9, lines 63-64 “the documents are organized into clusters”) only the terms of the document set that are above an importance threshold as only preserving the most important relations among terms and neglecting those relationships that are more idiosyncratic in the text, via the use of the threshold (Roitblat, Column 11, lines 23-24 “A threshold is then calculated corresponding to nonlinear lateral inhibition in the neural network”; Column 11, lines 27-30 “Only those items substantially above the mean are maintained, the others are set to 0. In doing this, we preserve the most important relations among the terms and neglect those relationships that are more idiosyncratic in the text”) based on the largest combined as for the document matrix (Roitblat, Column 11, line 66) TF (Roitblat, Column 12, lines 4-5 “the frequencies with which each word occurred in each document”; line 7 “the word frequency”) IDF (Roitblat, Column 12, lines 8-9 “These entries are then further transformed by dividing them by the log of the number of words in the document”; line 12 “use this number to compute a log inverse document frequency, idf”) are used as only preserving the most important relations among terms and neglecting those relationships that are more idiosyncratic in the text, via the use of the threshold (Roitblat, Column 11, lines 23-24 “A threshold is then calculated corresponding to nonlinear lateral inhibition in the neural network”; Column 11, lines 27-30 “Only those items substantially above the mean are maintained, the others are set to 0. In doing this, we preserve the most important relations among the terms and neglect those relationships that are more idiosyncratic in the text”), wherein for the calculation of the SimSet groups a similarity threshold (Roitblat, Column 9, lines 53-60 “As a potential document is processed, it is transformed into a text vector x and passed through the neural network. The activation patterns of the units are saved as the semantic profile for the document and the result vector x̂ is obtained. The cosine of the angle between x and x̂ is computed. Those documents with cosines greater than a predetermined criterion (e.g., 0.39) are admitted to the database, the rest are discarded”; Column 11, lines 23-24 “A threshold is then calculated corresponding to nonlinear lateral inhibition in the neural network”; Column 11, lines 27-30 “Only those items substantially above the mean are maintained, the others are set to 0. In doing this, we preserve the most important relations among the terms and neglect those relationships that are more idiosyncratic in the text”; Column 11, lines 42-44 “Relatedness is computed using the cosine of the angle between the two vectors. Documents with cosines above 0.39 are maintained in the database, cosines below 0.39 are not processed further”) is further used to exclude as preserving the most important relations and neglecting those that are more idiosyncratic in text (Id) dissimilar terms as not judged similar (Id) in which word embeddings as the vectors (Id) have a cosine similarity as cosine (Id) of less than 0 as only those items substantially above the mean are maintained, the others are set to 0 (Id), resulting in SimSet groups (Roitblat, Column 9, lines 63-64 “the documents are organized into clusters”) only comprising strings or words that (see the illustrative sketch following item 3) below):
1) frequently occur in the text based on a term frequency TF (Roitblat, Column 12, lines 4-5 “the frequencies with which each word occurred in each document”; line 7 “the word frequency”),
2) have a high information content based on the inverse document frequency IDF (Roitblat, Column 12, lines 8-9 “These entries are then further transformed by dividing them by the log of the number of words in the document”; line 12 “use this number to compute a log inverse document frequency, idf”), and
3) are similar to each other (Roitblat, Column 7, lines 1-3 “The proximity of one vector to another corresponds to the similarity between the two vectors”);
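As referenced above, the TF/IDF-based importance selection mapped in step d) can be sketched as follows; this illustrative sketch uses one common TF-IDF formulation, whereas Roitblat's log-based document-length normalization differs in its details, and the threshold value is left as a free parameter.

    import math
    from collections import Counter

    def important_terms(docs, importance_threshold):
        # docs: list of token lists. Keep only terms whose combined TF*IDF
        # score in some document exceeds the importance threshold.
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        kept = set()
        for doc in docs:
            tf = Counter(doc)
            for term, freq in tf.items():
                idf = math.log(n / df[term])
                if freq * idf > importance_threshold:
                    kept.add(term)
        return kept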
e) performing a query expansion as generating a result query vector (Roitblat, Column 9, lines 32-37 “may be computationally more efficient to compute the result vector and compare a sparse representation of the terms in the document vector with a sparse representation of the result vector. The only documents that need to be evaluated are those containing the terms described by the result vector,”), [[
i) first query terms that occur in the SimSet groups (Roitblat, Column 6, lines 46-47 “represents a semantic profile of each search term”; Column 9, lines 10-20) that matches at least one query term of the received search query as identifying the centroid that matches the query vector (Roitblat, Column 10, lines 24-27 “One of these centroids will provide the best match to the query vector, so the patterns represented by that neuron are likely to be the best match for the query.”); or
ii) [[
iii) [[
f) determining a query embedding as the semantic profile for the query (Roitblat, Column 6, lines 42-44 “A user query (FIG. 4, 401) is processed through the same neural network to produce its semantic profile (FIG. 4, 402, 403).”) from said expanded query terms as the result query vector (Roitblat, Column 9, lines 32-37) by summing as ANDing (Roitblat, Column 12, lines 58-65 “This result vector is modified when more than one search term is included in the query. A separate temporary vector is maintained which is the product of the initial result vector from each of the terms in the query. The idea here is to emphasize those terms that are common to the multiple search terms in order to emphasize the shared meaning of the terms. The effect of the multiplication is to “AND” the two vectors”) and normalizing (Roitblat, Column 12, lines 39-43 “the user submits a query, which is stripped of stop words and extraneous nonalphabetic characters, and converted to lowercase. Words that are in the vocabulary cause the corresponding elements of the text vector to be set to 1.”) word embeddings of the expanded query terms as the result vector (Roitblat, Column 12, lines 55-59 “After stemming is complete, the elements of the query vector corresponding to the query terms are set to 1.0. The query vector is then submitted to the neural network and a result vector is produced. This result vector is modified when more than one search term is included in the query”);
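As an illustrative aside, the “ANDing” Roitblat describes in Column 12, lines 58-65 (taking the product of the initial result vectors of the individual query terms in order to emphasize their shared meaning) can be sketched as follows; the inputs are hypothetical and the normalization shown is a generic unit-length normalization.

    import numpy as np

    def and_query_vectors(term_result_vectors):
        # Element-wise product of the per-term result vectors emphasizes
        # components common to all query terms; the result is then normalized.
        combined = np.prod(np.asarray(term_result_vectors, dtype=float), axis=0)
        norm = np.linalg.norm(combined)
        return combined / norm if norm else combined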
g) retrieving, with the expanded query terms as result query vector (Roitblat, Column 9, lines 32-37), preselected documents as the best matching cluster (Roitblat, Column 7, lines 7-10 “As a result, we can limit the number of comparisons that have to be made between the query profile and the cached document profiles to just those that are in the cluster with the best match to the query profile (FIG. 4,404).”) from the at least one subset of the set of documents using the [[(Roitblat, Column 12, line 24 “the word-to-index dictionary, and the word-by-document matrix”; Column 4, Table 1), wherein the preselected documents are selected as the best matching cluster (Roitblat, Column 7, lines 7-10 “As a result, we can limit the number of comparisons that have to be made between the query profile and the cached document profiles to just those that are in the cluster with the best match to the query profile (FIG. 4,404).”) based on the expanded query terms as result query vector (Roitblat, Column 9, lines 32-37) and are used to (Please note this claim limitation has been read as an intended use of the claimed expanded query terms) limit the number of document embeddings (Roitblat, Column 13, lines 5-6 “This technique [referring to Column 12, lines 55-67 result query vector generation], as a result, allows the query terms to help disambiguate one another.”) to be (Please note this claim limitation has been read as an intended use; the claim does not require comparing in a similarity ranking, but merely recites the intent to use the document embeddings for such a comparison) compared in a similarity ranking (Roitblat, Column 6, lines 46-49 “It is then a simple matter to compare the semantic profile of the search terms against the semantic profiles of each stored page (FIG. 4, 405).”); and
h) comparing the query embedding with the document embeddings (Roitblat, Column 6, lines 46-49 “It is then a simple matter to compare the semantic profile of the search terms against the semantic profiles of each stored page (FIG. 4, 405).”) of the preselected documents (Roitblat, Column 7, lines 7-10 “As a result, we can limit the number of comparisons that have to be made between the query profile and the cached document profiles to just those that are in the cluster with the best match to the query profile (FIG. 4,404).”) to automatically determine a similarity score as measuring relevance (Roitblat, Column 9, lines 21-24 “Alternatively, one can compare the estimated text vector (also called a result vector) x̂ with each document's text vector and again measure relevance using the dot product of these vectors.”) for ranking (Roitblat, Column 10, lines 27-29 “The semantic profiles of the documents in this cluster are compared with the query vector and ranked in order of decreasing dot products”) the similarity of the preselected documents and displaying (Roitblat, Column 6, lines 50-51 “The pages that match most closely are the most relevant to the search and should be displayed first to the user”) or storing said preselected documents (Roitblat, Column 6, line 15 “then the document is kept and stored in the database”; Column 7, line 9 “the cached document profiles”).
Roitblat does not explicitly teach that the index is inverted.
Sommer teaches an inverted index… the inverted index (Sommer, Column 13, line 50 “for each field containing textual data, an inverted index is maintained, which maps all terms appearing in every document to the documents in the document database”).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the index taught by Roitblat as an inverted index as taught by Sommer, as it is a known indexing method within the field of art that would yield the expected results of enabling the system to locate documents using the terms. Within the proposed combination, the index still performs the same functionality as disclosed by Roitblat (to enable lookup and retrieval of the documents). The proposed combination merely replaces the structure of the index with the inverted structure depicted by Sommer.
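As an illustrative aside, the inverted index structure relied upon from Sommer (mapping every term appearing in any document to the documents containing it) can be sketched minimally as follows; the document contents are hypothetical.

    from collections import defaultdict

    def build_inverted_index(docs):
        # Maps each term to the set of ids of the documents containing it,
        # enabling lookup of documents directly from query terms.
        index = defaultdict(set)
        for doc_id, tokens in enumerate(docs):
            for token in tokens:
                index[token].add(doc_id)
        return index

    index = build_inverted_index([["fast", "car"], ["red", "car"]])
    # index["car"] == {0, 1}; index["red"] == {1}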
While heavily implied, Roitblat does not explicitly teach calculating a respective document embedding by adding.
Schuetze teaches calculating a respective document embedding (Schuetze, Page 5, line 39 “Ψ - document encoding”) for each document as document dj (Schuetze, Page 5, lines 44-45 “The representation Ψ(dj) is computed for document dj by summing up the vectors of all tokens occurring in it”) in the at least one set of the documents (Schuetze, Page 5, line 35 “D - a set of documents”) by adding (Schuetze, Page 5, lines 44-45 “The representation Ψ(dj) is computed for document dj by summing up the vectors of all tokens occurring in it”) … the word embeddings of all the tokenized character strings as the word embedding for all tokens (Id) within that document (Schuetze, Page 5, lines 44-45 “The representation Ψ(dj) is computed for document dj by summing up the vectors of all tokens occurring in it”) and normalizing (Figure 10, 210 “Normalize context vector”) the sum (Schuetze, Page 5, lines 44-45) of said word embeddings (Schuetze, Page 5, line 35 “ϕ - word encoding”) with a number of the tokenized character strings in that document (Schuetze, Page 13, line 57 to Page 14, line 1 “By normalizing the context vectors, all the context vectors will have the same length regardless of the size of the document”).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have calculated the document vectors of the proposed combination using the techniques taught by Schuetze, as it would yield the predictable results of generating the vectors without the need to train the ML algorithm.
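As an illustrative aside, Schuetze's context-vector computation as cited above (summing the vectors of all tokens occurring in a document and normalizing so that all document vectors have the same length regardless of document size) reduces to the following sketch; the unit-length normalization shown follows the passage quoted from Page 13, line 57 to Page 14, line 1, and differs from the token-count normalization illustrated earlier for step c).

    import numpy as np

    def context_vector(token_vectors):
        # Psi(d_j): sum the word encodings of all tokens occurring in the
        # document, then normalize to unit length so document size drops out.
        summed = np.asarray(token_vectors, dtype=float).sum(axis=0)
        return summed / np.linalg.norm(summed)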
Roitblat does not explicitly teach (e) performing query expansion, resulting in a set of expanded query terms comprising: i) first query terms that occur in the SimSet groups that include at least one query term; or ii) second query terms that do not occur in the SimSet groups but do occur in at least one document of the at least one subset of the documents; or iii) third query terms that do not occur in any document of the at least one set of the documents.
Vailaya teaches (e) performing query expansion as the recommended query is an expanded query (Vailaya, ¶171 “query recommender”; ¶173 “Query Recommender shall find alternatives to that term in product model numbers”) for example, generating alternative terms or corrected terms (Vailaya, ¶172 “find good alternatives”; ¶173 “the query recommender may correct a single unmatched term… shall find alternatives to that term”), resulting in a set of expanded query terms as the alternative or corrected terms (Id) comprising:
i) first query terms that occur in the SimSet groups that match at least one query term of the received query as the suggested ‘canon A40’ for the query term ‘canon A45’ (Vailaya, ¶173 “For example, suggest "canon A40" for "canon A45".”; ¶174 “Based on rest of the matched terms in the query, a dictionary is constructed. Additionally, words from the dictionary are sought that are closest to the replacement terms. If the proximity of the candidates passes certain thresholds, the model numbers corresponding to these candidates are returned”); or
ii) second query terms that do not occur in the SimSet groups as when a query term is not in the list of known dictionary terms (Vailaya, ¶192 “The algorithm assumes as input a list of dictionary terms (known model names that may consist of full model name, alphanumeric or alpha only model parts, etc.),”) but do occur in at least one document of the at least one subset of the set of documents as when the query terms match some documents, the recommender finds alternative product models from the model dictionaries (Vailaya, ¶173 “When all terms match some document, the query recommender shall take a content term and find product model alternatives. For example, suggest "sony DCRDVD200" for "sony dv 200."”); or
iii) third query terms that do not occur in any document of the at least one subset of the set of documents (Vailaya, ¶172 “When user makes mistakes in entering the query, they may not get the expected results. The mistake may be a result of misspelled words or imprecise model numbers. A query recommender tries to find good alternatives in these circumstances. For example, the query recommender may be used to correct product model numbers.”; ¶173 “when a single query term does not match any documents, Query recommender shall find alternatives to that term”) including misspelled or phonetically similar variants as misspelled words (Id).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the query vector generation within the proposed combination to include corrections to misspelled terms as suggested by Vailaya, as it yields the predictable results of capturing the concept the user is searching for in the query vector.
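As an illustrative aside, the kind of alternative-term recommendation relied upon from Vailaya (suggesting close dictionary terms for an unmatched or misspelled query term) can be sketched with a generic string-similarity matcher; Vailaya's own proximity thresholds and model-number dictionaries are not reproduced here, and the dictionary below is hypothetical.

    import difflib

    def recommend_terms(query_term, dictionary_terms, cutoff=0.8):
        # Return up to three known dictionary terms closest to the
        # unmatched query term, if they pass the similarity cutoff.
        return difflib.get_close_matches(query_term, dictionary_terms, n=3, cutoff=cutoff)

    print(recommend_terms("canon A45", ["canon A40", "canon A60", "sony dv200"]))
    # -> ['canon A40']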
Roitblat does not explicitly teach wherein the tokenized character strings are formed by tokenizing … punctuation marks … or calculating word embeddings … in a continuous vector space.
Bellegarda teaches wherein the tokenized character strings are formed by tokenizing… punctuation marks as the natural language analyzer determining tokens according to punctuation marks (Bellegarda, ¶209 “In some examples, based on the unstructured natural language information contained in messages 802, natural language analyzer 820 obtains unstructured natural language texts and determines token sequences according to semantics, syntax, and/or punctuation marks associated with the unstructured natural language texts.”; ¶210 “Using the above described scenario illustrated in FIG. 9A as an example, natural language analyzer 820 can obtain a token sequence corresponding to the messages 802A-C. For example, the token sequence can include the one or more words and punctuation marks included in messages 802A-C.”) or calculating word embeddings… in a continuous vector space (¶244 “latent semantic analysis techniques, in which clustering techniques are applied to the sequence representations (e.g. vectors)”; Page 78, Section “II. Latent Semantic Analysis”: “The LSA paradigm defines a mapping between the discrete sets V, T and a continuous vector space S, whereby each word wi in V is represented by a vector [vector ui] in S, and each document di in T is represented by a vector [vector vj] in S.”).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the text analysis and text vector generation using the techniques taught by Bellegarda, as it yields the predictable results of extracting term components. Within the proposed combination, Roitblat teaches performing principal component analysis and provides several methods known in the art for performing this operation.
Within the proposed combination, Roitblat teaches generating a vector space, but does not explicitly state that this vector space is continuous. One of ordinary skill in the art would recognize that the vector space generated by Roitblat would be equivalent to the continuous vector space taught by Bellegarda, and would view the specific vector spaces as mathematically equivalent.
One of ordinary skill in the art would recognize the method taught by Bellegarda as being yet another known method of performing this operation which would result in the same vector generation.
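As an illustrative aside, the LSA mapping Bellegarda cites (each word wi and each document dj represented by a vector in a shared continuous vector space S) can be sketched with a truncated singular value decomposition of a word-by-document matrix; the matrix values below are hypothetical and the sketch is not drawn from either reference.

    import numpy as np

    # Hypothetical word-by-document count matrix (rows: terms, columns: documents)
    W = np.array([[2.0, 0.0, 1.0],
                  [1.0, 3.0, 0.0],
                  [0.0, 1.0, 4.0]])

    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    k = 2                              # dimensionality of the continuous space S
    term_vectors = U[:, :k] * s[:k]    # one vector per word w_i in S
    doc_vectors = Vt[:k, :].T * s[:k]  # one vector per document d_j in S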
With regard to claim 8, Roitblat teaches An apparatus for pre-selecting (Roitblat, Column 10, lines 20-27 “The resulting profile is then compared with the centroids of each of the categories learned by the self-organizing map. One of these centroids will provide the best match to the query vector, so the patterns represented by that neuron are likely to be the best match for the query”) and determining similar documents as the documents in this cluster (Roitblat, Column 10, lines 27-28 “The semantic profiles of the documents in this cluster are compared with the query vector and ranked in order of decreasing dot product”) from a set of documents, wherein the set of documents as the documents retrieved (Roitblat, Column 5, line 63 “Documents are retrieved (FIG. 1, 104)”) have tokenized character strings (Roitblat, Column 5, line 67 “Words in the documents”), comprising:
at least one processor as the CPU required within the computer (Roitblat, Column 6, lines 39-41 “This collection can be stored on a server or on any other accessible computer, including the user's computer.”); and
at least one non-transitory computer readable medium comprising one or more instructions that, when executed by the at least one processor as the memory required within the computer (Id), cause the at least one processor to:
[a] perform an indexing method (Roitblat, Column 12, line 24 “the word-to-index dictionary, and the word-by-document matrix”; Column 4, Table 1) to calculate an [[ as by document (Roitblat, Column 12, line 24 “the word-to-index dictionary, and the word-by-document matrix”; Column 4, Table 1) of the set of documents (Roitblat, Column 12, line 24 “the word-to-index dictionary, and the word-by-document matrix”; Column 4, Table 1) that have the tokenized (Roitblat, Column 7, lines 30-36 “The first embodiment to be described is based on a neural network that extracts the principal components from the co-occurrence matrix. As shown in FIG. 1, the base text (the text corpus) is pre-processed to remove all formatting and all hard return characters except those between paragraphs (step 101). Very short paragraphs, such as titles are combined with the subsequent paragraphs.”; Figure 1, 101 “Process Base Text”; Please note that this claim limitation has been read in light of the definition given for “Tokenizing” provided in Page 15, lines 14-16: “coding. Tokenizing means breaking down a text into individually processable components (words, terms and punctuation marks).”) character strings (Roitblat, Column 5, line 67 “Words in the documents”; Figure 1, 102 “Built Text Term Vectors”), wherein each word embedding comprises a respective vector as the vector (Roitblat, Column 5, lines 49-54 “an input vector are taken to be a semantic profile of the text unit that produced the input vector. We can think of the elements of the hidden layer as representing a set of unnamed semantic primitives representing the words (in context) on which it was trained.”) corresponding to a respective tokenized character string as the text unit that is used to produce the input vector (Id), wherein the tokenized character strings are formed by tokenizing as extracting the principal components (Roitblat, Column 7, lines 30-36) each document as of each document (Roitblat, Column 7, lines 30-36) in the set of documents into respective character strings (Roitblat, Column 7, lines 37-39 “A dictionary (vocabulary) is created that maps word forms and their uninflected stems to elements in a text vector (step 102).”) by breaking down text as extracting (Roitblat, Column 7, lines 30-36) in each document into one or more of (Please note that only one of the following is required for the prior art to read on the claimed device) words as word forms (Roitblat, Column 7, line 37 “word forms and their uninflected stems”; Please note the claim limitation ‘word’ has been construed in light of Page 6, lines 8-9 of the original specification: “Coherent character strings (alphanumeric characters, hyphen) can be understood as words of a language”), terms as their uninflected stems (Roitblat, Column 7, lines 37-39; Please note the claim limitation ‘term’ has been construed in light of Page 6, lines 9-12 of the original specification: “A term can be regarded as a superset of words that can comprise still further punctuation marks or printable special characters or can consist of multiple related words and terms.”), or [[
[b] calculate word embeddings as sparse vectors when PCA is used (Roitblat, Column 7, lines 45-52 “the network implements a principal components analysis of the collection of text vectors. This analysis reduces the data representation from a set of sparse vectors with length K to a collection of reduced vectors with length N. It projects the original data vectors onto another set of vectors eliminating the redundancy, i.e., the correlation, among the elements of the original vectors.”) or low dimensional representations when decomposition techniques are used (Roitblat, Column 8, lines 48-52 “There are other techniques that can be used in place of principal components to project the high dimensional text vectors onto lower dimensional representations. These techniques are known in the statistical literature as matrix decomposition techniques.”) that represent the tokenized character strings as the text vectors when using PCA (Roitblat, Column 7, lines 46-50) or the high dimensional text vectors when using decomposition (Roitblat, Column 8, lines 48-52) of the at least one subset of the set of documents as the documents being analyzed (Roitblat, Column 7, lines 30-44) in a [[ as the covariance matrix when PCA is used (Roitblat, Column 7, line 67 - Column 8, line 1 “The matrix C is the KxK covariance matrix defined by C=E{xxT}”) or the lower dimensional subspace when decomposition techniques are used (Roitblat, Column 8, lines 48-52), wherein contextual information (Roitblat, Column 3, lines 5-7 “The present invention uses the context in which words appear as part of the representation of what those words mean.”) of the surrounding (Roitblat, Column 3, lines 20-24 “The context for the system comes from the analysis of a body of text, called a base text or a text corpus, that is representative of the interests of a particular community, as a guide to interpreting the words in their vocabulary, i.e., the aggregate of words in the text corpus”) tokenized character strings as the text (Id) is encoded in each component of their corresponding vector as projecting the text vector into the matrix (Roitblat, Column 7, lines 45-50; Column 8, lines 48-52), and wherein the resulting vectors for similar tokenized character strings have (Please note this claim limitation reads as a description of an inherent property of the vectors themselves, and does not recite a step of calculating similarity. One of ordinary skill in the art would recognize that the covariance matrix/dimensional subspace [Roitblat, Column 7, lines 45-50; Column 8, lines 48-52] would inherently have this property.) less spatial distance to each other than resulting vectors for non-similar tokenized character strings (Roitblat, Column 7, lines 1-3 “The proximity of one vector to another corresponds to the similarity between the two vectors, obtained using the dot product of the two vectors.”; One of ordinary skill in the art would recognize that while Roitblat discusses the use of this property within the high-dimensional space generated in step 109, this is built upon the covariance matrix/dimensional subspace generated during step 102, and that it is the use of the covariance matrix/dimensional subspace which provides the ability to use proximity to calculate this similarity. Roitblat specifically uses this property to identify duplicate documents: Column 6, lines 10-14 “If the cosine (the normalized dot product) between the document's original text vector”);
[c] calculate document embeddings as Document Term vectors (Figure 1, 105), wherein a respective document embedding for each document in the at least one subset of the set of documents as one vector for each document (Roitblat, Column 5, lines 63-65 “Documents are retrieved (FIG. 1, 104) and converted to text vectors - one vector for each document (FIG. 1, 105).”) by [[ as combining each paragraph's text vector with the subsequent text paragraph, e.g. counting the frequency of occurrence of words in the document (Roitblat, Column 5, lines 15-19 “One text vector is produced for each paragraph. The paragraphs that are used are those that occur naturally in the text, except that very short paragraphs-titles, for example-are combined with the subsequent text paragraphs.”; Figure 2, 203; Column 12, lines 3-4 “Entries in the matrix are the frequencies with which each word occurred in each document”), for each document of the at least one subset of the set of documents as one vector for each document (Roitblat, Column 5, lines 63-65 “Documents are retrieved (FIG. 1, 104) and converted to text vectors -- one vector for each document (FIG. 1, 105).”) the word embeddings of all of the tokenized character strings (Roitblat, Column 5, lines 15-19 “One text vector is produced for each paragraph. The paragraphs that are used are those that occur naturally in the text, except that very short paragraphs-titles, for example-are combined with the subsequent text paragraphs.”; Figure 2, 203; Column 12, lines 3-4 “Entries in the matrix are the frequencies with which each word occurred in each document”), and normalizing as normalizing (Roitblat, Column 12, lines 8-13) the sum as the matrix as a whole (Roitblat, Column 11, line 66 - Column 12, line 4 “The word-by-document matrix has as many rows as there are elements in the text vectors, i.e., K. Each row corresponds to one word or word stem. The columns of the matrix, stored in sparse form, are the documents remaining in the database. Entries in the matrix are the frequencies with which each word occurred in each document”) of said word embeddings as each row of the matrix, e.g. representing the text vectors (Id) with the number of the tokenized character strings in that document as the number of words in the document (Roitblat, Column 12, lines 8-13 “These entries are then further transformed by dividing them by the log of the number of words in the document. Normalizing by document length in this way reduces the inherent bias for longer documents, which typically contain more examples of any particular word in the dictionary.”);
[d] calculate SimSet groups (Roitblat, Column 6, lines 52-54 “the profiles for each of the cached documents can be organized into clusters”) using the previously calculated word embeddings as the semantic profile is the vector representing the document (Roitblat, Column 5, lines 49-54 “an input vector are taken to be a semantic profile of the text unit that produced the input vector. We can think of the elements of the hidden layer as representing a set of unnamed semantic primitives representing the words (in context) on which it was trained.”) of the tokenized character strings as the text unit (Id), in each document of the at least one subset of the set of documents (Roitblat, Column 6, lines 31-34 “The text vectors of those documents that pass the relevance tests are submitted to the neural network as input and the hidden unit activation pattern is used as a semantic profile of the page (FIG. 1, 109).”), which are sets of tokenized character strings computed by clustering (Roitblat, Column 6, lines 52-54 “the profiles for each of the cached documents can be organized into clusters”; Column 9, lines 63-64 “the documents are organized into clusters”) the previously calculated word embeddings as the semantic profile is the vector representing the document (Roitblat, Column 5, lines 49-54 “an input vector are taken to be a semantic profile of the text unit that produced the input vector. We can think of the elements of the hidden layer as representing a set of unnamed semantic primitives representing the words (in context) on which it was trained.”) that represent the respective tokenized character string as the text unit (Id) based on vector similarity (Roitblat, Column 7, lines 1-3 “The proximity of one vector to another corresponds to the similarity between the two vectors”; Roitblat, Column 9, lines 54-64) of those word embeddings as the two vectors (Id), via an unsupervised, non-parameterized clustering algorithm as k-means clustering is inherently an unsupervised, non-parameterized clustering algorithm (Roitblat, Column 6, lines 52-56 “the profiles for each of the cached documents can be organized into clusters using either self-organizing feature map neural networks or the equivalent K-means clustering statistical procedure.”; Claim 12, “using k-means clustering”), wherein for the calculation of the SimSet groups (Roitblat, Column 9, lines 63-64 “the documents are organized into clusters”) only the terms of the document set which are above an importance threshold as only preserving the most important relations among terms and neglecting those relationships that are more idiosyncratic in the text, via the use of the threshold (Roitblat, Column 11, lines 23-24 “A threshold is then calculated corresponding to nonlinear lateral inhibition in the neural network”; Column 11, lines 27-30 “Only those items substantially above the mean are maintained, the others are set to 0. In doing this, we preserve the most important relations among the terms and neglect those relationships that are more idiosyncratic in the text”) based on the largest combined as for the document matrix (Roitblat, Column 11, line 66) TF (Roitblat, Column 12, lines 4-5 “the frequencies with which each word occurred in each document”; line 7 “the word frequency”) IDF (Roitblat, Column 12, lines 8-9 “These entries are then further transformed by dividing them by the log of the number of words in the document”; line 12 “use this number to compute a log inverse document frequency, idf”) are used as only preserving the most important relations among terms and neglecting those relationships that are more idiosyncratic in the text, via the use of the threshold (Roitblat, Column 11, lines 23-24 “A threshold is then calculated corresponding to nonlinear lateral inhibition in the neural network”; Column 11, lines 27-30 “Only those items substantially above the mean are maintained, the others are set to 0. In doing this, we preserve the most important relations among the terms and neglect those relationships that are more idiosyncratic in the text”), and for the calculation of the SimSet groups a similarity threshold (Roitblat, Column 9, lines 53-60 “As a potential document is processed, it is transformed into a text vector x and passed through the neural network. The activation patterns of the units are saved as the semantic profile for the document and the result vector x̂ is obtained. The cosine of the angle between x and x̂ is computed. Those documents with cosines greater than a predetermined criterion (e.g., 0.39) are admitted to the database, the rest are discarded”; Column 11, lines 23-24 “A threshold is then calculated corresponding to nonlinear lateral inhibition in the neural network”; Column 11, lines 27-30 “Only those items substantially above the mean are maintained, the others are set to 0. In doing this, we preserve the most important relations among the terms and neglect those relationships that are more idiosyncratic in the text”; Column 11, lines 42-44 “Relatedness is computed using the cosine of the angle between the two vectors. Documents with cosines above 0.39 are maintained in the database, cosines below 0.39 are not processed further”) is further used to exclude as preserving the most important relations and neglecting those that are more idiosyncratic in text (Id) dissimilar terms as not judged similar (Id) when word embeddings as the vectors (Id) have a cosine similarity as cosine (Id) of less than 0 as only those items substantially above the mean are maintained, the others are set to 0 (Id), resulting in SimSet groups (Roitblat, Column 9, lines 63-64 “the documents are organized into clusters”) only comprising strings or words that:
d1) frequently occur in the text based on a term frequency TF (Roitblat, Column 12, lines 4-5 “the frequencies with which each word occurred in each document”; line 7 “the word frequency”),
d2) have a high information content based on the inverse document frequency IDF (Roitblat, Column 12, lines 8-9 “These entries are then further transformed by dividing them by the log of the number of words in the document”; line 12 “use this number to compute a log inverse document frequency, idf”), and
d3) are similar to each other (Roitblat, Column 7, lines 1-3 “The proximity of one vector to another corresponds to the similarity between the two vectors”);
[e] perform a query expansion as generating a result query vector (Roitblat, Column 9, lines 32-37 “may be computationally more efficient to compute the result vector and compare a sparse representation of the terms in the document vector with a sparse representation of the result vector. The only documents that need to be evaluated are those containing the terms described by the result vector,”), [[
i) first query terms that occur in the SimSet groups (Roitblat, Column 6, lines 46-47 “represents a semantic profile of each search term”; Column 9, lines 10-20) that include at least one query term of the received search query as identifying the centroid that matches the query vector (Roitblat, Column 10, lines 24-27 “One of these centroids will provide the best match to the query vector, so the patterns represented by that neuron are likely to be the best match for the query.”); or
ii) [[
iii) [[
[f] calculate a query embedding as the semantic profile for the query (Roitblat, Column 6, lines 42-44 “A user query (FIG. 4, 401) is processed through the same neural network to produce its semantic profile (FIG. 4, 402, 403).”) from said expanded query terms as the result query vector (Roitblat, Column 9, lines 32-37);
[g] retrieving with the query expansion as result query vector (Roitblat, Column 9, lines 32-37) preselected documents as the best matching cluster (Roitblat, Column 7, lines 7-10 “As a result, we can limit the number of comparisons that have to be made between the query profile and the cached document profiles to just those that are in the cluster with the best match to the query profile (FIG. 4,404).”) from the at least one subset of the set of documents using the [[(Roitblat, Column 12, line 24 “the word-to-index dictionary, and the word-by-document matrix”; Column 4, Table 1), wherein the preselected documents are selected as the best matching cluster (Roitblat, Column 7, lines 7-10 “As a result, we can limit the number of comparisons that have to be made between the query profile and the cached document profiles to just those that are in the cluster with the best match to the query profile (FIG. 4,404).”) to quantitatively limit the number (Roitblat, Column 10, lines 30-33 “Typically only the documents from one cluster will have to be compared one by one to the query vector and the remaining documents in the database will not need to be examined”) of document embeddings as semantic profile of the document (Roitblat, Column 6, lines 46-49 “It is then a simple matter to compare the semantic profile of the search terms against the semantic profiles of each stored page (FIG. 4, 405).”) to be compared as compared (Id; Column 7, lines 7-10 “limit the number of comparisons”); and
[h] compare the query embedding with the document embeddings (Roitblat, Column 6, lines 46-49 “It is then a simple matter to compare the semantic profile of the search terms against the semantic profiles of each stored page (FIG. 4, 405).”) of the preselected documents (Roitblat, Column 7, lines 7-10 “As a result, we can limit the number of comparisons that have to be made between the query profile and the cached document profiles to just those that are in the cluster with the best match to the query profile (FIG. 4,404).”) to automatically determine a similarity score as measuring relevance (Roitblat, Column 9, lines 21-24 “Alternatively, one can compare the estimated text vector (also called a result vector) x̂ with each document's text vector and again measure relevance using the dot product of these vectors.”)
for each document and rank the documents accordingly (Roitblat, Column 10, lines 27-30 “The semantic profiles of the documents in this cluster are compared with the query vector and ranked in order of decreasing dot products, corresponding to decreasing relevance to the query”) and to display or store said preselected documents (Roitblat, Column 6, lines 49-51 “The pages that match most closely are the most relevant to the search and should be displayed first for the user.”).
Roitblat does not explicitly teach that the index is inverted.
Sommer teaches an inverted index… the inverted index (Sommer, Column 13, line 50 “for each field containing textual data, an inverted index is maintained, which maps all terms appearing in every document to the documents in the document database”).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the index taught by Roitblat as an inverted index as taught by Sommer, as it is a known indexing method within the field of art that would yield the expected results of enabling the system to locate documents using the terms. Within the proposed combination, the index still performs the same functionality as disclosed by Roitblat (to enable lookup and retrieval of the documents). The proposed combination merely replaces the structure of the index with the inverted structure depicted by Sommer.
While heavily implied, Roitblat does not explicitly teach calculating a respective document embedding by adding.
Schuetze teaches calculating a respective document embedding (Schuetze, Page 5, line 39 “Ψ - document encoding”) for each document as document dj (Schuetze, Page 5, lines 44-45 “The representation Ψ(dj) is computed for document dj by summing up the vectors of all tokens occurring in it”) in the at least one set of the documents (Schuetze, Page 5, line 35 “D - a set of documents”) by adding (Id) … the word embedding of all the tokenized character strings as the word embedding for all tokens (Id) with that document (Id) and normalizing (Schuetze, Figure 10, 210 “Normalize context vector”) the sum (Schuetze, Page 5, lines 44-45) of said word embeddings (Schuetze, Page 5, line 35 “ϕ - word encoding”) with a number of the tokenized character strings in that document (Schuetze, Page 13, line 57 to Page 14, line 1 “By normalizing the context vectors, all the context vectors will have the same length regardless of the size of the document”).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have calculated the document vectors of the proposed combination using the techniques taught by Schuetze, as doing so would yield the predictable result of generating the vectors without the need to train an ML algorithm.
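Examiner’s note (illustrative only): a minimal sketch of the document-embedding computation taught by Schuetze, assuming Python with numpy; the unit-length normalization follows the cited passage, whereas the claim recites dividing by the number of tokenized character strings instead.

    import numpy as np

    def document_embedding(token_vectors):
        # Sum the word embeddings of all tokens occurring in the document.
        total = np.sum(token_vectors, axis=0)
        # Normalize so every document vector has the same length
        # regardless of the size of the document.
        return total / np.linalg.norm(total)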
Roitblat does not explicitly teach (e) performing query expansion, resulting in a set of expanded query terms comprising: i) first query terms that occur in the SimSet groups that include at least one query term; or ii) second query terms that do not occur in the SimSet groups but do occur in at least one document of the at least one subset of the documents; or iii) third query terms that do not occur in any document of the at least one set of the documents.
Vailaya teaches (e) performing query expansion as the recommended query is an expanded query (Vailaya, ¶171 “query recommender”; ¶173 “Query Recommender shall find alternatives to that term in product model numbers”), for example, generating alternative terms or corrected terms (Vailaya, ¶172 “find good alternatives”; ¶173 “the query recommender may correct a single unmatched term… shall find alternatives to that term”), resulting in a set of expanded query terms as the alternative or corrected terms (Id) comprising:
i) first query terms that occur in the SimSet groups that match at least one query term of the received query, as the suggested “canon A40” for the query term “canon A45” (Vailaya, ¶173 “For example, suggest "canon A40" for "canon A45".”; ¶174 “Based on rest of the matched terms in the query, a dictionary is constructed. Additionally, words from the dictionary are sought that are closest to the replacement terms. If the proximity of the candidates passes certain thresholds, the model numbers corresponding to these candidates are returned”); or
ii) second query terms that do not occur in the SimSet groups as when a query term is not in the list of known dictionary terms (Vailaya, ¶192 “The algorithm assumes as input a list of dictionary terms (known model names that may consist of full model name, alphanumeric or alpha only model parts, etc.),”) but do occur in at least one document of the at least one subset of the set of documents as when the query terms match some documents, the recommender finds alternative product models from the model dictionaries (Vailaya, ¶173 “When all terms match some document, the query recommender shall take a content term and find product model alternatives. For example, suggest "sony DCRDVD200" for "sony dv 200."”); or
iii) third query terms that do not occur in any document of the at least one subset of the set of documents (Vailaya, ¶172 “When user makes mistakes in entering the query, they may not get the expected results. The mistake may be a result of misspelled words or imprecise model numbers. A query recommender tries to find good alternatives in these circumstances. For example, the query recommender may be used to correct product model numbers.”; ¶173 “when a single query term does not match any documents, Query recommender shall find alternatives to that term”) including misspelled or phonetically similar variants as misspelled words (Id).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the query vector generation within the proposed combination to include corrections to misspelled terms as suggested by Vailaya, as doing so yields the predictable result of capturing the concept the user is searching for in the query vector.
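Examiner’s note (illustrative only): a minimal sketch of the correction step suggested by Vailaya, proposing the closest known dictionary terms (e.g., model numbers) as alternatives for an unmatched query term; Python standard library, with hypothetical identifiers and placeholder data.

    import difflib

    def suggest_alternatives(term, dictionary, n=3, cutoff=0.6):
        # When a query term does not match any document, find the closest
        # candidates in the list of known dictionary terms.
        return difflib.get_close_matches(term, dictionary, n=n, cutoff=cutoff)

    print(suggest_alternatives("canon A45", ["canon A40", "canon G2", "sony DCRDVD200"]))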
Roitblat does not explicitly teach wherein the tokenized character strings are formed by tokenizing … punctuation marks … or calculating word embeddings … in a continuous vector space.
Bellegarda teaches wherein the tokenized character strings are formed by tokenizing… punctuation marks as natural language analyzation determining tokens according to punctuation marks (Bellegarda, ¶209 “In some examples, based on the unstructured natural language information contained in messages 802, natural language analyzer 820 obtains unstructured natural language texts and determines token sequences according to semantics, syntax, and/or punctuation marks associated with the unstructured natural language texts.”; ¶210 “Using the above described scenario illustrated in FIG. 9A as an example, natural language analyzer 820 can obtain a token sequence corresponding to the messages 802A-C. For example, the token sequence can include the one or more words and punctuation marks included in messages 802A-C.”) or calculating word embeddings… in a continuous vector space (Bellegarda, ¶244 “latent semantic analysis techniques, in which clustering techniques are applied to the sequence representations (e.g. vectors)”; Page 78, Section “II. Latent Semantic Analysis”: “The LSA paradigm defines a mapping between the discrete sets V, T and a continuous vector space S, whereby each word wi in V is represented by a vector [vector ui] in S, and each document di in T is represented by a vector [vector vj] in S.”).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the text analysis and text vector generation using the techniques taught by Bellegarda, as doing so yields the predictable result of extracting term components. Within the proposed combination, Roitblat teaches performing principal component analysis and provides several methods known in the art for performing this operation.
Within the proposed combination, Roitblat teaches generating a vector space but does not explicitly state that this vector space is continuous. One of ordinary skill in the art would recognize that the vector space generated by Roitblat is equivalent to the continuous vector space taught by Bellegarda, and would view the specific vector spaces as mathematically equivalent.
One of ordinary skill in the art would recognize the method taught by Bellegarda as yet another known method of performing this operation, one that would result in the same vector generation.
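Examiner’s note (illustrative only): a minimal sketch of tokenization into word and punctuation-mark tokens as discussed above; Python standard library.

    import re

    def tokenize(text):
        # Break the text down into word tokens and punctuation-mark tokens.
        return re.findall(r"\w+|[^\w\s]", text)

    print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']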
With regard to claims 4 and 10, the proposed combination does not explicitly teach the use of a divisive clustering method or an agglomerative method. Schuetze teaches wherein the clustering method is in the form of a hierarchic method, in particular a divisive clustering method or an agglomerative method (Schuetze, Page 9, lines 1-5 “group average agglomerative clustering”; Page 12, lines 11-14 “The clustering algorithm… using group average agglomerative clustering”). It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the proposed combination using the clustering methods taught by Schuetze in place of the clustering method taught by Roitblat, as doing so would yield the expected result of identifying a valid cluster of word embeddings. Please note that one of ordinary skill in the art would recognize any of the disclosed clustering methods as being expected to generate a valid cluster of word embeddings; each method is well established and has its own known advantages and disadvantages in the art.
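Examiner’s note (illustrative only): a minimal sketch of group-average agglomerative clustering of word embeddings, assuming Python with scikit-learn; the data is a random placeholder.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    embeddings = np.random.rand(20, 8)  # placeholder word embeddings
    # linkage="average" corresponds to group average agglomerative clustering.
    labels = AgglomerativeClustering(n_clusters=4, linkage="average").fit_predict(embeddings)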
With regard to claim 7, the proposed combination further teaches wherein a cosine similarity (Roitblat, Column 6, line 12 “if the cosine (the normalized dot product)”; Column 10, lines 9-19 “the dot product … The neuron with the highest dot product is called the winner… The neurons in the sheet come to represent the centroids of profile vector categories and nearby neurons come to represent nearby (in similarity space) clusters”), a term frequency and/or an inverse document frequency (Roitblat, Column 6, lines 1-9; Column 12, lines 3-6 “Entries in the matrix are the frequencies with which each word occurred in each document”) are used as a threshold value for a cluster formation (Roitblat, Column 10, lines 61-65 “The essence of this learning rule is a matrix in which the rows represent all of the unique words in the vocabulary and the columns represent the frequencies with which each other word appeared with that specific word.”; Column 12, lines 1-29).
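Examiner’s note (illustrative only): cosine similarity as the normalized dot product, per the cited passage of Roitblat; Python with numpy.

    import numpy as np

    def cosine_similarity(a, b):
        # The normalized dot product of two vectors.
        return float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))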
With regard to claim 9, the proposed combination further teaches wherein the tokenized character strings are words (Roitblat, Column 5, line 67 “Words in the documents”).
Claims 2, 5, 6, 11, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Roitblat in view of Sommer, Schuetze, Vailaya, Bellegarda, and Tacchi [9747165].
With regard to claim 2, the proposed combination teaches all the limitations of claim 1 as discussed above. Neither Roitblat nor Schuetze teaches wherein the word embedding method used is a CBOW model or a Skip-gram model. Tacchi teaches wherein the word embedding method used is a CBOW model as a bag-of-words model (Tacchi, Column 4, line 66 through Column 5, line 1 “algorithm may be used by the feature vector creator 142 for feature extraction. Or in some embodiments, a bag-of-words model may be used, with vectors corresponding to n-grams”; please note that one of ordinary skill in the art would recognize the claimed “CBOW” as an acronym for “continuous bag of words”) or a Skip-gram model as an n-gram model (Id). It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the word embedding of the proposed combination using the known embedding models taught by Tacchi, as doing so yields the predictable result of generating feature vectors for words.
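Examiner’s note (illustrative only): a minimal sketch of training CBOW and Skip-gram word embeddings, assuming Python with the gensim library; the toy corpus is a placeholder.

    from gensim.models import Word2Vec

    sentences = [["semantic", "search"], ["vector", "search", "engine"]]
    cbow = Word2Vec(sentences, vector_size=50, sg=0, min_count=1)      # sg=0: CBOW
    skipgram = Word2Vec(sentences, vector_size=50, sg=1, min_count=1)  # sg=1: Skip-gram
    print(cbow.wv["search"].shape)  # (50,)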
With regard to claim 5, the proposed combination does not explicitly teach wherein the clustering method is a DBSCAN method or an OPTICS method. Tacchi teaches wherein the clustering method comprises a DBSCAN method (Tacchi, Column 12, lines 21-24 “Some embodiments may execute a density-based clustering algorithm, like DBSCAN, to establish groups corresponding to the resulting clusters and exclude outliers”) or an OPTICS method.
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the proposed combination using any one of the clustering methods taught by Tacchi in place of the clustering method taught by Roitblat, as doing so would yield the expected result of identifying a valid cluster of word embeddings. Please note that one of ordinary skill in the art would recognize any of the disclosed clustering methods as being expected to generate a valid cluster of word embeddings; each method is well established and has its own known advantages and disadvantages in the art.
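Examiner’s note (illustrative only): a minimal sketch of density-based clustering of word embeddings with DBSCAN, assuming Python with scikit-learn; random placeholder data.

    import numpy as np
    from sklearn.cluster import DBSCAN

    embeddings = np.random.rand(30, 8)  # placeholder word embeddings
    # DBSCAN groups dense regions into clusters and labels outliers as -1.
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(embeddings)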
With regard to claims 6 and 12, the proposed combination does not explicitly teach wherein the clustering method is a spectral clustering method or a Louvain method. Tacchi teaches wherein the clustering method is a spectral clustering method or a Louvain method (Tacchi, Column 4, lines 18-37 “The clustering method may cluster the documents into document collections, e.g., based on semantic similarity… the semantic similarity graph may be clustered. In one example, Louvain method for community detection, or other greedy optimization clustering algorithm, may be applied by the clustering module”).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the proposed combination using any one of the clustering methods taught by Tacchi in place of the clustering method taught by Roitblat, as doing so would yield the expected result of identifying a valid cluster of word embeddings. Please note that one of ordinary skill in the art would recognize any of the disclosed clustering methods as being expected to generate a valid cluster of word embeddings; each method is well established and has its own known advantages and disadvantages in the art.
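Examiner’s note (illustrative only): a minimal sketch of spectral clustering of word embeddings, assuming Python with scikit-learn; random placeholder data.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    embeddings = np.random.rand(30, 8)  # placeholder word embeddings
    labels = SpectralClustering(n_clusters=4, affinity="nearest_neighbors").fit_predict(embeddings)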
With regard to claim 11, the proposed combination does not explicitly teach wherein the clustering method is a density-based method. Tacchi teaches wherein the clustering method is in the form of a density-based method (Tacchi, Column 12, lines 21-24 “Some embodiments may execute a density-based clustering algorithm, like DBSCAN, to establish groups corresponding to the resulting clusters and exclude outliers”; the term “density-based method” has been read in light of Page 13, lines 34-35 of the original specification, which recites “a density-based method, in particular DBSAN or OPTICS”).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains, before the effective filing date of the claimed invention, to have implemented the proposed combination using any one of the clustering methods taught by Tacchi in place of the clustering method taught by Roitblat, as doing so would yield the expected result of identifying a valid cluster of word embeddings. Please note that one of ordinary skill in the art would recognize any of the disclosed clustering methods as being expected to generate a valid cluster of word embeddings; each method is well established and has its own known advantages and disadvantages in the art.
Response to Arguments
Applicant's arguments filed December 24, 2025 have been fully considered but they are not persuasive.
With regard to the prior art, applicant argues that the claims require that each tokenized character string be mapped to its own respective vector.
In response, it is noted that the claim recites “wherein each word embedding comprises a respective vector corresponding to a respective tokenized character string”. This does not require that all character strings in the document have a word embedding; it requires that each word embedding comprises a vector that corresponds to a tokenized character string. Furthermore, the claim does not specify what qualifies as “a tokenized character string”. One of ordinary skill in the art may reasonably read the tokenized character string as any set of character strings tokenized from the document, including a set that includes more than one term or phrase. The prior art generates a vector for each text input (Roitblat, Column 5, lines 49-54). This ensures that each vector corresponds to the specific tokenized character string, namely the text unit that was used to produce said vector.
The claim language does not preclude the vectors from being reduced through dimensionality reduction. After reduction, each vector still corresponds to the tokenized character string that was originally used to produce said vector.
Applicant further argues that the vectors disclosed in Roitblat are produced through statistical dimensionality reduction of co-occurrence matrices and are not generated as stand-alone embedded vectors corresponding to individualized token character strings for reuse across the claimed processing pipeline.
In response, it is noted that Roitblat explicitly teaches that one text vector is produced for each paragraph (Roitblat, Column 5, line 15). One of ordinary skill in the art may reasonably read that paragraph as an “individualized token character string”. The claim does not recite “stand-alone embedded vectors” or that said embedded vectors are “reused” across the claimed processing pipeline. Furthermore, these terms are undefined and do not have a clear meaning within the context of the claimed device. To be clear, it is unclear what applicant argues is the structural or functional distinction between the intended vectors and the vectors generated by the prior art. The claim language does not preclude vectors that are dimensionally reduced. The prior art addresses the structure of the claimed vectors.
The cited claim mapping addresses the broadest reasonable interpretation of the claim language.
Applicant argues that the clustering performed by Roitblat is performed on document-level semantic profiles, and that the resulting clusters organize documents rather than token-level representations.
In response to the preceding argument, it is noted that the claim recites a “tokenized character string” as being “formed by tokenizing each document into the set of respective character strings by breaking down text in each document into one or more of words, terms, or punctuation marks”. This does not restrict what a “tokenized character string” is, but instead recites how a tokenized character string is formed. Roitblat processes the base text, the base text including words, terms, and punctuation marks. The base text is processed by extracting “word forms” and their “uninflected stems” as elements in the text vector. One of ordinary skill in the art would recognize this as meaning that the text is broken down into its word forms, e.g. words, and their uninflected stems, e.g. terms, and that the system then generates the vector based on this information. As such, one of ordinary skill in the art would read Roitblat as addressing the claim language.
Furthermore, even if one were to take the claim language to indicate that the tokenized character string is meant to be “one or more words, terms, or punctuation marks”, one of ordinary skill in the art would recognize that a paragraph (Roitblat, Column 5, line 15) is comprised of one or more words, terms, or punctuation marks, and therefore reads on the claim language. One of ordinary skill in the art would recognize a document as being comprised of one or more words, terms, or punctuation marks. As such, it is reasonable for one of ordinary skill in the art to read the “tokenized character string” as including the document itself. Applicant's arguments suggest that applicant intends each word in a document to have its own vector; the claim language simply does not require this limited scope.
Based on the above reasoning the applied art reads on the claim language.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMANDA WILLIS whose telephone number is (571)270-7691. The examiner can normally be reached Monday-Friday 8am-2pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ajay Bhatia can be reached at 571-272-3906. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AMANDA L WILLIS/ Primary Examiner, Art Unit 2156