Last updated: May 29, 2026
Application No. 19/207,257
TOKENIZED TEXT FOR EFFICIENT SEARCHING BY MACHINE LEARNING (ML) APPLICATIONS

Status: Non-Final OA (§103, §112)
Filed: May 13, 2025
Priority: Apr 26, 2024 (provisional 63/639,536, +2 more)
Examiner: SOMERS, MARC S
Art Unit: 2159
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Anacode Labs Inc.
OA Round: 1 (Non-Final)
This examiner grants 65% of cases after interview (+34.5% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 567 resolved cases, 2023–2026
Examiner Intelligence

SOMERS, MARC S
Career Allowance Rate: grants 65% of resolved cases (367 granted / 567 resolved), +9.7% vs TC avg
Interview Lift: +34.5% (resolved cases with interview vs. without)
Typical timeline: 3y 11m avg prosecution; 22 currently pending
Career history: 600 total applications across all art units
Statute-Specific Performance

§101: 6.0% (-34.0% vs TC avg)
§103: 73.0% (+33.0% vs TC avg)
§102: 4.1% (-35.9% vs TC avg)
§112: 5.4% (-34.6% vs TC avg)
Based on career data from 567 resolved cases
Office Action

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification
The use of the terms Kaggle and Wikipedia, each of which is a trade name or a mark used in commerce, has been noted in this application. Each term should be accompanied by the generic terminology; furthermore, each term should be capitalized wherever it appears or, where appropriate, include a proper symbol indicating use in commerce such as ™, SM, or ® following the term.
Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) are permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.

Claim Objections
Claims 1-4 and 7-10 are objected to because of the following informalities: 
Claim 7 recites the acronym “AI” in the body of the claim without first defining the term.
Claim 10 recites the acronym “API” in the body of the claim without first defining the term.
Claim 8 recites the phrase “clients distributed over s data communication network” where the letter ‘s’ is being construed as a typographical error and meant to be the letter ‘a’.  
Claims 1-4, 9, and 10 recite tokenID without defining the presumed acronym at the end of the term; the Examiner recommends defining the term (e.g., “token identifiers (tokenIDs)”).
Claims 1, 9, and 10 recite similar issues as the tokenID above but with other similar phrases such as chunkID and blockIDs.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites the limitation "the raking" in the last limitation of the body of the claim. There is insufficient antecedent basis for this limitation in the claim. It is unclear what process or function “the raking” is meant to represent since there does not appear to be any raking step recited earlier. Is it a parsing or filtering step? For purposes of compact prosecution, the Examiner is construing the phrase “the raking” to refer to “the ranking” to coincide with the comparing step, which has the intended result of “rank sentences”.
Claim 1 recites the limitation "the blockIDs" in the identifying tokenIDs limitation. There is insufficient antecedent basis for this limitation in the claim. There is no mention of any blocks or blockIDs before the above noted recitation; the only IDs mentioned are chunkID and tokenID. For purposes of compact prosecution, the Examiner is construing blockIDs to refer to chunkIDs.
Claim 1 recites the limitation "the list of tokenized sentences" in the comparing limitation. There is insufficient antecedent basis for this limitation in the claim. The claim makes mention of filtering the tokenized database “for the one or more tokenized words” but makes no mention of a list of tokenized sentences. It is unclear if the limitation refers to (1) a sentence being represented as a single token; (2) any sentence that was parsed/tokenized to identify words; or (3) a sentence made up of tokens (or tokenIDs).
Claim 1 recites the limitation "the search query" in the body of the claim. There is insufficient antecedent basis for this limitation in the claim. The claim recites determining words from the one or more sentences to use for querying the tokenized database. Is the search query the respective words from the received one or more sentences, or does it refer to a different step?
Claims 2-8 depend upon claim 1 and inherit the same deficiencies as noted above and are rejected for similar reasons as discussed above.
Claims 9 and 10 are substantially similar to claim 1 and have the same issues as claim 1 as discussed above and are rejected for the same reasons as claim 1.
Claim 3 recites the limitation "the query response" in the body of the claim.  There is insufficient antecedent basis for this limitation in the claim.  The claim does not recite any prior “query response” in the parent claim and it is unclear what “the query response” is meant to refer to.  The independent claim indicates a reply to the ML source with the sentences and determining words from the sentences for querying the tokenized database.  
Claim 4 recites the limitation "the tokenizing one or more facts" in the body of the claim.  There is insufficient antecedent basis for this limitation in the claim.  The claim does not recite any prior fact or facts in the parent claim and it is unclear what “the one or more facts” is meant to refer to.  For purposes of compact prosecution, the Examiner is construing the term to relate to the tokenized sentences in claim 1.

Claims 2-4 contain the trademark/trade name Kaggle. Where a trademark or trade name is used in a claim as a limitation to identify or describe a particular material or product, the claim does not comply with the requirements of 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph. See Ex parte Simpson, 218 USPQ 1020 (Bd. App. 1982). The claim scope is uncertain since the trademark or trade name cannot be used properly to identify any particular material or product. A trademark or trade name is used to identify a source of goods, and not the goods themselves. Thus, a trademark or trade name does not identify or describe the goods associated with the trademark or trade name. In the present case, the trademark/trade name is used to identify/describe a marketplace or source to provision/retrieve data or datasets for data modeling and, accordingly, the identification/description is indefinite.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1 and 5-10 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al [US 2020/0193153 A1] in view of Dean et al [US 2006/0036593 A1] and Khanwalkar et al [US 12,393,620] (provisional date on Feb 2, 2024 from provisional US 63/539,107).
With regard to claim 1, Lee teaches a method, in a computer device, for tokenizing text for efficient searching by machine learning (ML) applications, the method comprising: providing a database of tokenized data, wherein the tokenized database has been trained with a chunk of original text with words that have been compressed with tokens corresponding to the words (see paragraphs [0131], [0136], and [0137]; the system can have a dictionary or vocabulary/database or tokens/words where the system can be trained to develop or recognize particular tokens/words/text fragments most suited for the particular dataset(s) intended to be used by the system;
“Tokenizer 542 takes as input a sequence of characters representing an unknown number of words and unlimited vocabulary and transforms the sequence of characters into a sequence of tokens with a fixed size vocabulary. The sequence of characters may comprise a text segment such as input text segment or stored text segment. Each token may comprise a subset of the character sequence representing a word or part of a word. The vocabulary may comprise a set of unique tokens. The number of unique tokens, also called vocabulary size, may be finite. The vocabulary size may be specified as a parameter. The resulting list of tokens is more easily processed by other parts of the model than raw text. For example, input text tokens 552 and stored text tokens 553 may be generated by tokenizer 542 from input text segment 550 and stored text segment 551 respectively.”, paragraph 136;
“Tokenizer 542 may be trained to perform the transformation using large amounts of training text which may be representative of inputs to text similarity model 505. In some embodiments, tokenizer 542 may be trained by splitting the training text using a heuristic or regular expression such as splitting on white space. The resulting text fragments may be scanned. The most common fragments up to the vocabulary size may be recorded as tokens, while less common fragments may be discarded or replaced with a placeholder token.”, para 137), 
wherein the text chunk is assigned a chunkID (see paragraph [0058]; the text chunks or text segments can have an ID associated with it; 
“The stored text segment table 210 may store individual text segments from a plurality of reference documents and collectively comprise the entire text of the reference documents. Each text segment element 216 of the table may include data in a plurality of fields. An ID field 211 may provide a unique identifier for the text segment element to enable quick indexing and retrieval. The ID of the text segment element may be used as an identifier of the text segment element by other parts of the system.”);
receiving, 
“Pre-processing may comprise removing punctuation, addressing abbreviations by removing them or replacing them with the corresponding full word based on lookup in a dictionary of abbreviations, removing unknown words or rare words, removing stop words, which may comprise words that are so common as to be unhelpful in text processing, lemmatization, stemming, or other processing.”, paragraph 116;
“In one approach, the encoding for a text segment may be generated by pre-processing the text segment and iterating over the words of the pre-processed text segment and looking up the corresponding word embedding for each word of the pre-processed text segment from a dictionary of word embeddings.”, paragraph 131;
“Each token may comprise a subset of the character sequence representing a word or part of a word. The vocabulary may comprise a set of unique tokens. The number of unique tokens, also called vocabulary size, may be finite. The vocabulary size may be specified as a parameter. The resulting list of tokens is more easily processed by other parts of the model than raw text. For example, input text tokens 552 and stored text tokens 553 may be generated by tokenizer 542 from input text segment 550 and stored text segment 551 respectively.”, paragraph 136);
identifying token
“Each text segment is tokenized using tokenizer 542 and encoded using language model 544.”, para 135;
“Tokenizer 542 takes as input a sequence of characters representing an unknown number of words and unlimited vocabulary and transforms the sequence of characters into a sequence of tokens with a fixed size vocabulary.”, para 136), 
wherein token
“Tokenizer 542 takes as input a sequence of characters representing an unknown number of words and unlimited vocabulary and transforms the sequence of characters into a sequence of tokens with a fixed size vocabulary. The sequence of characters may comprise a text segment such as input text segment or stored text segment. Each token may comprise a subset of the character sequence representing a word or part of a word. The vocabulary may comprise a set of unique tokens. The number of unique tokens, also called vocabulary size, may be finite. The vocabulary size may be specified as a parameter. The resulting list of tokens is more easily processed by other parts of the model than raw text. For example, input text tokens 552 and stored text tokens 553 may be generated by tokenizer 542 from input text segment 550 and stored text segment 551 respectively.”, para 136
“Tokenizer 542 may be trained to perform the transformation using large amounts of training text which may be representative of inputs to text similarity model 505. In some embodiments, tokenizer 542 may be trained by splitting the training text using a heuristic or regular expression such as splitting on white space. The resulting text fragments may be scanned. The most common fragments up to the vocabulary size may be recorded as tokens, while less common fragments may be discarded or replaced with a placeholder token. In other embodiments, algorithms such as Sub Word tokenization or WordPiece tokenization may be trained on the training text in order to produce tokenizer 542. In some embodiments, complex, compound, or uncommon words may be split into one or more tokens representing parts of the word in order to satisfy the vocabulary size limit. For example, compound words may be split into tokens representing each component word, or prefixes and suffixes may be split as separate tokens from the root word.”, para 137).
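For illustration only, the whitespace-splitting embodiment quoted from Lee, paragraph 137, can be sketched as below. This is a minimal sketch, not Lee's implementation: the function names, the `<unk>` placeholder string, and the sample text are all hypothetical, and the sketch assumes the placeholder token is not counted against the vocabulary size.

```python
from collections import Counter

PLACEHOLDER = "<unk>"  # hypothetical name for the placeholder token


def train_vocabulary(training_text, vocab_size):
    """Split on white space and keep the most common fragments as tokens."""
    counts = Counter(training_text.split())
    # most_common() orders by descending count; ties keep first-seen order.
    return [fragment for fragment, _ in counts.most_common(vocab_size)]


def tokenize(text, vocabulary):
    """Map each fragment to itself if known, otherwise to the placeholder."""
    known = set(vocabulary)
    return [frag if frag in known else PLACEHOLDER for frag in text.split()]


vocab = train_vocabulary("the cat sat on the mat the cat ran", vocab_size=2)
tokens = tokenize("the dog sat on the cat", vocab)
print(vocab)
print(tokens)
```

With a vocabulary size of 2, only the two most frequent fragments survive as tokens, and every other fragment in the input is replaced by the placeholder.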
Lee does not appear to explicitly teach:
at least some of the words are assigned a tokenID;
receiving, from a ML source, one or more sentences, and determining one or more words from the one or more sentences to use for querying the tokenized database;
identifying tokenIDs from a database of tokenIDs corresponding to the blockIDs for the one or more words from the one or more sentences, wherein the tokenID database associates a list of blockIDs to tokenIDs of words using a fixed number of bytes, wherein tokenIDs are limited in number based on the fixed number of bytes and assigned based at least in part on frequency of use; 
filtering the tokenized database based on the tokenIDs for the one or more tokenized words from the search query, wherein each tokenID exposes a list of blocksIDs; 
decompressing a chunk of original text corresponding to each of the chunkIDs; 
comparing the one or more sentences to each sentence of the list of tokenized sentences to rank sentences; 
and replying, back to the ML source, one or more sentences based on the raking.
Dean teaches at least some of the words are assigned a tokenID (see paragraphs [0020], [0023], [0031]; the system has means to assign identifiers to tokens;
“Each sorted unique token is then assigned a unique global token identifier (hereinafter also referred to as "GTokenID"). GTokenIDs can include any suitable data type and width depending upon the platform used to implement the document processing system 102 (e.g., 32-bit unsigned integers). In some embodiments, GTokenIDs are assigned to the sorted unique tokens in increasing order, so that high-frequency tokens are assigned small valued GTokenIDs and low-frequency tokens are assigned large valued GTokenIDs.”, para 31;
“A "token" can be any object typically found in a document, including but not limited to terms, phrases, punctuation, HTML tags and the like”, para 20);
wherein tokenIDs are limited in number based on the fixed number of bytes and assigned based at least in part on frequency of use (see paragraph [0031]; the system has a fixed width for the tokens and can assign the token IDs based on frequency; 
“Each sorted unique token is then assigned a unique global token identifier (hereinafter also referred to as "GTokenID"). GTokenIDs can include any suitable data type and width depending upon the platform used to implement the document processing system 102 (e.g., 32-bit unsigned integers). In some embodiments, GTokenIDs are assigned to the sorted unique tokens in increasing order, so that high-frequency tokens are assigned small valued GTokenIDs and low-frequency tokens are assigned large valued GTokenIDs.”, para 31);
filtering the tokenized database based on the tokenIDs for the one or more tokenized words from the search query (see paragraphs [0063] and [0067]; the system tokenizes the words of the query and can utilize those tokens to filter/search to find matching documents/results;
“A query string 502 is tokenized and parsed by a query parser 504 into query terms (i.e., each distinct term in the query is treated as a token). The tokenized query terms are translated by the global-lexicon 508 to corresponding GTokenIDs using a translation table or mapping, as previously described with respect to FIGS. 2 and 4.”, para 63;
“The first stage query processor 510 uses the query terms to search against a tokenspace inverted index 512 and to identify documents matching the query. The first stage query processor 510 accesses the inverse index 512 to produce a list of token positions (also called tokenspace repository positions) for terms in the query tree and accesses the DocID Map 516 to produce a set of DocIDs for the documents corresponding to the token positions. In addition, the first stage processor 510 performs the Boolean logic specified by the query or query tree so as to generate a set of DocIDs that are responsive to the query. In some embodiments, the first stage query processor 510 also computes a first set of relevancy scores S.sub.1 between the query and each document based on one or more scoring algorithms.”, para 67);
wherein each tokenID exposes a list of blocksIDs (see paragraph [0067]; an index can be used that exposes or maps tokens or token IDs to respective document identifiers/IDs). 
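For illustration only, the combination attributed to Dean above (frequency-ordered token identifiers of a fixed width, plus an index from token ID to document IDs) can be sketched as follows. This is a minimal sketch under stated assumptions, not Dean's implementation: the 2-byte ID width, the function names, the alphabetical tie-break, the AND semantics of the filter, and the sample documents are all hypothetical.

```python
from collections import Counter, defaultdict

ID_WIDTH_BYTES = 2                    # assumed fixed width of a token ID
MAX_IDS = 2 ** (8 * ID_WIDTH_BYTES)   # at most 65,536 distinct IDs fit


def assign_token_ids(documents):
    """Give high-frequency tokens small IDs and low-frequency tokens large IDs."""
    counts = Counter(tok for text in documents.values() for tok in text.split())
    # Descending frequency, then alphabetical so the result is deterministic.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: token_id for token_id, tok in enumerate(ranked[:MAX_IDS])}


def build_inverted_index(documents, token_ids):
    """Map each token ID to the set of document IDs that contain the token."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for tok in text.split():
            if tok in token_ids:
                index[token_ids[tok]].add(doc_id)
    return index


def filter_documents(query, token_ids, index):
    """Return documents containing every known query token (AND semantics).

    Query words with no token ID are skipped in this sketch.
    """
    ids = [token_ids[tok] for tok in query.split() if tok in token_ids]
    if not ids:
        return set()
    result = set(index[ids[0]])
    for token_id in ids[1:]:
        result &= index[token_id]
    return result


docs = {
    1: "tokens compress text",
    2: "tokens index text quickly",
    3: "ranking sentences",
}
token_ids = assign_token_ids(docs)
index = build_inverted_index(docs, token_ids)
matches = filter_documents("tokens text", token_ids, index)
print(token_ids)
print(sorted(matches))
```

The fixed width caps how many distinct token IDs can exist, which is why the ranked list is truncated to `MAX_IDS` before IDs are handed out.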
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify the text search system of Lee by assigning identifiers to the tokens and using a respective inverse index as taught by Dean in order to allow for quick indexing and retrieval of respective tokens, and to allow the system to find the documents associated with particular tokens as well as where those tokens are in the document.
Lee in view of Dean do not appear to explicitly teach:
receiving, from a ML source, one or more sentences, and determining one or more words from the one or more sentences to use for querying the tokenized database;
identifying tokenIDs from a database of tokenIDs corresponding to the blockIDs for the one or more words from the one or more sentences, wherein the tokenID database associates a list of blockIDs to tokenIDs of words using a fixed number of bytes, 
decompressing a chunk of original text corresponding to each of the chunkIDs; 
comparing the one or more sentences to each sentence of the list of tokenized sentences to rank sentences; 
and replying, back to the ML source, one or more sentences based on the raking.
Khanwalkar teaches receiving, from a ML source, one or more sentences (see col 12, lines 43-60; corresponds to section 3; second to last paragraph on page 6 of the provisional; the system can utilize sentences from a ML source such as the generative response/answer as means to verify the accuracy of the responses;
“When using GPTs for question answering, adding references to GPT responses would be beneficial for high quality learning [3]. In some embodiments, a reference system can provide document evidence for users to verify the accuracy of GPT responses. We developed embodiments of a system with a large index of study passages (e.g., over 2.5 billion) across thirty eight academic subjects. In some embodiments, given a GPT answer & explanation, we first conduct a search (e.g., lexical search) using, as one example, Opensearch to retrieve top relevant passages (e.g. 20) (any other search may be utilized as appropriate). Next, in some embodiments, we rank the retrieved passages based on, for example, their similarity (e.g., semantic similarity) to the GPT response using a model such as a Sentence-BERT model. As one example, we select the top three ranked passages from study documents (any other number of passages may be selected, as appropriate) and in some embodiments add document links to the GPT response as references.”, section 3 of provisional).
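For illustration only, the two-stage flow quoted from Khanwalkar (a lexical search to retrieve candidate passages, followed by ranking the candidates by similarity to the generated response) can be sketched as below. This is a minimal sketch, not Khanwalkar's implementation: word-overlap counting stands in for OpenSearch, cosine similarity over word counts stands in for a Sentence-BERT model, and the function names and sample passages are hypothetical.

```python
import math
from collections import Counter


def lexical_search(query, passages, top_k):
    """Stage 1: keep the top_k passages sharing the most distinct words with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.lower().split())), i)
              for i, p in enumerate(passages)]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best overlap first, stable on ties
    return [passages[i] for _, i in scored[:top_k]]


def cosine(a, b):
    """Cosine similarity of word-count vectors (a stand-in for sentence embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def rerank(response, candidates, top_n):
    """Stage 2: order candidates by similarity to the response, keep top_n."""
    return sorted(candidates, key=lambda p: cosine(response, p), reverse=True)[:top_n]


passages = [
    "water boils at 100 degrees celsius at sea level",
    "the boiling point of water depends on pressure",
    "iron melts at 1538 degrees celsius",
    "cats sleep most of the day",
]
response = "water boils at 100 degrees celsius"
candidates = lexical_search(response, passages, top_k=3)
references = rerank(response, candidates, top_n=2)
print(references)
```

The quoted passage uses 20 candidates and three final references; the smaller values here only keep the example short.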
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify the text search system of Lee in view of Dean by accepting input from machine learning sources as taught by Khanwalkar in order to expand the usability of the system to allow for various clients, including clients making use of machine learning models, to have access to text similarity/comparison by being able to compare generated answers/responses to respective reference documents thus allowing the system to be able to find similar stored text segments from documents including factual verification.
Lee in view of Dean and Khanwalkar teach receiving, from a ML source, one or more sentences and determining one or more words from the one or more sentences to use for querying the tokenized database (see Khanwalkar, col 12, lines 43-60; corresponds to section 3; second to last paragraph on page 6 of the provisional; see Lee, paragraphs [0044], [0116], [0131], [0136], and [0139]; the system can receive a response comprising at least one sentence from a ML source and be able to preprocess that input to identify/determine words);
identifying tokenIDs from a database of tokenIDs corresponding to the blockIDs for the one or more words from the one or more sentences, wherein the tokenID database associates a list of blockIDs to tokenIDs of words using a fixed number of bytes (see Lee, paragraphs [0131], [0135]-[0137]; see Dean, paragraph [0067]; see Khanwalkar, see col 12, lines 43-60; corresponds to section 3; second to last paragraph on page 6 of the provisional; the system can identify tokens from a database of tokens corresponding to the blocks/chunks/segments of various documents where the tokenID can be used to index or associate that token to respective source material), 
decompressing a chunk of original text corresponding to each of the chunkIDs; comparing the one or more sentences to each sentence of the list of tokenized sentences to rank sentences; and replying, back to the ML source, one or more sentences based on the raking (see Dean, paragraphs [0043] and [0067]-[0068]; see Lee, paragraph [0059] and [0092]; see Khanwalkar, see col 12, lines 43-60; corresponds to section 3; second to last paragraph on page 6 of the provisional; the system can decompress/decode the respective text so that it can be retrieved and utilized in other processes including similarity comparisons with other text sentences/segments and rank them accordingly so that the system can provide a response back to the client of the system).
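For illustration only, the step of retrieving a stored chunk by its chunkID and decompressing it into sentences for later comparison can be sketched as below. This is a minimal sketch under stated assumptions and does not reflect any cited reference or the application: general-purpose zlib compression is swapped in for whatever token-based compression is actually used, and the class name, the chunk ID value, and the period-based sentence split are hypothetical.

```python
import zlib


class ChunkStore:
    """Keeps each text chunk compressed, keyed by its chunk ID."""

    def __init__(self):
        self._chunks = {}

    def add(self, chunk_id, text):
        # Store only the compressed bytes of the original text.
        self._chunks[chunk_id] = zlib.compress(text.encode("utf-8"))

    def get(self, chunk_id):
        # Decompress on demand to recover the original text.
        return zlib.decompress(self._chunks[chunk_id]).decode("utf-8")


store = ChunkStore()
store.add(7, "Tokens are small. Chunks hold the original text.")

# Recover the chunk and split it into sentences for a later comparison step.
sentences = [s.strip() for s in store.get(7).split(".") if s.strip()]
print(sentences)
```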

With regard to claims 9 and 10, these claims are substantially similar to claim 1 and are rejected for similar reasons as claim 1 as discussed above.

With regard to claim 5, Lee in view of Dean and Khanwalkar teach wherein comparing the one or more sentences to each sentence of the list of tokenized sentences comprises using a natural language processor (NLP) to determine similarity (see Khanwalkar, col 12, lines 43-60; corresponds to section 3; second to last paragraph on page 6 of the provisional; an NLP processor can be used to determine similarity of the sentences).

With regard to claim 6, Lee in view of Dean and Khanwalkar teach wherein the computer device is communicatively coupled to a data communication network (see Lee, paragraph [0053] and Figure 1; computer device is coupled via a communication network).

With regard to claim 7, Lee in view of Dean and Khanwalkar teach wherein the computer device comprises an AI appliance (see Lee, paragraphs 136-137; at least one AI appliance is used).

With regard to claim 8, Lee in view of Dean and Khanwalkar teach wherein the computer device services a plurality of clients distributed over s data communication network (see Lee, paragraph [0053] and Figure 1; computer device can service/communicate with a plurality of clients over a communication network).



Claims 2-4 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al [US 2020/0193153 A1] in view of Dean et al [US 2006/0036593 A1] and Khanwalkar et al [US 12,393,620] (provisional date on Feb 2, 2024 from provisional US 63/539,107) in further view of Milazzo et al [US 2020/0073902 A1].
With regard to claim 2, Lee in view of Dean and Khanwalkar teach all the claim limitations of claim 1 as discussed above.
Lee in view of Dean and Khanwalkar do not appear to explicitly teach wherein tokenIDs correspond to Kaggle terms.
Milazzo teaches Kaggle terms (see paragraph [0110]; database with words for various associations can be provided by Kaggle).
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify the text search system of Lee in view of Dean and Khanwalkar by utilizing available databases that can provide datasets to the machine learning model so that it can be trained as taught by Milazzo in order to allow the system greater flexibility and versatility in being able to utilize other datasets including third-party datasets for training purposes thereby allowing for already established and widely-used and respected databases to be used to help train the various models without having to force developers to spend time and effort to have to create all their own proprietary datasets.
Lee in view of Dean and Khanwalkar in further view of Milazzo teach wherein tokenIDs correspond to Kaggle terms (see Milazzo, paragraph [0110]; see Lee, paragraphs [0131], [0135]-[0137]; see Dean, paragraphs [0020], [0023], [0031]; the system can identify tokens from a database of tokens where the training of the vocabulary can be based on datasets/data sources from Kaggle).

With regard to claim 3, Lee in view of Dean and Khanwalkar teach all the claim limitations of claim 1 as discussed above.
Lee in view of Dean and Khanwalkar do not appear to explicitly teach wherein tokenizing the query response comprises retrieving Kaggle tokenIDs associated with one or more words of the query response.
Milazzo teaches Kaggle terms (see paragraph [0110]; database with words for various associations can be provided by Kaggle).
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify the text search system of Lee in view of Dean and Khanwalkar by utilizing available databases that can provide datasets to the machine learning model so that it can be trained as taught by Milazzo in order to allow the system greater flexibility and versatility in being able to utilize other datasets including third-party datasets for training purposes thereby allowing for already established and widely-used and respected databases to be used to help train the various models without having to force developers to spend time and effort to have to create all their own proprietary datasets.
Lee in view of Dean and Khanwalkar in further view of Milazzo teach wherein tokenizing the query response comprises retrieving Kaggle tokenIDs associated with one or more words of the query response (see Milazzo, paragraph [0110]; see Lee, paragraphs [0131], [0135]-[0137]; see Dean, paragraphs [0020], [0023], [0031]; the system can identify tokens from a database of tokens where the training of the vocabulary can be based on datasets/data sources from Kaggle).

With regard to claim 4, Lee in view of Dean and Khanwalkar teach all the claim limitations of claim 1 as discussed above.
Lee in view of Dean and Khanwalkar do not appear to explicitly teach wherein the tokenizing one or more facts comprises Kaggle tokenIDs associated with one or more words of the one or more facts.
Milazzo teaches Kaggle terms (see paragraph [0110]; database with words for various associations can be provided by Kaggle).
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify the text search system of Lee in view of Dean and Khanwalkar by utilizing available databases that can provide datasets to the machine learning model so that it can be trained as taught by Milazzo in order to allow the system greater flexibility and versatility in being able to utilize other datasets including third-party datasets for training purposes thereby allowing for already established and widely-used and respected databases to be used to help train the various models without having to force developers to spend time and effort to have to create all their own proprietary datasets.
Lee in view of Dean and Khanwalkar in further view of Milazzo teach wherein the tokenizing one or more facts comprises Kaggle tokenIDs associated with one or more words of the one or more facts (see Milazzo, paragraph [0110]; see Lee, paragraphs [0131], [0135]-[0137], [0139]; see Dean, paragraphs [0020], [0023], [0031]; the system can identify tokens from a database of tokens where the training of the vocabulary can be based on datasets/data sources from Kaggle).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Hajela et al [US 2007/0027853 A1] teaches at Figures 2-4 and respective paragraphs 33-37 words/tokens being associated with multiple documents as well as keeping track of the frequency of usage of those words/tokens.
Kobren et al [US 12,242,568] teaches at col 3, line 56 through col 5, line 15 the usage of various data sources for a machine learning system including Kaggle.
Lipka et al [US 2025/0181620 A1] teaches at paragraphs 37-42 an answer generator with fact check attribution that involves a LLM to provide sources for each sentence in the answer and provide a list of those sources when answering the input question and includes a text encoder to generate embeddings and compare sentence embeddings to answer sentence embeddings.
Kanuga et al [US 2026/0094717 A1] teaches at paragraphs 76-79 the usage of a LLM to answer queries and to avoid hallucinations by grounding responses generated by LLMs with contextual (reference) information with the ability to compare the generated response to context documents sentence-by-sentence.
Reimers et al, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China; 2019 (11 total pages). 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARC S SOMERS whose telephone number is (571)270-3567. The examiner can normally be reached M-F 11-8 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571)272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/MARC S SOMERS/Primary Examiner, Art Unit 2159                                                                                                                                                                                                        
3/31/2026
Prosecution Timeline

May 13, 2025: Application Filed
Apr 09, 2026: Non-Final Rejection mailed (§103, §112; current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/769,063
Patent 12632420
LOCK CONTROLLER AND METHOD TO IMPLEMENT BUFFER LOCK UTILIZING REMOTE DIRECT MEMORY ACCESS
1y 10m to grant Granted May 19, 2026
16/713,663
Patent 12619625
SYSTEM FOR PERFORMING DATA TRANSFORMATIONS USING A SET OF INDEPENDENT SOFTWARE COMPONENTS
6y 4m to grant Granted May 05, 2026
18/824,014
Patent 12579099
CONTROL LEVEL TAGGING METHOD AND SYSTEM
1y 6m to grant Granted Mar 17, 2026
17/813,218
Patent 12561288
METHOD AND APPARATUS TO VERIFY FILE METADATA IN A DEDUPLICATION FILESYSTEM
3y 7m to grant Granted Feb 24, 2026
18/172,315
Patent 12554681
SYSTEM AND METHOD OF UNDOING DATA BASED ON DATA FLOW MANAGEMENT
2y 12m to grant Granted Feb 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 65%
With Interview (+34.5%): 99%
Median Time to Grant: 3y 11m (~2y 11m remaining)
PTA Risk: Low
Based on 567 resolved cases by this examiner. Grant probability derived from career allowance rate.