Prosecution Insights
Last updated: April 19, 2026
Application No. 17/961,069

ENHANCED DOCUMENT INGESTION USING NATURAL LANGUAGE PROCESSING

Non-Final OA: §101, §102, §103
Filed: Oct 06, 2022
Examiner: LE, UYEN T
Art Unit: 2156
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 1 (Non-Final)
Grant Probability: 84% (Favorable)
OA Rounds: 1-2
To Grant: 2y 11m
With Interview: 94%

Examiner Intelligence

Career Allow Rate: 84% (above average; 669 granted / 797 resolved; +28.9% vs TC avg)
Interview Lift: +9.7% (moderate, roughly +10%, on resolved cases with interview)
Typical Timeline: 2y 11m avg prosecution; 24 currently pending
Career History: 821 total applications across all art units

Statute-Specific Performance

§101: 15.8% (-24.2% vs TC avg)
§103: 27.6% (-12.4% vs TC avg)
§102: 20.0% (-20.0% vs TC avg)
§112: 22.2% (-17.8% vs TC avg)
Tech Center averages are estimates • Based on career data from 797 resolved cases
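The headline figures in these panels are simple ratios of the raw counts shown above. A minimal sketch of how they can be reproduced (assuming, since the panels do not say so explicitly, that the "vs TC avg" delta and the interview lift are expressed in percentage points):

```python
# Reproduce the dashboard's headline examiner statistics from raw counts.
granted, resolved = 669, 797

allow_rate = 100 * granted / resolved            # career allowance rate
print(f"Career allow rate: {allow_rate:.1f}%")   # 83.9%, displayed as 84%

# Interview lift, assumed additive in percentage points:
interview_lift = 9.7
print(f"With interview: {allow_rate + interview_lift:.1f}%")  # 93.6%, displayed as 94%

# The +28.9-point delta implies a Tech Center average of roughly:
tc_avg = allow_rate - 28.9
print(f"Implied TC average: {tc_avg:.1f}%")      # about 55.0%
```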

Office Action

§101 §102 §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 6 October 2022 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claims 1-20 are pending.

Claim Objections

Claim 9 is objected to because of the following informality: in claim 9, line 2, “at one of” should be --at least one of-- for the sentence to make sense. Appropriate correction is required.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. An analysis of subject matter patentability is presented below for method claim 15:

Step 1: Claim 15 recites a method and thus falls within one of the statutory categories of invention.

Step 2A, Prong 1: Claim 15 recites: identifying within a first document…identifying within a second document…comparing the one or more terms…upon determining… These limitations are processes that, under their broadest reasonable interpretation, cover performance of the limitations by a human user. Note that nothing in the claim elements precludes the steps from practically being performed by a human user with the aid of pen and paper. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind, then it falls within the “Mental Processes” grouping of abstract ideas (concepts performed in the human mind, including an observation, evaluation, judgment, and opinion).
The mere nominal recitation of “the method is carried out by at least one computing device” does not take the claim limitations out of the mental processes grouping.

Step 2A, Prong 2: The judicial exception is not integrated into a practical application. The claim recites the additional elements “processing the at least a first document…processing the at least a second document…” Recited at a high level of generality, these amount to mere insignificant extra-solution activity, because processing documents using different models does not impose any meaningful limits on practicing the abstract idea (see MPEP 2106.05(g)).

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The limitation “upon determining…reprocessing the at least a first document using the second natural language processing model,” recited at a high level of generality, is a mere well-understood, routine, and conventional activity in the field of document processing using generic computers (see MPEP 2106.05(d)(II), (IV)). Accordingly, these additional elements do not integrate the abstract idea into a practical application.

Claims 1 and 10 merely recite the limitations of claim 15 in the form of a system and a computer program product, respectively, and thus are not patent eligible for the same reasons discussed for claim 15 above. Claims 2, 11, and 16 merely further perform one or more automated actions based on the determination, considered insignificant extra-solution activity (see MPEP 2106.05(g)). Claims 3, 12, and 17 merely further describe the automated action of training the first natural language processing model, considered insignificant extra-solution activity (see MPEP 2106.05(g)).
Claims 4, 13, and 18 merely further describe the automated action of automatically identifying one or more additional documents containing terms unknown to the first natural language processing model, considered insignificant extra-solution activity (see MPEP 2106.05(g)). Claims 5, 14, and 19 merely further describe the automated action of reprocessing each of the one or more additional documents, considered insignificant extra-solution activity (see MPEP 2106.05(g)). Claim 6 merely further describes that the processing includes implementing enrichment fields with the first document, considered insignificant extra-solution activity (see MPEP 2106.05(g)). Claim 7 merely adds storing the one or more unknown terms in a database, considered insignificant extra-solution activity (see MPEP 2106.05(g)). Claim 8 merely recites removing terms unknown to the first natural language processing model from the database, considered insignificant extra-solution activity (see MPEP 2106.05(g)). Claim 9 merely further describes the second natural language processing model, considered insignificant extra-solution activity (see MPEP 2106.05(g)). Claim 20 merely describes that software implementing the method is provided as a service in a cloud environment, considered insignificant extra-solution activity (see MPEP 2106.05(g)).

As discussed above, although the dependent claims seem more detailed than their parent claims, none amounts to significantly more than the judicial exception of an abstract idea. No claim is patent eligible.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-19 are rejected under 35 U.S.C. 102(a)(1) as anticipated by or, in the alternative, under 35 U.S.C. 103 as obvious over Potts et al (US 11263391), provided by the applicant.

Regarding claim 1, Potts substantially discloses a system comprising: a memory configured to store program instructions (see col.7 lines 39-43); and a processor operatively coupled to the memory to execute the program instructions (see col.5 lines 51-55) to: identify, within at least a first document, one or more terms unknown to a first natural language processing model by processing the at least a first document using the first natural language processing model (see at least col.27 lines 23-34: “In some implementations, an RKG can be used to develop lexical resources that may be subsequently employed by the Alpine Annotation Manager to facilitate annotation projects.
For example, in one implementation, documents in an annotation project dataset may be explored and pre-annotated (prior to initial annotation by manual annotators) using one or more lexicons and/or NLP models referred to as “extractors.” In some examples, such extractors are built on lexical resources harvested from an RKG and are employed in Alpine to process respective documents of the annotation project dataset to automatically find and label certain entity types (“concepts”) mentioned in the documents.”);

Note that although Potts does not specifically use the claim language of “terms unknown to a first natural language processing model”, Potts clearly teaches that Alpine explores documents using multiple NLP models. This clearly suggests that documents are processed by different NLP models, each trained on a specific data set (see at least col.28 lines 3-7: “The one or more computers also perform various processing of the respective documents of an annotation project data set and, in some implementations, also facilitate NLP model building and training, according to the various functionalities described herein”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, upon determining a document has terms unknown to one of the natural language processing models of Potts, to reprocess the document using another processing model that processes documents with terms matching the unknown terms in order to train a project NLP target model (see col.28 lines 61-67);

identify, within at least a second document, one or more terms known to a second natural language processing model by processing the at least a second document using the second natural language processing model (see at least col.27 lines 23-34: “In some implementations, an RKG can be used to develop lexical resources that may be subsequently employed by the Alpine Annotation Manager to facilitate annotation projects.
For example, in one implementation, documents in an annotation project dataset may be explored and pre-annotated (prior to initial annotation by manual annotators) using one or more lexicons and/or NLP models referred to as “extractors.” In some examples, such extractors are built on lexical resources harvested from an RKG and are employed in Alpine to process respective documents of the annotation project dataset to automatically find and label certain entity types (“concepts”) mentioned in the documents.”);

compare the one or more terms unknown to the first natural language processing model to the one or more terms known to the second natural language processing model (see at least col.57 lines 22-32: “In some implementations, as part of block 9320, at least some of the data in the single file representing the dataset may be “normalized” (or “canonicalized”), i.e., modified in some respect according to a predetermined standard or format so it may be more readily compared to other pieces of data (e.g., in other datasets) relating to the same or similar thing”);

and upon determining, in connection with the comparing, that at least one of the one or more terms unknown to the first natural language processing model matches at least one of the one or more terms known to the second natural language processing model, reprocess the at least a first document using the second natural language processing model (see at least col.1, line 65 to col.2, line 10: NLP models often work better when the models are provided with “pointers” to what is relevant about a source text, rather than just massive amounts of text. Such pointers also are referred to as “annotations” to the original text in question; generally speaking, any metadata tag (or “label”) added to one or more elements of text to categorize or specifically identify the text in some manner may be considered as an annotation.
“Supervised learning” refers to an NLP model that can learn to automatically label text with certain annotations, based on example text that is first annotated by humans according to a set of predetermined labels; this human-annotated text provides “labeled training data” for the NLP model.”). Note that the claimed one or more terms unknown/known to the first/second natural language processing models read on the specific labeled training data used to train particular models taught by Potts.

Regarding claim 2, Potts further teaches or suggests the system of claim 1, wherein the processor is further operatively coupled to the memory to execute the program instructions to: perform one or more automated actions based at least in part on the determination that at least one of the one or more terms unknown to the first natural language processing model matches at least one of the one or more terms known to the second natural language processing model (see at least col.45 lines 46-58: “Thus, it may be readily appreciated from the foregoing that the active learning framework facilitated by the Alpine AUI enables iterative training of the project NLP target models based on annotated and marked (e.g., corrected) documents all within the same tool. Trained project NLP target models are then deployed to automatically annotate the entire project dataset and thereby identify significant entities and concepts of particular interest to the use-case or business question at hand. These identified entities and concepts constitute structured data extracted from free-form text in the original documents, and in turn may serve as the basis of adding additional structured information to these documents.”).
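The claimed pipeline that this rejection maps onto Potts (independent claims 1, 10, and 15, plus the database steps of claims 7-8) can be sketched briefly. This is an illustrative sketch only, not the applicant's or Potts's implementation: the plain vocabulary sets stand in for whatever term-recognition mechanism a real NLP model would use, and all names and sample terms are hypothetical.

```python
# Sketch of the claimed method: identify terms unknown to a first NLP
# model, compare them to terms known to a second model, and on a match
# reprocess the first document with the second model.
# The "models" here are plain vocabulary sets -- a hypothetical stand-in.

def tokenize(document: str) -> set[str]:
    return set(document.lower().split())

def process(document: str, vocabulary: set[str]) -> tuple[set[str], set[str]]:
    """Split a document's terms into (known, unknown) for a model."""
    terms = tokenize(document)
    return terms & vocabulary, terms - vocabulary

model_1 = {"invoice", "total", "date"}               # first model's vocabulary
model_2 = {"invoice", "ledger", "accrual", "date"}   # second model's vocabulary

first_doc = "invoice accrual total"
second_doc = "ledger accrual date"

_, unknown_to_1 = process(first_doc, model_1)    # terms unknown to model 1
known_to_2, _ = process(second_doc, model_2)     # terms known to model 2

unknown_db = set(unknown_to_1)                   # claim 7: store unknown terms

# Claims 1/15: on a match, reprocess the first document with model 2;
# claim 8: then remove the matched terms from the database.
if unknown_to_1 & known_to_2:
    reprocessed_known, _ = process(first_doc, model_2)
    unknown_db -= unknown_to_1
    print(sorted(reprocessed_known))  # ['accrual', 'invoice']
```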
Regarding claim 3, Potts further teaches or suggests the system of claim 2, wherein performing one or more automated actions comprises automatically training the first natural language processing model based at least in part on the determination that at least one of the one or more terms unknown to the first natural language processing model matches at least one of the one or more terms known to a second natural language processing model (see at least col.31 lines 22-29: “each annotation project generally is associated with one or more trained project NLP target models to automatically annotate the documents in the corresponding project dataset (e.g., 216aA-216dA) according to the annotation scheme for the project data set. These project NLP target models can be developed, improved (e.g., trained and retrained iteratively), and monitored using Alpine.”).

Regarding claim 4, Potts further teaches or suggests the system of claim 2, wherein performing one or more automated actions comprises automatically identifying one or more additional documents containing the at least one term unknown to the first natural language processing model that matches at least one of the one or more terms known to the second natural language processing model (see at least col.30 lines 57-67: “To this end, at 3255 in FIG. 2B, the project NLP target model can be re-trained on original annotations and all marked documents (e.g., via the “Build” functionality 106 in FIG. 1). At 3260, the re-trained project NLP target model can further be applied to unmarked documents (e.g., another subset of model-annotated documents that has not yet been corrected by the annotators). At 3265, a determination can be made on if the model performs sufficiently well (e.g., via the “Build” functionality 106 in FIG. 1).
If the model does perform sufficiently well, at 3270, the re-trained NLP target model can be applied to the entire project dataset (or remaining unannotated documents) to provide structured data from free-form text. If the model does not perform sufficiently well, then the method reverts back to step 3245.”).

Regarding claim 5, Potts further teaches or suggests the system of claim 4, wherein performing one or more automated actions comprises automatically reprocessing each of the one or more additional documents using the second natural language processing model (see at least col.28 line 61 - col.29 line 10: “As also shown in FIG. 1, the AUI can also include a “Build” functionality 106 to facilitate designing and/or training of one or more project NLP target models. More specifically, the “Build” functionality 106 can enable users, who need not necessarily be machine learning and/or NLP engineers or experts, to design and/or train project NLP target models. In example implementations, the annotations made in at least a subset of project dataset documents using the “Annotate” functionality 104 (and optionally the “Explore” functionality 102 as well) can be used as training data to design and/or train project NLP target models. Once a project NLP target model is trained and designed, this project NLP target model can then be used to automatically annotate other documents within the same project dataset and/or documents within a different project dataset (presumably involving a same or similar domain and associated entities/concepts)”.).
Regarding claim 6, Potts further teaches or suggests the system of claim 1, wherein processing the at least a first document using the first natural language processing model comprises implementing one or more enrichment fields with the at least a first document, wherein the one or more enrichment fields comprise one or more of at least one enrichment field corresponding to one or more terms unknown to the first natural language processing model and at least one enrichment field corresponding to context information associated with one or more terms unknown to the first natural language processing model (see at least col.4 lines 20-36: “With respect to knowledge graphs and their utility for annotation of documents, the Inventors have recognized and appreciated that many things, if not everything—a name, a number, a date, an event description—acquires greater meaning in context, where it can be compared with other things. Context is essential for understanding, and the more context one has, the fuller one's understanding can be. Individual pieces of information or relatively confined sources of data are often unlikely to provide sufficient context to facilitate a deeper understanding of the meaning of the information at hand. 
Even with relatively larger amounts of information available, respective pieces of information may remain unconnected, inconsistent or disjointed in some manner, and relationships between certain pieces of information may not be readily apparent or even discernible from the respective (and often unconnected, inconsistent, or disjointed) pieces”); (see also col.27 lines 47-54: “in this manner, the structured information derived from the annotations of the documents in the annotation project dataset can be readily coupled to the existing RKG, benefit from the broader context of RKG, and the RKG itself can be augmented with the structured information extracted from the text documents of the project dataset to provide greater context for the overall information domain of interest.”).

Regarding claim 7, Potts further teaches or suggests the system of claim 1, wherein identifying one or more terms unknown to the first natural language processing model comprises storing the one or more terms in at least one database (see at least col.73 lines 35-47: “The semantic parsing engine maps English texts to statements in the declarative Neo4j query language Cypher. FIG. 43 depicts the architecture. The boxes namely “Language models,” “Entity index,” “Lexical resources,” and “Grammar” highlight the numerous ways in which the system is defined by its underlying graph. The language models used for entity detection are trained on ‘name’-type attributes of nodes, and resolving those entities is graph-backed: the ‘Entity index’ is automatically created from the database and provides fast look-up. The ‘Lexical analysis’ step is similarly graph-backed: node and edge type-names provide the core lexicon, which can then be expanded using Wiktionary, WordNet, and heuristic morphological expansion”.).
Regarding claim 8, Potts further teaches or suggests the system of claim 7, wherein the processor is further operatively coupled to the memory to execute the program instructions to: remove the terms unknown to the first natural language processing model from the at least one database subsequent to reprocessing the at least a first document using the second natural language processing model (see at least col.40 lines 8-14: “In some inventive aspects, annotators can change and/or delete spannotation labels and spannotation relations, annotate new spans using existing annotation scheme, and/or alter/augment the annotation scheme in real time with new spannotation labels and spannotation relations. In any and all of these cases, Alpine automatically updates the annotation scheme”.).

Regarding claim 9, Potts further teaches or suggests the system of claim 1, wherein the second natural language processing model comprises at one of a distinct natural language processing model from the first natural language processing model and a modified version of the first natural language processing model (see at least col.45 lines 12-28: “Once the annotator reviews the document, the document can be used as data to re-train the model. FIG. 29 illustrates a twenty third screen shot 2800 of the AUI re-training the project NLP model following inclusion of new data (47 additional documents) after corrections to the annotated predictions by the annotators. As illustrated in FIG. 29, with the inclusion of these additional documents as training data, there is an increase in the performance metric 2524A for each of the annotation labels 2522A in the annotation scheme. The plot 2526A illustrates that the performance metric with respect to the version of the project NLP target model (v2) improves as the version increases.
Put differently, the performance metric of the initially trained project NLP model (trained with 200 initial documents) is lower than the performance metric of the re-trained project NLP model (trained with 247 documents—47 of which include corrections to predicted annotations by annotators)”).

Claims 10-14 essentially recite limitations similar to claims 1-5 in the form of a computer program product, and thus are rejected for the same reasons discussed in claims 1-5 above. Claims 15-19 essentially recite limitations similar to claims 1-5 in the form of methods, and thus are rejected for the same reasons discussed in claims 1-5 above.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Potts et al (US 11263391), provided by the applicant, further in view of McNeil et al (US 20220043980 A1).
Regarding claim 20, Potts does not specifically show the computer-implemented method of claim 15, wherein software implementing the method is provided as a service in a cloud environment. However, it is customary in the art to do so, as shown by McNeil (see [0010]…, “further embodiments could include where software is provided as a service in a cloud environment for providing the result from the NLP model having accounted for the context of the user”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include providing such software as a service in a cloud environment, as taught by McNeil, in order to benefit from the advantages of a cloud environment (see McNeil [0022]…in a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Kang et al (US 11544476 B2) teach a method for model customization that, according to an embodiment, includes providing a user with prediction results of each of a plurality of pre-trained natural language processing models for a document subjected to analysis selected from a document set including a plurality of documents; acquiring user feedback on the prediction results from the user; generating a plurality of augmented documents from at least one of the plurality of documents based on data attributes of each of the plurality of documents and the user feedback; and retraining at least one of the plurality of natural language processing models using training data including the plurality of augmented documents. Word swap may be performed by changing at least one of the words included in a text subjected to conversion to a synonym using a predefined thesaurus. For example, as shown in FIG.
4, the data augmenter 122 may change the words “established”, “product” and “suitable” in a text subjected to conversion to the synonyms “launched”, “merchandise” and “appropriate”, respectively, to convert the text subjected to conversion. A re-trainer 123 retrains at least one of a plurality of natural language processing models using training data including a plurality of augmented documents generated by the data augmenter 122.

Wu; Tianhao (US 20200320433 A1) teaches a system and method for real-time machine learning that include an interface device and a processing device to, responsive to receiving a document: identify tokens in a document object model (DOM) tree associated with the document; present, on a user interface of the interface device, the document including the identified tokens; label, based on user actions on the user interface, one or more of the tokens in the DOM tree as one of a strong positive, a strong negative, or one of a weak positive or a weak negative token; and provide the DOM tree including the labeled tokens to train a machine learning model.

Singh Bawa et al (US 20220237373 A1) teach systems, methods, and computer-readable storage media that support automated summarization of documents using machine learning and artificial intelligence. The summaries include document category-specific summaries for different categories of documents. To illustrate, a document processing device may receive input data representing a document and provide first feature data based on the input data to one or more machine learning (ML) models to select a document category of a plurality of predefined document categories. The document processing device may provide second feature data based on the input data to a second set of ML models associated with the selected document category to generate annotation data associated with the document.
The document processing device may provide the annotation data to a third set of ML models to generate a summary of the document, and the document processing device may generate an output based on the summary.

DONALDSON et al (US 20210019374 A1) teach that multiple natural language training text strings are obtained. For example, text portions may be randomly selected and converted into natural language text based on one or more randomly selected rules. A formatted training text string is generated for each natural language training text string, for example using a context-free grammar parser. The formatted training text strings are inputted to a machine learning model. For each formatted training text string, using the machine learning model, a natural language text string is generated. The natural language text string is associated with one of the natural language training text strings. One or more parameters of the machine learning model are adjusted based on one or more differences between at least one of the natural language text strings and its associated natural language training text string. For each term of the natural language text string that is not included in keyword database 514, synonym module 504 determines whether the term is equal to or identical to an explicit synonym in the language-specific thesaurus. After synonym module 504 has processed the natural language text string, the natural language text string, including any terms updated to reflect keywords recognizable by parser module 506 (i.e., that are comprised in keyword database 514), is then processed by parser module 506. At this point, all terms within the natural language text string are either identified as “unknown” or have been converted to terms recognizable by parser module 506. Parser module 506 then uses context-free grammar (CFG) parsing to determine associations between words.
DA SILVA FURÃO SARA (EP 4020304 A1) teaches providing a Natural Language Processing solution based on a method using small trained data models, which may be individually activated or signed, to select the most suitable in order to give the best answer, given the context of the dialogue. This method allows the use of multiple models previously created, without requiring the training of a huge dataset, and is based on the application of heuristics (102) over the results of a Natural Language Understanding engine, taking advantage of dataset segmentation (101) to define contexts for each dataset (100), enabling context-aware Natural Language Processing and a Hybrid NLP engine which reduces or eliminates ambiguity in choosing the best answer from all models. This would not be possible with a Natural Language Understanding engine alone.

Kumar et al (US 20210073471 A1) teach a method (400) that involves identifying (401) a natural language document associated with multiple document attributes. The natural language document comprises multiple natural language words. An attribute-based document embedding is determined (402) for the natural language document. The attribute-based document embedding is generated based at least in part on a document vector for the natural language document and a word vector for each natural language word of the natural language words. The classifier model is configured to update the document vector for the natural language document based at least in part on an attribute-detection optimization goal. The attribute-based document embedding is processed (403) using a predictive inference model to determine multiple document-related predictions for the natural language document. Multiple prediction-based actions are performed (404) based at least in part on the document-related predictions.

Greiner-Petter, André, et al. "Math-word embedding in math search and semantic extraction." Scientometrics 125.3 (2020): 3017-3046.
Abstract: Word embedding, which represents individual words with semantically fixed-length vectors, has made it possible to successfully apply deep learning to natural language processing tasks such as semantic role-modeling, question answering, and machine translation. As math text consists of natural text, as well as math expressions that similarly exhibit linear correlation and contextual characteristics, word embedding techniques can also be applied to math documents. However, while mathematics is a precise and accurate science, it is usually expressed through imprecise and less accurate descriptions, contributing to the relative dearth of machine learning applications for information retrieval in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in word embedding, it is worthwhile to explore their use and effectiveness in math information retrieval tasks, such as math language processing and semantic knowledge extraction. In this paper, we explore math embedding by testing it on several different scenarios, namely, (1) math-term similarity, (2) analogy, (3) numerical concept-modeling based on the centroid of the keywords that characterize a concept, (4) math search using query expansions, and (5) semantic extraction, i.e., extracting descriptive phrases for math expressions. Due to the lack of benchmarks, our investigations were performed using the arXiv collection of STEM documents and carefully selected illustrations on the Digital Library of Mathematical Functions (DLMF: NIST digital library of mathematical functions. Release 1.0.20 of 2018-09-1, 2018). Our results show that math embedding holds much promise for similarity, analogy, and search tasks. However, we also observed the need for more robust math embedding approaches.
Moreover, we explore and discuss fundamental issues that we believe thwart progress in mathematical information retrieval in the direction of machine learning.

Zhou, D., Truran, M., Brailsford, T., & Ashman, H. (2008). A hybrid technique for English-Chinese cross language information retrieval. ACM Transactions on Asian Language Information Processing (TALIP), 7(2), 1-35.

Abstract: In this article we describe a hybrid technique for dictionary-based query translation suitable for English-Chinese cross language information retrieval. This technique marries a graph-based model for the resolution of candidate term ambiguity with a pattern-based method for the translation of out-of-vocabulary (OOV) terms. We evaluate the performance of this hybrid technique in an experiment using several NTCIR test collections. Experimental results indicate a substantial increase in retrieval effectiveness over various baseline systems incorporating machine- and dictionary-based translation.

Isabel Nadine de Santana, Raphael Souza de Oliveira, Erick Giovani Sperandio Nascimento, "Text Classification of News Using Deep Learning and Natural Language Processing Models Based on Transformers for Brazilian Portuguese." Proceedings of the 26th World Multi-Conference on Systemics, Cybernetics and Informatics: WMSCI 2022, Vol. III, pp. 134-139 (2022); https://doi.org/10.54808/WMSCI2022.03.134

Abstract: This work proposes the use of a fine-tuned Transformer-based Natural Language Processing (NLP) model called BERTimbau to generate word embeddings from texts published in a Brazilian newspaper and to create a robust NLP model for classifying news in Portuguese, a task that is costly for humans to perform on large amounts of data. To assess this approach, besides the generation of the embeddings by the fine-tuned BERTimbau, a comparative analysis was conducted using the Word2Vec technique.
The first step of the work was to rearrange the news from nineteen into ten categories to reduce class imbalance in the corpus, using the K-means and TF-IDF techniques. In the Word2Vec step, the CBOW and Skip-gram architectures were applied. In both the BERTimbau and Word2Vec steps, the Doc2Vec method was used to represent each news item as a unique embedding, generating a document embedding for each item. The metrics accuracy, weighted accuracy, precision, recall, F1-score, AUC ROC, and AUC PRC were applied to evaluate the results. It was noticed that the fine-tuned BERTimbau captured distinctions among the texts of the different categories, showing that the classification model based on the fine-tuned BERTimbau performs better than the other explored techniques.

Jiang, Sihang, et al. "Towards the completion of a domain-specific knowledge base with emerging query terms." 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019.

Abstract: Domain-specific knowledge bases play an increasingly important role in a variety of real applications. In this paper, we use the product knowledge base of the largest Chinese e-commerce platform, Taobao, as an example to investigate a completion procedure for a domain-specific knowledge base. We argue that domain-specific knowledge bases tend to be incomplete, and remain oblivious to their incompleteness, without a continuous completion procedure in place. The key component of this completion procedure is the classification of emerging query terms into corresponding properties of categories in the existing taxonomy. Our proposal is to use query logs to complete the product knowledge base of Taobao. However, query-driven completion faces many challenges, including distinguishing the fine-grained semantics of unrecognized terms and handling sparse data. We propose a graph-based solution to overcome these challenges.
We first construct positive evidence to establish semantic similarity between terms, and then run a shortest-path or, alternatively, a random-walk algorithm on the similarity graph under a set of constraints derived from negative evidence to find the best candidate property for emerging query terms. We finally conduct extensive experiments on real data from Taobao and a subset of CN-DBpedia. The results show that our solution classifies emerging query terms with good performance. Our solution is already deployed in Taobao, helping it find nearly 7 million new values for properties. The completed product knowledge base significantly improves the ratios of recognized queries and recognized terms by more than 25% and 32%, respectively.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to UYEN T LE, whose telephone number is (571) 272-4021. The examiner can normally be reached M-F 9-5. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Ajay M Bhatia, can be reached at (571) 272-3906. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/UYEN T LE/
Primary Examiner, Art Unit 2156
20 December 2025
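Several of the cited references share one underlying operation: representing terms or documents as fixed-length vectors and comparing them by cosine similarity (the math-term similarity and analogy tasks in Greiner-Petter et al., the document vectors in Kumar, and the Doc2Vec step in de Santana et al.). As an illustration only, using toy vectors invented for this sketch rather than data from any cited reference, the comparison can be sketched as:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings" for illustration; real models use hundreds of dimensions.
embeddings = {
    "integral":   np.array([0.9, 0.1, 0.0]),
    "derivative": np.array([0.8, 0.2, 0.1]),
    "newspaper":  np.array([0.0, 0.1, 0.9]),
}

def most_similar(word, embeddings):
    """Return the other vocabulary word closest to `word` by cosine similarity."""
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )

print(most_similar("integral", embeddings))  # prints: derivative
```

A Doc2Vec-style document embedding reduces each whole document to one such vector, so documents can be ranked against each other with the same similarity function.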

Prosecution Timeline

Oct 06, 2022
Application Filed
Oct 18, 2023
Response after Non-Final Action
Nov 25, 2025
Non-Final Rejection — §101, §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591550
SHARE REPLICATION BETWEEN REMOTE DEPLOYMENTS
2y 5m to grant Granted Mar 31, 2026
Patent 12591540
DATA MIGRATION IN A DISTRIBUTIVE FILE SYSTEM
2y 5m to grant Granted Mar 31, 2026
Patent 12581301
MEDIA AGNOSTIC CONTENT ACCESS MANAGEMENT
2y 5m to grant Granted Mar 17, 2026
Patent 12579189
METHOD, DEVICE, AND COMPUTER PROGRAM PRODUCT FOR GENERATING OBJECT IDENTIFIER
2y 5m to grant Granted Mar 17, 2026
Patent 12561371
GRAPH OPERATIONS ENGINE FOR TENANT MANAGEMENT IN A MULTI-TENANT SYSTEM
2y 5m to grant Granted Feb 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
84%
Grant Probability
94%
With Interview (+9.7%)
2y 11m
Median Time to Grant
Low
PTA Risk
Based on 797 resolved cases by this examiner. Grant probability derived from career allow rate.
