DETAILED ACTION
This communication is in response to the Amendments filed on 12/15/2025. Claims 1-20 are pending and have been examined.
Any previous objection/rejection not mentioned in this Office Action has been withdrawn by the examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. JP2021025815, filed on 7/8/2021.
Response to Arguments
With respect to the 35 U.S.C. 101 rejections, applicant submits that it is impractical for the human mind, by using pen and paper and by executing the generic computing component, to perform a sequence of operations comprising: extracting the text in the data that is extracted from a shared memory to automatically create at least a part of learning data, acquiring subsequent data, and excluding, based at least on the meta information, the subsequent text from extraction and storage in the database.
Examiner respectfully disagrees; the overarching process of the amended independent claim still represents an abstract idea. The acquiring step introduces additional components that would be considered pre-solution activity. The determining, extracting, storing, acquiring, and excluding steps can all be performed by the human mind (more details in the rejection below). The training step is post-solution activity representing an intended use for the method. Overall, the claims still represent the process of taking data from multiple documents and training something with it. The attempts to tie the machine learning model training into the claim language are not sufficient, as there is no clear language showing how this is an improvement to the training process. Furthermore, the excluding step attempts to show an improvement in the invention, but the “condition” is stated broadly and fails to connect throughout the claim language to show an improvement to the training process.
With respect to the 35 U.S.C. 103 rejections, applicant respectfully traverses the § 103 rejections because the Examiner failed to state a prima facie case of obviousness and/or the current amendments to the claims now render the Examiner's arguments moot.
Excluding a text from being a part of learning data based on meta information that indicates prohibiting the extraction enables the present technology to create learning data automatically and selectively with accuracy. In contrast, Bucher generally describes extracting "information from a knowledge graph to provide training data in the form of graph-based representations of molecules and the known or suspected bioactivity of those molecules with certain proteins." Bucher, at Col. 10, Lines 50-54. Accordingly, Bucher focuses on data extraction from a knowledge graph with a predetermined graph structure. If one skilled in the art were to modify the training data generation of Bucher to combine with the teachings of Li and He, the result would still describe generating training data in graph-based representation based on data in a variety of file types. The result would not need to describe extracting a text from data in a shared memory and excluding a subsequent text from being a part of learning data based on meta information that indicates prohibiting the extraction.
This argument is considered moot in view of an updated prior art search necessitated by the amendments to the claims. Details on an updated 35 U.S.C. 103 rejection can be found below.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Claims 1, 7, and 8 recite A data collection device comprising a [processor] configured to execute operations comprising: acquiring data from a shared memory in response to receiving a notification about the data stored in the shared memory, wherein the shared memory is accessible by one or more users, and the shared memory comprises a condition for extracting data; determining of the acquired data being in a predetermined data format, wherein the predetermined data format represents a format in which text included in the data is extractable by executing a predetermined library of computer-executables for extracting text; extracting, by the processor executing the predetermined library, the text included in the data by a text extraction operation according to the determined data in the predetermined data format to automatically create at least a part of learning data for training a [machine learning model], wherein the predetermined library is distinct from the machine learning model; and storing the extracted text as the learning data in a database; acquiring subsequent data from the shared memory, wherein the subsequent data comprises a subsequent text and meta information; excluding, based at least on the meta information, the subsequent text from extracting and storing in the database, wherein the meta information indicates prohibiting the extraction of the subsequent text, and the meta information satisfies the condition of the shared memory; and training, based on the automatically created at least the part of the learning data, the machine learning model and performing, by the machine learning model, a task of processing a natural language input.
The limitations in these claims, as drafted, are a process that, under broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. Acquiring data from a shared storage is the human equivalent of referencing a document at a library or a shared filing cabinet at work/school. A notification could be someone requesting you to access this information. The human mind is capable of understanding the format of a document just by looking at it, for example, whether it is handwritten, typed, or words within an image. While the human mind cannot execute a computer-executable predetermined library, it is capable of determining whether a certain piece of text would be extractable by that library. The human mind is also capable of making the design decision to use this text for training a machine learning model. A predetermined library could also represent whether all the extracted terms are present in a dictionary, or whether the writing is in a legible enough manner that the words could be traced onto another piece of paper to extract them. A human can extract words from text by writing them down separately or tracing them onto a new piece of paper. A human can store these extracted words by keeping them in a journal/notebook; for example, a student taking notes out of a textbook is extracting terms and storing them. The human mind can continue to look at text and repeat the above steps. The human mind can exclude pieces of text based on preset conditions, for example, not including any text found in magazines as opposed to books. The meta information in this sense could be the title/author of a text, the section of a library the text was found in, the format the text is in, etc. The human mind is capable of making the design decision to use the data gathered as training data for a model that processes natural language input.
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. In particular, claims 1, 7, and 8 recite a processor and a machine learning model. The processor is merely used to apply the method via a computing device. The processor is detailed in paragraph 23 of the specification and is given a generic description of a computer component. The machine learning model is merely used to apply steps that the human mind can perform. The machine learning model is given no specific structure within the specification and thus can be assumed to be a general-purpose implementation of an existing machine learning model. Claim 8 specifically lists the additional component of a computer-readable non-transitory recording medium. The recording medium is merely used to apply the method via a computing device. The recording medium is detailed in paragraph 22 of the specification with generic examples of the component provided. Accordingly, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
Claims 2, 9, and 14 recite wherein the extracting further comprises: when the result of the determining indicates that the format of the data is the format in which the text included in the data is extractable by a predetermined library, extracting the text included in the data by the library; and when the result of the determining indicates that the format of the data is not the format in which the text included in the data is extractable by a predetermined library, extracting the text included in the data by [[OCR]] optical character recognition.
The limitations in these claims, as drafted, are a process that, under broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. The human mind is capable of determining whether a document is in a format in which a library could be used to extract data from it. Also, the human equivalent of extracting data via a library would be simply handwriting the text one wishes to extract into another location. Furthermore, the human mind can determine if a document is an image (not extractable using a library) and read/rewrite any words found within the image. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. These claims do not recite any additional elements that were not present in the independent claims. Accordingly, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
Claims 3, 10, and 15 recite wherein the storing further comprises processing the text into an input format of the machine learning model, and storing the processed text in the database as the learning data.
The limitation in these claims, as drafted, is a process that, under broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. Using the text as an input to train a machine learning model and also storing that learning data in a database are design decisions when training a model; the human mind is capable of making such design decisions. Furthermore, the human mind is capable of changing the format of text and storing it in a database, for example, someone rewriting a note into more legible handwriting before storing it away. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. These claims do not recite any additional elements that were not present in the independent claims. Accordingly, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
Claims 4, 11, and 16 recite wherein the storing further comprises dividing the text into predetermined units, and storing each piece of the divided text in the database as the learning data.
The limitation in these claims, as drafted, is a process that, under broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. A human can break text into predetermined units, for example, using line breaks or writing them on individual note cards before storing the documents away in a folder/filing cabinet. The decision to use this type of data as learning data is a design decision associated with the MLM that the human mind is capable of making. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. These claims do not recite any additional elements that were not present in the independent claims. Accordingly, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
Claims 5, 12, and 17 recite wherein the shared storage area comprises at least one of a shared folder of a storage existing in a local area network or a shared folder of an external storage available via the Internet.
The limitation in these claims, as drafted, is a process that, under broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. The human equivalent of this type of storage setup would be deciding to store documents in a cabinet that all your coworkers at the office have access to versus scanning the document and uploading it to a website. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. These claims do not recite any additional elements that were not present in the independent claims. Accordingly, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
Claims 6, 13, and 18 recite wherein the database comprises a data store having a search function for the text.
The limitation in these claims, as drafted, is a process that, under broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. The human mind equivalent of this is organizing a filing cabinet alphabetically where a human can perform a “search function” of going letter by letter until they find the document they are looking for. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. These claims do not recite any additional elements that were not present in the independent claims. Accordingly, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
Claim 19 recites the processor further configured to execute operations comprising automatically creating a plurality of paragraphs by dividing the extracted text into the plurality of paragraphs, wherein the condition comprises a paragraph as a predetermined unit of the learning data.
The limitation in these claims, as drafted, is a process that, under broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. The human mind is capable of segmenting text into paragraphs, whether it be purely conceptually or literally cutting paragraphs out of the text with scissors. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. These claims do not recite any additional elements that were not present in the independent claims. Accordingly, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
Claim 20 recites the processor further configured to execute operations comprising automatically creating a plurality of sentences by dividing the extracted text into the plurality of sentences, wherein the condition comprises a sentence as a predetermined unit of the learning data.
The limitation in these claims, as drafted, is a process that, under broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. The human mind is capable of segmenting text into sentences, whether it be purely conceptually or literally cutting sentences out of the text with scissors. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. These claims do not recite any additional elements that were not present in the independent claims. Accordingly, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
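For illustration only, the division of extracted text into predetermined units (paragraphs in claim 19, sentences in claim 20) could be sketched as follows. The function names and boundary rules are hypothetical and do not appear in the claims or the specification; this is merely one way such a division might be performed.

```python
# Hypothetical sketch: dividing extracted text into predetermined units
# (paragraphs or sentences) before storing it as learning data.
import re

def split_into_paragraphs(text: str) -> list[str]:
    # Treat a blank line as the paragraph boundary (an assumed convention).
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_into_sentences(text: str) -> list[str]:
    # Naive sentence boundary: ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sample = "First sentence. Second sentence!\n\nA new paragraph."
paragraphs = split_into_paragraphs(sample)  # 2 paragraphs
sentences = split_into_sentences(sample)    # 3 sentences
```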
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-18 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent Publication US 11256995 B1 (Bucher et al.) in view of “Apache Tika: What is it and why should I use it?” (Li), China Patent Publication CN 111857942 A (He), and US Patent Publication US 11494512 B2 (Sarferaz).
Regarding Claims 1, 7, and 8, Bucher et al. teaches A data collection device comprising a processor configured to execute operations comprising:
(According to a general methodology description, generating active examples (i.e., chemically valid ligand-receptor pairs) is performed by the first step of gathering known active examples from databases, web-crawlers, and other sources previously described in past figures 1401.) (Column 15, 37-42).
(a computing device comprising a memory and a processor; a point-cloud based bioactivity module comprising a first plurality of programming instructions stored in the memory and operating on the processor, wherein the first plurality of programming instructions causes the computing device to:) (Column 2, Lines 7-12)
Bucher et al. collects data and uses a processor.
Claim 8 specifically recites A computer-readable non-transitory recording medium storing a computer-executable program instructions
(Because such information and program instructions may be employed to implement one or more systems or methods described herein, at least some network device aspects may include non-transitory machine-readable storage media, which, for example, may be configured or designed to store program instructions,) (Column 36, Lines 26-31)
determining of the acquired data being in a predetermined data format, wherein the predetermined data format represents a format in which text included in the (data is extractable by executing a predetermined library of computer-executables for extracting text) (Taught by Li);
(the data extraction engine 212 may first determine a format of each of the materials received (e.g., text, PDFs, images), and perform conversions of materials not in a machine-readable or extractable format (e.g., performing optical character recognition (OCR) on PDFs and images to extract any text contained therein). Once the text has been extracted from the materials, natural language processing (NLP) techniques may be used to extract useful information from the materials for use in analysis by machine learning algorithms.) (Column 9, Lines 61-66).
Bucher et al. makes a determination based on the format of the document and then extracts text from it. Bucher et al. specifically describes determining whether the data is in a machine-readable format, which suggests that a predetermined library is used for the extractions. The use of various computer-executable predetermined libraries is taught by Li below.
extracting, by the processor executing the predetermined library, the text included in the data by a text extraction operation according to the determined data in the predetermined data format to automatically create at least a part of learning data for training a machine learning model, wherein the predetermined library is distinct from the machine learning model; and
(the data extraction engine 212 may first determine a format of each of the materials received (e.g., text, PDFs, images), and perform conversions of materials not in a machine-readable or extractable format (e.g., performing optical character recognition (OCR) on PDFs and images to extract any text contained therein). Once the text has been extracted from the materials, natural language processing (NLP) techniques may be used to extract useful information from the materials for use in analysis by machine learning algorithms.) (Column 9, Lines 61-66).
(The data analysis engine 220 utilizes the information gathered, organized, and stored in the data curation platform 210 to train machine learning algorithms at a training stage 230 and conduct analyses in response to queries and return results based on the analyses at an analysis stage 240.) (Column 10, Lines 34-38) (Column 10, Lines 35-49)
(At the training stage 230, information from the knowledge graph 215 is extracted to provide training data in the form of graph-based representations of molecules and the known or suspected bioactivity of those molecules with certain proteins.) (Column 10, Lines 50-54)
After determining the format of the document, text is extracted and used for the rest of the process. This extracted data is stored in a knowledge graph and used to train machine learning algorithms.
storing the extracted text as the learning data in a database.
(The data analysis engine 220 utilizes the information gathered, organized, and stored in the data curation platform 210 to train machine learning algorithms at a training stage 230 and conduct analyses in response to queries and return results based on the analyses at an analysis stage 240.) (Column 10, Lines 34-38) (Column 10, Lines 35-49)
(At the training stage 230, information from the knowledge graph 215 is extracted to provide training data in the form of graph-based representations of molecules and the known or suspected bioactivity of those molecules with certain proteins.) (Column 10, Lines 50-54)
In Fig. 2 it can be seen that the Data Extraction Engine contains a knowledge graph for storing data extracted from the documents, which is equivalent to a database. The data stored in this section is then used for training of ML algorithms. The machine learning algorithms respond to queries, which is a natural language processing task.
training, based on the automatically created at least the part of the learning data, the machine learning model and performing, by the machine learning model, a task of processing a natural language input.
(The data analysis engine 220 utilizes the information gathered, organized, and stored in the data curation platform 210 to train machine learning algorithms at a training stage 230 and conduct analyses in response to queries and return results based on the analyses at an analysis stage 240.) (Column 10, Lines 34-38) (Column 10, Lines 35-49)
Bucher et al. does not explicitly teach: acquiring data from a shared memory in response to receiving a notification about the data stored in the shared memory, wherein the shared memory is accessible by one or more users, and the shared memory comprises a condition for extracting data; data is extractable by executing a predetermined library of computer-executables for extracting text; acquiring subsequent data from the shared memory, wherein the subsequent data comprises a subsequent text and meta information; excluding, based at least on the meta information, the subsequent text from extracting and storing in the database, wherein the meta information indicates prohibiting the extraction of the subsequent text, and the meta information satisfies the condition of the shared memory; and
However, Li teaches data is extractable by executing a predetermined library of computer-executables for extracting text;
(According to their site, “The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). … It provides a Java library but also has server and command line tools that make it suitable for use from other programming languages.) (Page 2, Paragraph 2).
Li discusses an existing technology known as Apache Tika, which provides a library for extracting text information from many different file types, including spreadsheets, text documents, and PDFs.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention of the instant application to modify the multi-format training data extraction method as taught by Bucher et al. to use an existing library for the text extraction of certain formats as taught by Li. This would have been an obvious improvement, enabling extraction of data from over a thousand different file types and broadening the scope of the invention. (Li, Page 2, Paragraph 2).
Bucher et al. in view of Li does not explicitly teach: acquiring data from a shared memory in response to receiving a notification about the data stored in the shared memory, wherein the shared memory is accessible by one or more users, and the shared memory comprises a condition for extracting data; acquiring subsequent data from the shared memory, wherein the subsequent data comprises a subsequent text and meta information; excluding, based at least on the meta information, the subsequent text from extracting and storing in the database, wherein the meta information indicates prohibiting the extraction of the subsequent text, and the meta information satisfies the condition of the shared memory; and
However, He teaches acquiring data from a shared memory in response to receiving a notification about the data stored in the shared memory, wherein the shared memory is accessible by one or more users, and the shared memory comprises a condition for extracting data;
(Similarly, the network file system nfs server, can be shared file or directory between different host systems through network such as local area network. Therefore, the target computing server can quickly network file system nfs server obtains the training data of the training deep learning and the training data of the training of the model deep learning to docker container, so as to improve the efficiency of the deep learning model.) (Page 16, Paragraph 4).
He utilizes a shared storage area, in this case a network file server, to obtain training data to use in training a deep learning model.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention of the instant application to modify the multi-format training data extraction method as taught by Bucher et al. in view of Li to acquire data from a shared storage as taught by He. This would have been an obvious improvement, allowing data from multiple hosts, providing more storage space, and making multiple backups available. (He, Page 16, Paragraph 3).
Bucher et al. in view of Li and He does not explicitly teach: acquiring subsequent data from the shared memory, wherein the subsequent data comprises a subsequent text and meta information; excluding, based at least on the meta information, the subsequent text from extracting and storing in the database, wherein the meta information indicates prohibiting the extraction of the subsequent text, and the meta information satisfies the condition of the shared memory; and
However, Sarferaz teaches acquiring subsequent data from the shared memory, wherein the subsequent data comprises a subsequent text and meta information;
(A request is received from or on behalf of a machine learning application for data stored in a data store, such as data maintained in a relational database. A data view associated with the request is retrieved. The data view includes computer-implementable instructions to retrieve a first selected portion of data from the data store. The computer-implementable instructions also include instructions to filter, and not return in response to the request, a second portion of data selected from the first selected portion of data. The second portion of data, that is not returned, corresponds to data of the first selected portion of data having an indicator that a given data element of the first selection portion of data has been blocked from use.) (Col. 2, Lines 7-20).
(The remote storage repository 132 can be, for instance, a cloud-based storage system. In addition, or alternatively, the application logic 120 may access data stored in the remote storage repository 132. Similarly, although not shown, in at least some cases, the local machine learning component 122 may access data stored in the remote storage repository 132.) (Col. 6, Lines 36-43).
Sarferaz teaches a system for gathering training data in which a subsequent request for information can be made (second portion of data).
excluding, based at least on the meta information, the subsequent text from extracting and storing in the database, wherein the meta information indicates prohibiting the extraction of the subsequent text, and the meta information satisfies the condition of the shared memory; and
(Techniques and solutions are described for restricting data that is provided to a machine learning application. Restrictions can be based on use status information, such as use status information associated with a retention manager and indicating whether data is blocked from use. Data identifiers used by a cloud-based system can be correlated with archiving objects of a local system so that the cloud-based system can receive use status information to avoid using blocked data. Restrictions can include restricting data based on whether a data subject has provided consent that allows the data to be used by the machine learning application.) (Col. 1, Lines 56-66).
(However, data of the application data 232 that has been marked as blocked may still be available to the application 208 unless additional steps are taken. If made available, breaches of data privacy or data protection policies may occur. In a particular example, data views 224 are annotated to include filters that will exclude blocked data from results provided in response to a request, including data provided to a machine learning algorithm 220 in response to a request from an application 208. A data view 224, in a specific example, can include query language statements to exclude blocked data. A portion of a suitable view definition can include statements such as:) (Col. 9, Line 54 to Col. 10, Line 9).
(where TrainingInput is the data view 224 being defined, <TableName> is the name of a table in a data store holding application data 232, <BlockedField> is the field of that table holding flag information about whether records are blocked, < > represents the “not equal” operator, and a value of “x” indicates that data is blocked. Other examples of such a statement are:) (Col. 10, Lines 12-18).
Sarferaz uses restrictions to block certain inputs from being included in training data; for example, metadata such as a "blocked field" may be marked with the value "x" to indicate that a record is blocked.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the multi-format training data extraction method as taught by Bucher et al. in view of Li and He to filter subsequent data as taught by Sarferaz. This would have been an obvious improvement, avoiding breaches of data privacy or data protection policies. (Sarferaz, Col. 9, Line 54 to Col. 10, Line 9).
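The blocked-field exclusion that Sarferaz describes can be illustrated with a short sketch. The field name "blocked_field" and the flag value "x" stand in for Sarferaz's <BlockedField> placeholder and are illustrative assumptions, not an actual API of the reference:

```python
# Sketch of Sarferaz-style exclusion: a record whose metadata flags it as
# blocked (a hypothetical "blocked_field" equal to "x") is filtered out
# before its text is extracted and stored as learning data.
def filter_training_records(records):
    """Return only records whose metadata does not mark them as blocked."""
    return [r for r in records if r.get("blocked_field") != "x"]

records = [
    {"text": "usable sample", "blocked_field": ""},
    {"text": "restricted sample", "blocked_field": "x"},
]
allowed = filter_training_records(records)  # only the unblocked record remains
```

This mirrors the effect of the annotated data view in Sarferaz: the filtering happens before any data reaches the machine learning application.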
Regarding Claims 2, 9, and 14, Bucher et al. in view of Li, He, and Sarferaz teaches the system of claims 1, 7, and 8.
Furthermore, Bucher et al. and Li teach wherein the extracting further comprises:
when the result of the determining indicates that the format of the data is the format in which the text included in the data is extractable by a predetermined library, extracting the text included in the data by the library; and
Bucher (the data extraction engine 212 may first determine a format of each of the materials received (e.g., text, PDFs, images), and perform conversions of materials not in a machine-readable or extractable format (e.g., performing optical character recognition (OCR) on PDFs and images to extract any text contained therein).) (Column 9, Lines 61-66).
Li (According to their site, “The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). … It provides a Java library but also has server and command line tools that make it suitable for use from other programming languages.) (Page 2, Paragraph 2).
Bucher et al. describes determining the format of a file before extraction and then extracting text from the documents. Li describes existing library technology used to extract text from many different document types.
Furthermore, Bucher et al. teaches when the result of the determining indicates that the format of the data is not the format in which the text included in the data is extractable by a predetermined library, extracting the text included in the data by [[OCR]] optical character recognition.
(the data extraction engine 212 may first determine a format of each of the materials received (e.g., text, PDFs, images), and perform conversions of materials not in a machine-readable or extractable format (e.g., performing optical character recognition (OCR) on PDFs and images to extract any text contained therein).) (Column 9, Lines 61-66).
If the text is not in a machine-readable format, then it is extracted using OCR.
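The two-branch extraction flow discussed above (library extraction when the format is supported, OCR otherwise) can be sketched as follows. The format list and the extractor callables are illustrative assumptions, not the actual APIs of Bucher et al. or of a library such as Apache Tika:

```python
# Determine the file format first; extract with a text-extraction library
# when the format is supported, otherwise fall back to OCR (as Bucher et al.
# does for PDFs and images that are not machine-readable).
LIBRARY_FORMATS = {"txt", "html", "docx"}

def extract_text(filename, library_extract, ocr_extract):
    """Route a file to library extraction or OCR based on its extension."""
    fmt = filename.rsplit(".", 1)[-1].lower()
    if fmt in LIBRARY_FORMATS:
        return library_extract(filename)
    return ocr_extract(filename)

# Stub extractors standing in for a real library and a real OCR engine.
text_a = extract_text("paper.docx", lambda f: "library text", lambda f: "ocr text")
text_b = extract_text("scan.png", lambda f: "library text", lambda f: "ocr text")
```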
Regarding Claims 3, 10, and 15, Bucher et al. in view of Li, He, and Sarferaz teaches the system of claims 1, 7, and 8.
Furthermore, Bucher et al. teaches wherein the storing further comprises processing the text into an input format of the machine learning model, and storing the processed text in the database as the learning data.
(Separately from the knowledge graph 215, vector representations of proteins, molecules, interactions, and other information may be represented as vectors 216, which may either be extracted from the knowledge graph 215 or may be created directly from data received from the data extraction engine 212.) (Column 10, Lines 21-26).
(Simultaneously, a sequence-based machine learning algorithm is likewise trained, but using information extracted 216 from the knowledge graph 215 in the form of vector representations of protein segments and the known or suspected bioactivity of those protein segments with certain molecules. The vector representations of the protein segments and their associated bioactivities are used to train the concatenated outputs 235, as well as the machine learning algorithms) (Column 10, Lines 61-67).
Bucher et al. discusses how the information extracted and stored in the knowledge graph can be vectorized in order to train the machine learning model.
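As a minimal illustration of processing text into a model input format, the sketch below maps a text to a fixed-length count vector over a small vocabulary. The vocabulary and tokenization are assumptions made for the sketch, not Bucher's actual vector construction (the vectors 216):

```python
# Map a text to a fixed-length count vector over a vocabulary, a simple
# stand-in for converting extracted text into machine-learning input.
def vectorize(text, vocabulary):
    """Count occurrences of each vocabulary word in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

vocab = ["protein", "molecule", "binds"]
vec = vectorize("Protein A binds molecule B", vocab)  # one count per vocab word
```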
Regarding Claims 4, 11, and 16, Bucher et al. in view of Li, He, and Sarferaz teaches the system of claims 1, 7, and 8.
Furthermore, Bucher et al. teaches wherein the storing further comprises dividing the text into predetermined units, and storing each piece of the divided text in the database as the learning data.
(Once the text has been extracted from the materials, natural language processing (NLP) techniques may be used to extract useful information from the materials for use in analysis by machine learning algorithms. … Of particular importance is recognition of standardized biochemistry naming conventions including, but not limited to, stock nomenclature, International Union of Pure and Applied Chemistry (IUPAC) conventions, and simplified molecular-input line-entry system (SMILES) and FASTA text-based molecule representations. The data extraction engine 212 feeds the extracted data to a knowledge graph constructor 213, which constructs a knowledge graph 215 based on the information in the data, representing informational entities (e.g., proteins, molecules, diseases, study results, people) as vertices of a graph and relationships between the entities as edges of the graph.) (Column 9, Line 66 to Column 10, Line 18).
(“FASTA” as used herein means any version of the FASTA family (e.g., FASTA, FASTP, FASTA, etc.) of chemical notations for describing nucleotide sequences or amino acid (protein) sequences using text (e.g., ASCII) strings.) (Column 6, Lines 61-64).
(“SMILES” as used herein means any version of the “simplified molecular-input line-entry system,” which is form of chemical notation for describing the structure of molecules using short text (e.g., ASCII) strings.) (Column 7, Lines 43-46).
The extracted data is broken up into predetermined units (proteins, molecules, diseases, study results, people) and then stored in a knowledge graph, which is a form of database. Bucher et al. specifically mentions SMILES and FASTA representations, which are text strings used to represent some of the categories included in the knowledge graph. By identifying and categorizing these text strings, the system divides the text into predetermined units. This data is considered learning data, as it is acquired from an outside source and used to train the machine learning algorithms.
Regarding Claims 5, 12, and 17, Bucher et al. in view of Li, He, and Sarferaz teaches the system of claims 1, 7, and 8.
Furthermore, He teaches wherein the shared storage area comprises at least one of a shared folder of a storage existing in a local area network or a shared folder of an external storage available via the Internet.
(Similarly, the network file system nfs server, can be shared file or directory between different host systems through network such as local area network. Therefore, the target computing server can quickly network file system nfs server obtains the training data of the training deep learning and the training data of the training of the model deep learning to docker container, so as to improve the efficiency of the deep learning model.) (Page 16, Paragraph 4).
He states that the shared storage can be implemented over a local area network.
Regarding Claims 6, 13, and 18, Bucher et al. in view of Li, He, and Sarferaz teaches the system of claims 1, 7, and 8.
Furthermore, Bucher et al. teaches wherein the database comprises a data store having a search function for the text.
(As an example, the user may submit a query for identification of molecules likely to have similar bioactivity to a molecule with known bioactivity. The data analysis engine 113 may process the knowledge graph 111 through a GNN to identify such molecules based on the information and relationships in the knowledge graph 111.) (Column 8, Lines 25-30).
When a user submits a query, the knowledge graph, which is the database in Bucher et al., is searched to find an answer, thus operating as a search function.
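A data store with a text search function, in the sense of the claim, can be sketched as follows. This is a simplified stand-in; Bucher's knowledge graph is queried through a GNN rather than by keyword matching:

```python
# Minimal text store: stored learning-data texts can be searched by keyword.
class TextStore:
    def __init__(self):
        self.texts = []

    def store(self, text):
        """Add a text to the data store."""
        self.texts.append(text)

    def search(self, keyword):
        """Return all stored texts containing the keyword (case-insensitive)."""
        return [t for t in self.texts if keyword.lower() in t.lower()]

store = TextStore()
store.store("Molecule X binds protein Y")
store.store("Study results for disease Z")
hits = store.search("protein")  # matches only the first stored text
```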
Claims 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent Publication US 11256995 B1 (Bucher et al.) in view of “Apache Tika: What is it and why should I use it?” (Li), China Patent Publication CN 111857942 A (He), US Patent Publication US 11494512 B2 (Sarferaz), and US Patent Publication US 11574121 B2 (Bhat et al.).
Regarding Claim 19, Bucher et al. in view of Li, He, and Sarferaz teaches the method of claim 1.
Bucher et al. in view of Li, He, and Sarferaz does not explicitly teach: the processor further configured to execute operations comprising automatically creating a plurality of paragraphs by dividing the extracted text into the plurality of paragraphs, wherein the condition comprises a paragraph as a predetermined unit of the learning data.
However, Bhat et al. teaches the processor further configured to execute operations comprising automatically creating a plurality of paragraphs by dividing the extracted text into the plurality of paragraphs, wherein the condition comprises a paragraph as a predetermined unit of the learning data.
(the Parser 205 can parse the input Documents 105 to delineate logically distinct sections. For example, the Parser 205 may split the input text into individual sentences, paragraphs, sections, chapters, or any other logical separation. In some embodiments, the Parser 205 parses the text to extract a hierarchical structure. For example, the Parser 205 may split each sentence and identify, for each sentence, the corresponding paragraph, subheading, heading, and/or chapter.) (Col. 4, Lines 20-31).
Bhat et al. teaches a machine learning based system for evaluating text in input documents. The system parses incoming documents into separate paragraphs. This can be visualized in Fig. 2.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the multi-format training data extraction method as taught by Bucher et al. in view of Li, He, and Sarferaz to segment incoming text into paragraphs as taught by Bhat et al. This would have been an obvious improvement, creating smaller portions of data that can be individually processed, allowing the system to handle the data more efficiently. (Bhat et al., Col. 1, Lines 9-18).
Regarding Claim 20, Bucher et al. in view of Li, He, and Sarferaz teaches the method of claim 1.
Bucher et al. in view of Li, He, and Sarferaz does not explicitly teach: the processor further configured to execute operations comprising automatically creating a plurality of sentences by dividing the extracted text into the plurality of sentences, wherein the condition comprises a sentence as a predetermined unit of the learning data.
However, Bhat et al. teaches the processor further configured to execute operations comprising automatically creating a plurality of sentences by dividing the extracted text into the plurality of sentences, wherein the condition comprises a sentence as a predetermined unit of the learning data.
(the Parser 205 can parse the input Documents 105 to delineate logically distinct sections. For example, the Parser 205 may split the input text into individual sentences, paragraphs, sections, chapters, or any other logical separation. In some embodiments, the Parser 205 parses the text to extract a hierarchical structure. For example, the Parser 205 may split each sentence and identify, for each sentence, the corresponding paragraph, subheading, heading, and/or chapter.) (Col. 4, Lines 20-31).
Bhat et al. teaches a machine learning based system for evaluating text in input documents. The system parses incoming documents into separate sentences. This can be visualized in Fig. 2.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the multi-format training data extraction method as taught by Bucher et al. in view of Li, He, and Sarferaz to segment incoming text into sentences as taught by Bhat et al. This would have been an obvious improvement, creating smaller portions of data that can be individually processed, allowing the system to handle the data more efficiently. (Bhat et al., Col. 1, Lines 9-18).
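The parsing step Bhat et al. describes, splitting extracted text into paragraphs or sentences as the predetermined unit of learning data, can be sketched as follows. The splitting rules here are simplified illustrations, not Bhat's actual parser:

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Split a paragraph into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

doc = "First sentence. Second sentence.\n\nA new paragraph."
paras = split_paragraphs(doc)
sents = split_sentences(paras[0])
```

Either unit can then serve as one piece of learning data, consistent with claims 19 and 20 respectively.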
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICHOLAS DANIEL LOWEN whose telephone number is (571)272-5828. The examiner can normally be reached Mon-Fri 8:00am - 4:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Paras D Shah can be reached at (571) 270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NICHOLAS D LOWEN/Examiner, Art Unit 2653
/Paras D Shah/Supervisory Patent Examiner, Art Unit 2653
02/24/2026