Prosecution Insights
Last updated: April 19, 2026
Application No. 18/210,498

LAYOUT AWARE MULTI-MODAL NETWORKS FOR DOCUMENT UNDERSTANDING

Final Rejection (§103)
Filed: Jun 15, 2023
Examiner: BARNES JR, CARL E
Art Unit: 2178
Tech Center: 2100 — Computer Architecture & Software
Assignee: Oracle International Corporation
OA Round: 2 (Final)
Grant Probability: 32% (At Risk)
Expected OA Rounds: 3-4
Expected Time to Grant: 4y 4m
Grant Probability With Interview: 57%

Examiner Intelligence

Career Allow Rate: 32% (grants only 32% of cases; 65 granted / 202 resolved; -22.8% vs TC avg)
Interview Lift: +25.2% (strong), based on resolved cases with interview
Typical Timeline: 4y 4m average prosecution; 32 applications currently pending
Career History: 234 total applications across all art units
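As a quick sanity check on the figures above (assuming the interview lift is the simple difference between the with-interview grant probability and the career allow rate; the dashboard's own cohort math may differ slightly):

```python
# Back-of-the-envelope check of the examiner statistics above.
granted, resolved = 65, 202
career_allow_rate = granted / resolved       # the "32%" career allow rate
with_interview = 0.57                        # grant probability with interview

interview_lift = with_interview - career_allow_rate

print(f"career allow rate: {career_allow_rate:.1%}")  # 32.2%
print(f"interview lift:    {interview_lift:+.1%}")    # +24.8%
```

The naive difference (+24.8%) lands close to, but not exactly on, the displayed +25.2%, which is consistent with the lift being computed from unrounded with/without-interview cohort figures.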

Statute-Specific Performance

§101: 14.3% (-25.7% vs TC avg)
§103: 62.6% (+22.6% vs TC avg)
§102: 9.0% (-31.0% vs TC avg)
§112: 8.7% (-31.3% vs TC avg)

Tech Center average is an estimate; figures are based on career data from 202 resolved cases.
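The per-statute deltas can be inverted to recover the implied Tech Center baseline (a sketch; the rates are simply the percentages shown above expressed as fractions):

```python
# Recover the implied TC 2100 baseline from each statute's rate and delta.
examiner_rate = {"101": 0.143, "103": 0.626, "102": 0.090, "112": 0.087}
delta_vs_tc   = {"101": -0.257, "103": 0.226, "102": -0.310, "112": -0.313}

implied_tc_avg = {s: round(examiner_rate[s] - delta_vs_tc[s], 3)
                  for s in examiner_rate}
print(implied_tc_avg)  # every statute implies the same 0.4 baseline
```

All four statutes imply an identical 40% baseline, which suggests the dashboard compares each statute against a single overall TC-average estimate rather than per-statute averages.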

Office Action

§103

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

Claims 1-20 were previously pending and subject to the non-final action filed 09/04/2025. In the response filed on 11/21/2025, claims 1, 3, 9-10, 12, 14-15, 17-18, and 20 were amended, claim 11 was canceled, and claim 21 was newly added. Therefore, claims 1-10 and 12-21 are currently pending and subject to the final action below.

Response to Arguments

Applicant's arguments, see pages, filed 11/21/2025, with respect to 35 U.S.C. 101 have been fully considered and are persuasive. The 101 rejection of claims 1-20 has been withdrawn. Applicant's arguments, see pages 13-16, filed 11/21/2025, with respect to 35 U.S.C. 102(a)(2) of claims 1-3, 9-12, and 14-17 have been fully considered but are moot because the arguments do not apply to the new combinations of references being used in the current rejection. Applicant's arguments, see pages 13-16, filed 11/21/2025, with respect to 35 U.S.C. 103 of claims 4-8, 13, and 18-20 have been fully considered but are moot because the arguments do not apply to the new combinations of references being used in the current rejection.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-3, 9-12, 14-17 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Morariu (US PGPUB: US 20240161529 A1, Filed Date: Nov. 15, 2022) in view of Wu (US PGPUB: 20240289550 A1, Filed Date: Feb. 28, 2023).

Regarding independent claim 1, Morariu teaches:

A method comprising: identifying a plurality of word embeddings that was generated based on a set of words that was extracted from an image of a document; (Morariu − [0023] As mentioned above, the document hierarchy generation system generates feature embeddings from a digital document image. For example, the document hierarchy generation system extracts multimodal information from a digital document image for generating feature embeddings. [0048-0049] As indicated, the document hierarchy generation system 104 extracts visual elements from a digital document image and uses the visual elements to generate feature embeddings. To illustrate, visual elements include graphical components such as text; a feature embedding of text is a word embedding.)

based on the image, (Morariu − [0048-0049] For example, visual elements include graphical components, items, or portions of a digital document image.)

determining a set of table features of one or more tables in the document; (Morariu − [0048-0049] As indicated, the document hierarchy generation system 104 extracts visual elements from a digital document image and uses the visual elements to generate feature embeddings. To illustrate, visual elements include graphical components such as tables. [0106] For example, the document hierarchy generation system 104 can utilize a training dataset that comprises a ground truth document hierarchy composed of elements arranged in a tree-like format where a higher element (e.g., a list or field) may be a parent of one or more basic elements (e.g., OCR extracted text). The document hierarchy can include a variety of element classifications: TableRow, Table, TableCell or Header. TableRow, Table, TableCell or Header are a set of table features.)

identifying one or more table embeddings that were generated based on a set of table features of one or more tables that were detected in the image of the document; (Morariu − [0048-0050] The document hierarchy generation system 104 extracts visual elements from a digital document image and uses the visual elements to generate feature embeddings. For example, visual elements include graphical components, items, or portions of a digital document image. To illustrate, visual elements include graphical components such as text, images, buttons, tables, menus, fields, or columns. Visual elements can include tables used to generate feature embeddings.)

inputting, into a machine-learned model, the plurality of word embeddings (Morariu − [0054] More specifically, the document hierarchy generation system 104 utilizes tokens extracted using OCR as input for a trained natural language processing model to generate semantic feature embeddings. For example, semantic feature embeddings are feature embeddings that reflect semantic meaning of text in a digital document image. For example, semantic feature embeddings include a vector of semantic features generated by a machine learning model, such as a neural network. The document hierarchy generation system 104 can utilize a variety of language machine learning models.)

Morariu does not explicitly teach: to generate a document embedding for the document. However, Wu teaches: inputting, into a machine-learned model, the plurality of word embeddings and the one or more table embeddings to generate a document embedding for the document, (Wu − [0015] a deep neural network module whose architecture may combine inputs including character shape embeddings and table feature embeddings. [0031-0034] The token shape features may include upper case, lower case, number, punctuation shape, etc. In this way, token shape feature extraction 210 may extract shape feature vectors of tokens. Responsive to obtaining the shape feature vectors of tokens, token shape feature encoder 224 may encode these shape feature vectors into token shape embeddings (word embeddings) which may have much lower dimensionality than shape feature vectors. [0034] Document images may often include tables. A table may include tokens that are organized according to their characteristics. Table feature encoder 228 may encode table feature vectors into table feature embeddings.)

wherein the document embedding represents the image of the document; (Wu − [0020] FIG. 2 illustrates a subsystem 200 including feature extractions and embedding encoders in detail according to an implementation of the disclosure. Subsystem 200 may receive tokenized document images 202 and generate different types of embeddings as discussed in the following. [0031-0034])

performing a task based on the document embedding; (Morariu − [0029] In one or more implementations, the document hierarchy generation system trains a neural network to generate document classifications and parent-child element links for digital document images. [0052] For example, the document hierarchy generation system 104 uses generated element classifications of the extracted visual elements (e.g., from a previous layer or object detector) and generates structural embeddings from the element classifications. For example, structural feature embeddings include vectors or numerical representations of the element classifications (e.g., categorical numerical representations or encodings generated by a machine learning model). [0058] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. Generating document classifications and parent-child element links for digital document images are tasks based on the document data representation.)

wherein the method is performed by one or more computing devices. (Morariu − [0133] Turning to FIG. 8, additional detail will now be provided regarding various components and capabilities of the document hierarchy generation system 104. In particular, FIG. 8 shows the document hierarchy generation system 104 implemented by the computing device 800 (e.g., the server(s) 101 and/or the client device 106 discussed above with reference to FIG. 1).)

Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Morariu and Wu, as each invention relates to determining document layout, structure and content within a digital document. Adding the teaching of Wu provides Morariu with a method for combining a plurality of document embeddings as a group. One of ordinary skill in the art would have been motivated to improve accuracy and scalability in constructing a neural network model for text analysis (Wu [0015]).
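For readers following the technology rather than the legal posture, the pipeline recited in claim 1 can be sketched roughly as follows. All dimensions, the mean-pooling fusion, and the random projection are illustrative assumptions, not the applicant's or the cited references' actual architectures:

```python
# A minimal sketch of the claim-1 pipeline: per-word embeddings plus
# per-table embeddings are fed to a model that emits one document embedding.
import numpy as np

rng = np.random.default_rng(0)

word_embeddings  = rng.normal(size=(120, 256))  # one 256-d vector per OCR'd word
table_embeddings = rng.normal(size=(2, 256))    # one vector per detected table

def fuse(words: np.ndarray, tables: np.ndarray) -> np.ndarray:
    """Stand-in model: pool each modality, concatenate, project to 256-d."""
    pooled = np.concatenate([words.mean(axis=0), tables.mean(axis=0)])  # (512,)
    projection = rng.normal(size=(512, 256)) / np.sqrt(512)             # untrained weights
    return pooled @ projection                                          # document embedding

doc_embedding = fuse(word_embeddings, table_embeddings)
assert doc_embedding.shape == (256,)

# "performing a task based on the document embedding" would consume
# doc_embedding here, e.g. nearest-neighbor classification against prototypes.
```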
Regarding dependent claim 2, which depends on claim 1, Morariu teaches: wherein the set of table features for a table detected in the image of the document includes two or more of: coordinates of the table, a width of the table, a height of the table, a number of columns in the table, or a number of rows in the table. (Morariu − [0024] For instance, the document hierarchy generation system uses spatial elements of the digital document image by using positional coordinates of visual elements within the digital document image to generate spatial feature embeddings. [0076] For example, the bounding box of each child element is represented through its upper-left ([x.sub.1, y.sub.1]) and bottom-right ([x.sub.2, y.sub.2]) co-ordinates that are normalized b=[x.sub.1/W, y.sub.1/H, x.sub.2/W, y.sub.2/H], where H and W are the height and width of digital document image 302. The height and width of visual elements such as the tables within the document, and the positional coordinates of visual elements, read on these table features.)

Regarding dependent claim 3, which depends on claim 1, Morariu teaches: further comprising: based on the image, (Morariu − [0048-0049] For example, visual elements include graphical components, items, or portions of a digital document image.)

determining a set of layout features of the document; generating one or more layout embeddings based on the set of layout features of the document; wherein inputting comprises inputting the one or more layout embeddings into the machine-learned model; (Morariu − [0054] More specifically, the document hierarchy generation system 104 utilizes tokens extracted using OCR as input for a trained natural language processing model to generate semantic feature embeddings. For example, semantic feature embeddings are feature embeddings that reflect semantic meaning of text in a digital document image. For example, semantic feature embeddings include a vector of semantic features generated by a machine learning model, such as a neural network. The document hierarchy generation system 104 can utilize a variety of language machine learning models. [0058] FIG. 2. [0081] As illustrated in FIG. 4, the document hierarchy generation system 104 utilizes the spatial feature embeddings 306, the semantic feature embeddings 310, the visual feature embeddings 314, and/or the structural feature embeddings 318 as input for a multimodal transformer encoder 402 (and the neural network 400).)

wherein the document data representation is generated also based on the one or more layout embeddings. (Morariu − [0066] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 210 of generating a digital document hierarchy. For example, the document hierarchy generation system 104 generates a digital document hierarchy by combining the determined child visual elements and visual elements determined at each layer-wise analysis.)

Regarding dependent claim 9, which depends on claim 1, Morariu teaches: further comprising: identifying one or more image embeddings that were generated based on a set of image features of the image of the document; (Morariu − [0048-0049] As indicated, the document hierarchy generation system 104 extracts visual elements from a digital document image and uses the visual elements to generate feature embeddings. For example, visual elements include graphical components, items, or portions of a digital document image. To illustrate, visual elements include graphical components such as images, tables, or columns. In some implementations, the document hierarchy generation system utilizes boxes or other shapes (e.g., bounding shapes) to define visual elements. For example, the document hierarchy generation system determines a visual element by extracting a bounding box representing the visual element.)

wherein inputting comprises inputting the one or more image embeddings into the machine-learned model; (Morariu − [0054] More specifically, the document hierarchy generation system 104 utilizes tokens extracted using OCR as input for a trained natural language processing model to generate semantic feature embeddings. For example, semantic feature embeddings are feature embeddings that reflect semantic meaning of text in a digital document image. For example, semantic feature embeddings include a vector of semantic features generated by a machine learning model, such as a neural network. The document hierarchy generation system 104 can utilize a variety of language machine learning models. [0058] FIG. 2: in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). Thus, for example, an element classification can include a text block, choice group, table, or field.)

wherein the document data representation is generated also based on the one or more image embeddings. (Morariu − [0066] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 210 of generating a digital document hierarchy. For example, the document hierarchy generation system 104 generates a digital document hierarchy by combining the determined child visual elements and visual elements determined at each layer-wise analysis.)

Regarding dependent claim 10, which depends on claim 1, Morariu teaches: further comprising: based on the image, (Morariu − [0048-0049] For example, visual elements include graphical components, items, or portions of a digital document image.)
identifying a set of words extracted from the image; (Morariu − [0054] More specifically, the document hierarchy generation system 104 utilizes tokens extracted using OCR as input for a trained natural language processing model to generate semantic feature embeddings. For example, semantic feature embeddings are feature embeddings that reflect semantic meaning of text in a digital document image. For example, semantic feature embeddings include a vector of semantic features generated by a machine learning model, such as a neural network. The document hierarchy generation system 104 can utilize a variety of language machine learning models. [0058] FIG. 2: in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). Thus, for example, an element classification can include a text block, choice group, table, or field.)

generating the plurality of word data representations based on the set of words; (Morariu − [0048-0049] As indicated, the document hierarchy generation system 104 extracts visual elements from a digital document image and uses the visual elements to generate feature embeddings. To illustrate, visual elements include graphical components such as text (word data representations). For example, the document hierarchy generation system determines a visual element by extracting a bounding box representing the visual element. To illustrate, in some implementations, the document hierarchy generation system represents visual elements as words and bounding box coordinates (e.g., OCR tokens generated utilizing an optical character recognition model). The OCR model determines and identifies text in order to generate OCR tokens.)

based on the image, (Morariu − [0048-0049] For example, visual elements include graphical components, items, or portions of a digital document image.)

determining the set of table features of one or more tables in the document; (Morariu − [0048-0049] As indicated, the document hierarchy generation system 104 extracts visual elements from a digital document image and uses the visual elements to generate feature embeddings. To illustrate, visual elements include graphical components such as tables. [0106] For example, the document hierarchy generation system 104 can utilize a training dataset that comprises a ground truth document hierarchy composed of elements arranged in a tree-like format where a higher element (e.g., a list or field) may be a parent of one or more basic elements (e.g., OCR extracted text). The document hierarchy can include a variety of element classifications: TableRow, Table, TableCell or Header. TableRow, Table, TableCell or Header are a set of table features.)

generating the one or more table embeddings based on the set of table features of one or more tables that were detected in the image of the document. (Morariu − [0058] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). Thus, for example, an element classification can include a text block, choice group, table, or field.)

to generate a document data representation for the document; (Morariu − [0066] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 210 of generating a digital document hierarchy. For example, the document hierarchy generation system 104 generates a digital document hierarchy by combining the determined child visual elements and visual elements determined at each layer-wise analysis.)

Regarding independent claim 12: claim 12 is directed to non-transitory storage media and has similar/same technical features/limitations as claim 1. Claim 12 is rejected under the same rationale.

Regarding dependent claim 14, which depends on claim 12, Morariu teaches: further comprising: based on the image, (Morariu − [0048-0049] For example, visual elements include graphical components, items, or portions of a digital document image.)

identifying a set of words extracted from the image; generating the plurality of word embeddings based on the set of words; based on the image, (Morariu − [0054] More specifically, the document hierarchy generation system 104 utilizes tokens extracted using OCR as input for a trained natural language processing model to generate semantic feature embeddings. For example, semantic feature embeddings are feature embeddings that reflect semantic meaning of text in a digital document image. For example, semantic feature embeddings include a vector of semantic features generated by a machine learning model, such as a neural network. The document hierarchy generation system 104 can utilize a variety of language machine learning models. [0058] FIG. 2: in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements).
Thus, for example, an element classification can include a text block, choice group, table, or field.)

identifying the set of layout features of the document; (Morariu − [0074] As previously mentioned, document hierarchy generation system 104 generates spatial feature embeddings 306 from positional coordinates. Specifically, the document hierarchy generation system 104 generates spatial feature embeddings from the bounding boxes representing the child elements. In some embodiments, document hierarchy generation system 104 extracts the bounding box coordinates to derive the relative layout information for each child element.)

generating the one or more layout embeddings based on the set of layout features. (Morariu − [0066] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 210 of generating a digital document hierarchy. For example, the document hierarchy generation system 104 generates a digital document hierarchy by combining the determined child visual elements and visual elements determined at each layer-wise analysis.)

Regarding independent claim 15, which is directed to one or more non-transitory storage media storing instructions which, when executed by one or more computing devices, (Morariu − [0141] FIGS. 1-8, the corresponding text, and the examples provide a number of different methods, systems, devices, and non-transitory computer-readable media of the document hierarchy generation system 104. [0158] In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions, from a non-transitory computer-readable medium (e.g., memory), and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.) Claim 15 has similar/same technical features/limitations as claim 1, and the claims are rejected under the same rationale.

Regarding dependent claim 16, which depends on claim 15, Morariu teaches: wherein the set of table features for a table detected in the image of the document includes two or more of: coordinates of the table, a width of the table, a height of the table, a number of columns in the table, or a number of rows in the table. (Morariu − [0024] For instance, the document hierarchy generation system uses spatial elements of the digital document image by using positional coordinates of visual elements within the digital document image to generate spatial feature embeddings. [0076] For example, the bounding box of each child element is represented through its upper-left ([x.sub.1, y.sub.1]) and bottom-right ([x.sub.2, y.sub.2]) co-ordinates that are normalized b=[x.sub.1/W, y.sub.1/H, x.sub.2/W, y.sub.2/H], where H and W are the height and width of digital document image 302. The height and width of visual elements such as the tables within the document, and the positional coordinates of visual elements, read on these table features.)

Regarding dependent claim 17, which depends on claim 15, Morariu teaches: further comprising: based on the image, (Morariu − [0048-0049] For example, visual elements include graphical components, items, or portions of a digital document image.)

determining a set of layout features of the document; (Morariu − [0074] As previously mentioned, document hierarchy generation system 104 generates spatial feature embeddings 306 from positional coordinates. Specifically, the document hierarchy generation system 104 generates spatial feature embeddings from the bounding boxes representing the child elements. In some embodiments, document hierarchy generation system 104 extracts the bounding box coordinates to derive the relative layout information for each child element.)

generating one or more layout embeddings based on the set of layout features of the document; (Morariu − [0066] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 210 of generating a digital document hierarchy. For example, the document hierarchy generation system 104 generates a digital document hierarchy by combining the determined child visual elements and visual elements determined at each layer-wise analysis.)

wherein inputting comprises inputting the one or more layout embeddings into the machine-learned model; (Morariu − [0054] More specifically, the document hierarchy generation system 104 utilizes tokens extracted using OCR as input for a trained natural language processing model to generate semantic feature embeddings. For example, semantic feature embeddings are feature embeddings that reflect semantic meaning of text in a digital document image. For example, semantic feature embeddings include a vector of semantic features generated by a machine learning model, such as a neural network. The document hierarchy generation system 104 can utilize a variety of language machine learning models. [0058] FIG. 2: in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). Thus, for example, an element classification can include a text block, choice group, table, or field.)

wherein the document data representation is generated also based on the one or more layout embeddings. (Morariu − [0066] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 210 of generating a digital document hierarchy. For example, the document hierarchy generation system 104 generates a digital document hierarchy by combining the determined child visual elements and visual elements determined at each layer-wise analysis.)

Regarding dependent claim 21, which depends on claim 15, Morariu teaches: identifying one or more image embeddings that were generated based on a set of image features of the image of the document; (Morariu − [0048-0049] As indicated, the document hierarchy generation system 104 extracts visual elements from a digital document image and uses the visual elements to generate feature embeddings. For example, visual elements include graphical components, items, or portions of a digital document image. To illustrate, visual elements include graphical components such as images, tables, or columns. In some implementations, the document hierarchy generation system utilizes boxes or other shapes (e.g., bounding shapes) to define visual elements. For example, the document hierarchy generation system determines a visual element by extracting a bounding box representing the visual element.)

wherein inputting comprises inputting the one or more image embeddings into the machine-learned model; (Morariu − [0054] More specifically, the document hierarchy generation system 104 utilizes tokens extracted using OCR as input for a trained natural language processing model to generate semantic feature embeddings. For example, semantic feature embeddings are feature embeddings that reflect semantic meaning of text in a digital document image. For example, semantic feature embeddings include a vector of semantic features generated by a machine learning model, such as a neural network.
The document hierarchy generation system 104 can utilize a variety of language machine learning models; [0058] FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). Thus, for example, an element classification can include a text block, choice group, table, or field.) wherein the document embedding is generated also based on the one or more image embeddings. (Morariu − [0066] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 210 of generating a digital document hierarchy. For example, the document hierarchy generation system 104 generates a digital document hierarchy by combining the determined child visual elements and visual elements determined at each layer-wise analysis.) Claim(s) 4-8, 13, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Morariu as applied to claims 1, 12 and 15 above, and further in view of Pena Pena, K. (US PAT: US 11861884 B1, Filed Date: Apr. 10, 2023, hereinafter “Pena”). Regarding dependent claim 4, depends on claim 1, Morariu teaches: wherein the machine-learned model is a first model, comprising: automatically generating a plurality of labels, each label for a training instance of a plurality of training instances; (Morariu − [0058] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. 
For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). [0106] For example, the document hierarchy generation system 104 can utilize a training dataset that comprises a ground truth document hierarchy composed of elements arranged in a tree-like format where a higher element (e.g., a list or field) may be a parent of one or more basic elements (e.g., OCR extracted text). The document hierarchy can include a variety of element classifications, such as Widget, TableRow, ChoiceGroup, Footer, Section, ListItem, Table, TextRun, TableCell, TextBlock, List, Image, Field, Form, or Header.) training the first model based on the plurality of training instances. (Morariu − [0029] For instance, the document hierarchy generation system uses a training dataset comprising a digital document image portraying a plurality of visual elements and further including ground-truth parent-child relationships and ground-truth element classifications and uses that dataset to generate predicted element classifications and predicted parent-child element links. [0059] machine learning model can include a computer algorithm with branches, weights, or parameters that change based on training data to improve for a particular task. Retraining to improve a particular task within the model (first model).) wherein the second model predicts whether an input text item is from a table in a document; (Morariu − [0029] For instance, the document hierarchy generation system uses a training dataset comprising a digital document image portraying a plurality of visual elements and further including ground-truth parent-child relationships and ground-truth element classifications and uses that dataset to generate predicted element classifications and predicted parent-child element links. 
In some instances, the document hierarchy generation system modifies the parameters of the neural network by comparing the predicted element classifications with the ground truth classifications and comparing the parent-child element links with the ground truth parent-child element relationships. Predicting the parent-child element links with the ground truth parent-child element relationship, that text is from a table. Retraining a model is a second model.) Morariu does not explicitly teach: wherein each label of the plurality of labels indicates whether a text item in a corresponding training instance is from a table in a document; wherein an output of the first model is input to a second model that is different than the first model; However, Pena teaches: wherein each label of the plurality of labels indicates whether a text item in a corresponding training instance is from a table in a document; (Pena − [Col. 3 ll. 40-45] Fig. 3, a second multi-modal transformer model, which, after being fully trained, can be used to create pseudo-labels for the in-domain unlabeled data) wherein an output of the first model is input to a second model that is different than the first model; (Pena − [Col. 8 ll. 19-35] Fig. 3, output of the first multimodal transformation model is inputted into second multimodal transformation model) Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Morariu, Wu, and Pena, as each invention relates to determining document layout, structure, and content within a digital document. Adding the teaching of Pena provides Morariu with an iterative training approach for improving the classification and labeling of structured data. One of ordinary skill in the art would have been motivated to improve machine learning model performance and the accuracy of classifying data within a document.
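The two-stage arrangement recited in claim 4, a first model whose output is fed into a second model that predicts whether a text item comes from a table, can be illustrated with a minimal sketch. Everything below is hypothetical: the function names, the hand-set weights, and the crude surrogate features stand in for the learned multimodal models described in the cited references.

```python
import numpy as np

def first_model(text_item: str) -> np.ndarray:
    # Hypothetical "first model": maps a text item to a small feature
    # vector. A real system would use a learned layout-aware embedding;
    # these surrogate features are illustrative only.
    return np.array([
        float(len(text_item)),                      # item length
        float(text_item.count("|")),                # delimiter count
        float(sum(c.isdigit() for c in text_item)), # digit count
    ])

def second_model(features: np.ndarray) -> bool:
    # Hypothetical "second model": consumes the first model's output and
    # predicts whether the text item came from a table. Hand-set weights
    # stand in for trained parameters.
    weights = np.array([0.0, 2.0, 0.5])
    bias = -1.0
    return float(weights @ features + bias) > 0.0

def is_table_text(text_item: str) -> bool:
    # The output of the first model is the input to the second model,
    # mirroring the claimed first-model/second-model relationship.
    return second_model(first_model(text_item))

print(is_table_text("Q1 | 42 | 17"))                   # table-row-like text
print(is_table_text("This is a narrative sentence."))  # prose
```

The point of the sketch is only the dataflow: the two models are distinct, and the second consumes the first's output rather than the raw document.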
Regarding dependent claim 5, which depends on claim 4, Morariu teaches: wherein each label of the plurality of labels indicates whether the text item is from a header in a table in the document, from content of a table in the document, or not from any table in the document. (Morariu − [0058] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). Thus, for example, an element classification can include a text block, choice group, table, or field. [0106] For example, the document hierarchy generation system 104 can utilize a training dataset that comprises a ground truth document hierarchy composed of elements arranged in a tree-like format where a higher element (e.g., a list or field) may be a parent of one or more basic elements (e.g., OCR extracted text). The document hierarchy can include a variety of element classifications, TableRow, Table, TableCell or Header. TableRow, Table, TableCell or Header are a set of table features. NOTE: Element type classification and parent-child element link for content of a table) Regarding dependent claim 6, which depends on claim 4, Morariu teaches: automatically making a determination of whether the text item is part of a table in said each sectioned document; (Morariu − [0058] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element.
For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). Thus, for example, an element classification can include a text block, choice group, table, or field. [0106] For example, the document hierarchy generation system 104 can utilize a training dataset that comprises a ground truth document hierarchy composed of elements arranged in a tree-like format where a higher element (e.g., a list or field) may be a parent of one or more basic elements (e.g., OCR extracted text). The document hierarchy can include a variety of element classifications, TableRow, Table, TableCell or Header. TableRow, Table, TableCell or Header are a set of table features. NOTE: Element type classification and parent-child element link for content of a table) Morariu does not explicitly teach: further comprising, prior to training the second model: for each sectioned document of a plurality of sectioned documents: selecting a text item from said each sectioned document; However, Pena teaches: further comprising, prior to training the second model: for each sectioned document of a plurality of sectioned documents: selecting a text item from said each sectioned document; (Pena – [Col. 3 ll. 25-35] Fig. 3, The method includes pre-training a first multimodal transformer model on an unlabeled dataset comprising documents including text features and layout features, training a second multimodal transformer model on a first labeled dataset comprising documents including text features and layout features to perform a key information extraction task, and processing the unlabeled dataset with the second multimodal transformer model to generate pseudo-labels for the unlabeled dataset.) generating a training instance based on the determination. (Pena − [Col. 3 ll. 40-45] Fig. 
3, a second multi-modal transformer model, which, after being fully trained, can be used to create pseudo-labels for the in-domain unlabeled data) Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Morariu, Wu, and Pena, as each invention relates to determining document layout, structure, and content within a digital document. Adding the teaching of Pena provides Morariu with an iterative training approach for improving the classification and labeling of structured data. One of ordinary skill in the art would have been motivated to improve machine learning model performance and the accuracy of classifying data within a document. Regarding dependent claim 7, which depends on claim 4, Morariu teaches: wherein the plurality of training instances is a first plurality of training instances, (Morariu − [0029] For instance, the document hierarchy generation system uses a training dataset comprising a digital document image portraying a plurality of visual elements and further including ground-truth parent-child relationships and ground-truth element classifications and uses that dataset to generate predicted element classifications and predicted parent-child element links. [0059] machine learning model can include a computer algorithm with branches, weights, or parameters that change based on training data to improve for a particular task. Retraining to improve a particular task within the model (first model).) Morariu does not explicitly teach: further comprising: after training the first model based on the first plurality of training instances, training, based on a second plurality of training instances that is different than the first plurality of training instances, a third model that takes output of the first model as input; wherein the third model has a first task objective that is different than a second task objective of the second model.
However, Pena teaches: further comprising: after training the first model based on the first plurality of training instances, training, based on a second plurality of training instances that is different than the first plurality of training instances, a third model that takes output of the first model as input; wherein the third model has a first task objective that is different than a second task objective of the second model. (Pena – [Col. 3 ll. 25-35] Fig. 3, The method includes pre-training a first multimodal transformer model on an unlabeled dataset comprising documents including text features and layout features, training a second multimodal transformer model on a first labeled dataset comprising documents including text features and layout features to perform a key information extraction task, and processing the unlabeled dataset with the second multimodal transformer model to generate pseudo-labels for the unlabeled dataset. processing the unlabeled dataset with the third multimodal transformer model to update the pseudo-labels for the unlabeled dataset, and training the third multimodal transformer model using a noise-aware loss function and the updated pseudo-labels to generate an updated third multimodal transformer model.) Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Morariu, Wu, and Pena, as each invention relates to determining document layout, structure, and content within a digital document. Adding the teaching of Pena provides Morariu with an iterative training approach for improving the classification and labeling of structured data. One of ordinary skill in the art would have been motivated to improve machine learning model performance and the accuracy of classifying data within a document.
Regarding dependent claim 8, which depends on claim 7, Morariu does not explicitly teach: further comprising updating weights of the first model based on the training of the third model based on the second plurality of training instances. However, Pena teaches: further comprising updating weights of the first model based on the training of the third model based on the second plurality of training instances. (Pena – [Col. 3 ll. 25-35] Fig. 3, The method includes pre-training a first multimodal transformer model on an unlabeled dataset comprising documents including text features and layout features, training a second multimodal transformer model on a first labeled dataset comprising documents including text features and layout features to perform a key information extraction task, and processing the unlabeled dataset with the second multimodal transformer model to generate pseudo-labels for the unlabeled dataset. processing the unlabeled dataset with the third multimodal transformer model to update the pseudo-labels for the unlabeled dataset, and training the third multimodal transformer model using a noise-aware loss function and the updated pseudo-labels to generate an updated third multimodal transformer model.) Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Morariu, Wu, and Pena, as each invention relates to determining document layout, structure, and content within a digital document. Adding the teaching of Pena provides Morariu with an iterative training approach for improving the classification and labeling of structured data. One of ordinary skill in the art would have been motivated to improve machine learning model performance and the accuracy of classifying data within a document.
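The pseudo-labeling workflow that the Pena citations describe, in which a trained model generates pseudo-labels for unlabeled data and a further model is trained on those labels and then refreshes them, can be sketched as follows. This is an illustrative toy, not Pena's implementation: the 1-D data, the threshold "models", and all names are hypothetical, and Pena's noise-aware loss is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D stand-in for labeled documents: feature > 0 means class 1.
labeled_x = np.array([-2.0, -1.0, 1.0, 2.0])
labeled_y = np.array([0, 0, 1, 1])
# Unlabeled in-domain data to be pseudo-labeled.
unlabeled_x = rng.normal(loc=0.0, scale=2.0, size=50)

def train_threshold(x: np.ndarray, y: np.ndarray) -> float:
    # "Training" is just the midpoint between class means, a stand-in
    # for fitting a multimodal transformer on (features, labels).
    return float((x[y == 0].mean() + x[y == 1].mean()) / 2.0)

def predict(threshold: float, x: np.ndarray) -> np.ndarray:
    return (x > threshold).astype(int)

# A model trained on the labeled set generates pseudo-labels for the
# unlabeled set (the "second model" role in the Pena quotation).
t_second = train_threshold(labeled_x, labeled_y)
pseudo = predict(t_second, unlabeled_x)

# A further model is trained on the pseudo-labeled data and then
# refreshes the pseudo-labels (the "third model" role).
t_third = train_threshold(unlabeled_x, pseudo)
updated_pseudo = predict(t_third, unlabeled_x)
```

The design point is the iteration: each round's model is fit on the previous round's pseudo-labels and then regenerates them, which is what lets unlabeled in-domain documents contribute to training.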
Regarding dependent claim 13, which depends on claim 12, Morariu teaches: wherein the machine-learned model is a first model, the method further comprising: automatically generating a plurality of labels, each label for a training instance of a plurality of training instances; (Morariu − [0058] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). [0106] For example, the document hierarchy generation system 104 can utilize a training dataset that comprises a ground truth document hierarchy composed of elements arranged in a tree-like format where a higher element (e.g., a list or field) may be a parent of one or more basic elements (e.g., OCR extracted text). The document hierarchy can include a variety of element classifications, such as Widget, TableRow, ChoiceGroup, Footer, Section, ListItem, Table, TextRun, TableCell, TextBlock, List, Image, Field, Form, or Header.) training the second model based on the plurality of training instances. (Morariu − [0029] For instance, the document hierarchy generation system uses a training dataset comprising a digital document image portraying a plurality of visual elements and further including ground-truth parent-child relationships and ground-truth element classifications and uses that dataset to generate predicted element classifications and predicted parent-child element links. [0059] machine learning model can include a computer algorithm with branches, weights, or parameters that change based on training data to improve for a particular task. Retraining to improve a particular task within the model (first model).)
wherein the second model predicts whether an input pair of text items are from a same section in a document; (Morariu − [0029] For instance, the document hierarchy generation system uses a training dataset comprising a digital document image portraying a plurality of visual elements and further including ground-truth parent-child relationships and ground-truth element classifications and uses that dataset to generate predicted element classifications and predicted parent-child element links. In some instances, the document hierarchy generation system modifies the parameters of the neural network by comparing the predicted element classifications with the ground truth classifications and comparing the parent-child element links with the ground truth parent-child element relationships. Predicting the parent-child element links with the ground truth parent-child element relationship, that text is from a table. Retraining a model is a second model.) Morariu does not explicitly teach: wherein each label of the plurality of labels indicates whether a pair of text items are from a same section in a document; wherein an output of the first model is input to a second model that is different than the machine-learned model; However, Pena teaches: wherein each label of the plurality of labels indicates whether a pair of text items are from a same section in a document; (Pena − [Col. 3 ll. 40-45] Fig. 3, a second multi-modal transformer model, which, after being fully trained, can be used to create pseudo-labels for the in-domain unlabeled data) wherein an output of the first model is input to a second model that is different than the machine-learned model; (Pena − [Col. 8 ll. 19-35] Fig.
3, output of the first multimodal transformation model is inputted into second multimodal transformation model) Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Morariu, Wu, and Pena, as each invention relates to determining document layout, structure, and content within a digital document. Adding the teaching of Pena provides Morariu with an iterative training approach for improving the classification and labeling of structured data. One of ordinary skill in the art would have been motivated to improve machine learning model performance and the accuracy of classifying data within a document. Regarding dependent claim 18, which depends on claim 15, Morariu teaches: wherein the machine-learned model is a first model, wherein the instructions, when executed by the one or more computing devices, further cause: automatically generating a plurality of labels, each label for a training instance of a plurality of training instances; (Morariu − [0058] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element. For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). [0106] For example, the document hierarchy generation system 104 can utilize a training dataset that comprises a ground truth document hierarchy composed of elements arranged in a tree-like format where a higher element (e.g., a list or field) may be a parent of one or more basic elements (e.g., OCR extracted text).
The document hierarchy can include a variety of element classifications, such as Widget, TableRow, ChoiceGroup, Footer, Section, ListItem, Table, TextRun, TableCell, TextBlock, List, Image, Field, Form, or Header.) training the first model based on the plurality of training instances. (Morariu − [0029] For instance, the document hierarchy generation system uses a training dataset comprising a digital document image portraying a plurality of visual elements and further including ground-truth parent-child relationships and ground-truth element classifications and uses that dataset to generate predicted element classifications and predicted parent-child element links. [0059] machine learning model can include a computer algorithm with branches, weights, or parameters that change based on training data to improve for a particular task. Retraining to improve a particular task within the model (first model).) wherein the second model predicts whether an input text item is from a table in a document; (Morariu − [0029] For instance, the document hierarchy generation system uses a training dataset comprising a digital document image portraying a plurality of visual elements and further including ground-truth parent-child relationships and ground-truth element classifications and uses that dataset to generate predicted element classifications and predicted parent-child element links. In some instances, the document hierarchy generation system modifies the parameters of the neural network by comparing the predicted element classifications with the ground truth classifications and comparing the parent-child element links with the ground truth parent-child element relationships. Predicting the parent-child element links with the ground truth parent-child element relationship, that text is from a table. Retraining a model is a second model.) 
Morariu does not explicitly teach: wherein each label of the plurality of labels indicates whether a text item in a corresponding training instance is from a table in a document; wherein an output of the first model is input to a second model that is different than the first model; However, Pena teaches: wherein each label of the plurality of labels indicates whether a text item in a corresponding training instance is from a table in a document; (Pena − [Col. 3 ll. 40-45] Fig. 3, a second multi-modal transformer model, which, after being fully trained, can be used to create pseudo-labels for the in-domain unlabeled data) wherein an output of the first model is input to a second model that is different than the first model; (Pena − [Col. 8 ll. 19-35] Fig. 3, output of the first multimodal transformation model is inputted into second multimodal transformation model) Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Morariu, Wu, and Pena, as each invention relates to determining document layout, structure, and content within a digital document. Adding the teaching of Pena provides Morariu with an iterative training approach for improving the classification and labeling of structured data. One of ordinary skill in the art would have been motivated to improve machine learning model performance and the accuracy of classifying data within a document. Regarding dependent claim 19, which depends on claim 18, Morariu teaches: automatically making a determination of whether the text item is part of a table in said each sectioned document; (Morariu − [0058] As further illustrated in FIG. 2, in some embodiments, the document hierarchy generation system 104 performs an act 206 of generating element type classifications and parent-child element link probabilities. In one or more embodiments, element classifications refer to a label, type, or name assigned to a visual element.
For example, a neural network can generate or assign element classifications for candidate parent visual elements (or selected parent visual elements). Thus, for example, an element classification can include a text block, choice group, table, or field. [0106] For example, the document hierarchy generation system 104 can utilize a training dataset that comprises a ground truth document hierarchy composed of elements arranged in a tree-like format where a higher element (e.g., a list or field) may be a parent of one or more basic elements (e.g., OCR extracted text). The document hierarchy can include a variety of element classifications, TableRow, Table, TableCell or Header. TableRow, Table, TableCell or Header are a set of table features. NOTE: Element type classification and parent-child element link for content of a table) Morariu does not explicitly teach: wherein the instructions, when executed by the one or more computing devices, further cause, prior to training the second model: for each sectioned document of a plurality of sectioned documents: selecting a text item from said each sectioned document; However, Pena teaches: wherein the instructions, when executed by the one or more computing devices, further cause, prior to training the second model: for each sectioned document of a plurality of sectioned documents: selecting a text item from said each sectioned document; (Pena – [Col. 3 ll. 25-35] Fig. 3, The method includes pre-training a first multimodal transformer model on an unlabeled dataset comprising documents including text features and layout features, training a second multimodal transformer model on a first labeled dataset comprising documents including text features and layout features to perform a key information extraction task, and processing the unlabeled dataset with the second multimodal transformer model to generate pseudo-labels for the unlabeled dataset.) generating a training instance based on the determination. (Pena − [Col. 3 ll. 
40-45] Fig. 3, a second multi-modal transformer model, which, after being fully trained, can be used to create pseudo-labels for the in-domain unlabeled data) Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Morariu, Wu, and Pena, as each invention relates to determining document layout, structure, and content within a digital document. Adding the teaching of Pena provides Morariu with an iterative training approach for improving the classification and labeling of structured data. One of ordinary skill in the art would have been motivated to improve machine learning model performance and the accuracy of classifying data within a document. Regarding dependent claim 20, which depends on claim 18, Morariu teaches: wherein the plurality of training instances is a first plurality of training instances, (Morariu − [0029] For instance, the document hierarchy generation system uses a training dataset comprising a digital document image portraying a plurality of visual elements and further including ground-truth parent-child relationships and ground-truth element classifications and uses that dataset to generate predicted element classifications and predicted parent-child element links. [0059] machine learning model can include a computer algorithm with branches, weights, or parameters that change based on training data to improve for a particular task. Retraining to improve a particular task within the model (first model).)
Morariu does not explicitly teach: wherein the instructions, when executed by the one or more computing devices, further cause: after training the first model based on the first plurality of training instances, training, based on a second plurality of training instances that is different than the first plurality of training instances, a third model that takes output of the first model as input; wherein the third model has a first task objective that is different than a second task objective of the second model. However, Pena teaches: wherein the instructions, when executed by the one or more computing devices, further cause: after training the first model based on the first plurality of training instances, training, based on a second plurality of training instances that is different than the first plurality of training instances, a third model that takes output of the first model as input; wherein the third model has a first task objective that is different than a second task objective of the second model. (Pena – [Col. 3 ll. 25-35] Fig. 3, The method includes pre-training a first multimodal transformer model on an unlabeled dataset comprising documents including text features and layout features, training a second multimodal transformer model on a first labeled dataset comprising documents including text features and layout features to perform a key information extraction task, and processing the unlabeled dataset with the second multimodal transformer model to generate pseudo-labels for the unlabeled dataset. processing the unlabeled dataset with the third multimodal transformer model to update the pseudo-labels for the unlabeled dataset, and training the third multimodal transformer model using a noise-aware loss function and the updated pseudo-labels to generate an updated third multimodal transformer model.) 
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Morariu, Wu, and Pena, as each invention relates to determining document layout, structure, and content within a digital document. Adding the teaching of Pena provides Morariu with an iterative training approach for improving the classification and labeling of structured data. One of ordinary skill in the art would have been motivated to improve machine learning model performance and the accuracy of classifying data within a document. Conclusion Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to CARL E BARNES JR whose telephone number is (571)270-3395. The examiner can normally be reached Monday-Friday 9am-6pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Hong can be reached at (571) 272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /CARL E BARNES JR/Examiner, Art Unit 2178 /STEPHEN S HONG/Supervisory Patent Examiner, Art Unit 2178

Prosecution Timeline

Jun 15, 2023
Application Filed
Sep 02, 2025
Non-Final Rejection — §103
Nov 21, 2025
Response Filed
Nov 21, 2025
Applicant Interview (Telephonic)
Nov 21, 2025
Examiner Interview Summary
Mar 04, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12584932
SLIDE IMAGING APPARATUS AND A METHOD FOR IMAGING A SLIDE
2y 5m to grant Granted Mar 24, 2026
Patent 12541640
COMPUTING DEVICE FOR MULTIPLE CELL LINKING
2y 5m to grant Granted Feb 03, 2026
Patent 12536464
SYSTEM FOR CONSTRUCTING EFFECTIVE MACHINE-LEARNING PIPELINES WITH OPTIMIZED OUTCOMES
2y 5m to grant Granted Jan 27, 2026
Patent 12530765
SYSTEMS AND METHODS FOR CALCIUM-FREE COMPUTED TOMOGRAPHY ANGIOGRAPHY
2y 5m to grant Granted Jan 20, 2026
Patent 12530523
METHOD, APPARATUS, SYSTEM, AND COMPUTER PROGRAM FOR CORRECTING TABLE COORDINATE INFORMATION
2y 5m to grant Granted Jan 20, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
32%
Grant Probability
57%
With Interview (+25.2%)
4y 4m
Median Time to Grant
Moderate
PTA Risk
Based on 202 resolved cases by this examiner. Grant probability derived from career allow rate.
