Prosecution Insights
Last updated: April 19, 2026
Application No. 18/684,557

STRUCTURAL ENCODING AND ATTENTION PARADIGMS FOR SEQUENCE MODELING

Non-Final OA §103
Filed: Feb 16, 2024
Examiner: ADU-JAMFI, WILLIAM NMN
Art Unit: 2677
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 1 (Non-Final)
Grant Probability: Favorable
OA Rounds: 1-2
To Grant: 2y 9m

Examiner Intelligence

Career Allow Rate: 0% (grants 0% of cases; 0 granted / 0 resolved; -62.0% vs TC avg)
Interview Lift: +0.0% (minimal lift; based on resolved cases with interview)
Avg Prosecution: 2y 9m (typical timeline)
Career History: 25 total applications across all art units (25 currently pending)

Statute-Specific Performance

§101: 19.5% (-20.5% vs TC avg)
§103: 36.8% (-3.2% vs TC avg)
§102: 28.7% (-11.3% vs TC avg)
§112: 14.9% (-25.1% vs TC avg)
Deltas are relative to the estimated Tech Center average • Based on career data from 0 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 and 3-9 are rejected under 35 U.S.C. 103 as being unpatentable over Zuo et al. (“Context-Specific Heterogeneous Graph Convolutional Network for Implicit Sentiment Analysis”) in view of Xu et al. (“LayoutLM: Pre-training of Text and Layout for Document Image Understanding”).

Regarding Claim 1, Zuo teaches a processing system comprising: one or more processors coupled to the memory and configured to classify text from a given document, comprising: generating a beta-skeleton graph based on a plurality of tokens, each given token of the plurality of tokens corresponding to a given string of text in the given document, and wherein the beta-skeleton graph comprises, for each given token (Zuo: Introduction, III. CONTEXT-SPECIFIC HETEROGENEOUS GRAPH CONVOLUTIONAL NETWORK, and Fig. 2 (shown below)); [Figures: media_image1.png and media_image2.png, reproductions of Zuo's Fig. 2] Introduction: “In the heterogeneous graph, the nodes of the graph are composed of tokens or sentences… the whole context at the document level is considered a heterogeneous graph.” A.
WORD REPRESENTATION AND BIDIRECTIONAL GRU CODING: “In this component, each sentence is represented as S = {w1, w2, . . . , wi, . . . , wn}, where wi is the token.” E. SENTIMENT CLASSIFICATION AND TRAIN: “The obtained vector r is fed to the fully connected layer, and the input result is classified into three categories using softmax.” a node corresponding to the given token and comprising a vector based on content and location of the given string of text within the given document (Zuo: Introduction and III. CONTEXT-SPECIFIC HETEROGENEOUS GRAPH CONVOLUTIONAL NETWORK); A. WORD REPRESENTATION AND BIDIRECTIONAL GRU CODING: “Each token in the sentence is mapped to a low-latitude vector space to obtain a word embedding matrix E ∈ R^(n×de), where n is the size of the vocabulary and de is the dimension of the word vector.” Introduction: “First, we separate the emotional target sentence from its context in a document and then represent all the remaining context text as a heterogeneous graph.” B. CONTEXTUAL-SPECIFIC GRAPH: “To relate the dependency tree of each sentence, we incorporate sentence order as a feature to represent the relationship between sentence nodes.” Explanation: Each token is encoded as a vector representation, where context and location are incorporated and sentence order explicitly encodes location. one or more edges, each edge of the one or more edges linking the node corresponding to the given token to a neighboring node corresponding to another token of the plurality of tokens (Zuo: III. CONTEXT-SPECIFIC HETEROGENEOUS GRAPH CONVOLUTIONAL NETWORK and Fig. 2 (shown above)); B. CONTEXTUAL-SPECIFIC GRAPH: “DT(i, j) is the dependency relationship between the token nodes i and j in the dependency tree…A ∈ R^(n×n) is the adjacency matrix of graph G, which is composed of the relationship between each node.” Explanation: Token-to-token edges via dependency relationships and explicit adjacency matrix definition.
Figure 2 shows token-token edges (dependency edges) connecting neighboring tokens. generating, using the graph convolutional network, a plurality of supertokens based on the beta-skeleton graph, each given supertoken of the plurality of supertokens being based at least in part on the vector of a given node and the vector of each neighboring node to which the given node is linked via one of its one or more edges (Zuo: III. CONTEXT-SPECIFIC HETEROGENEOUS GRAPH CONVOLUTIONAL NETWORK); C. CONVOLUTIONAL OVER THE HETEROGENEOUS GRAPH: “A single-layer GCN can only rely on one layer of convolutions to obtain the information of neighbor nodes, and by deepening the layers of the GCN, it can integrate the knowledge of a broader neighborhood…a two-layered GCN can allow information passing between nodes that are within two steps away.” Explanation: The GCN aggregates information from neighboring token vectors, and node embeddings are updated using neighbor information. The resulting GCN-updated node embeddings correspond to the claimed supertokens (aggregated representations of a token and its neighbors). Since Zuo uses only a GCN and not a transformer, it does not teach a memory storing a neural network comprising a graph convolutional network and a transformer, generating, using the transformer, a plurality of predictions based on the plurality of supertokens, and generating a set of classifications based on the plurality of predictions, the set of classifications identifying at least one entity class corresponding to at least one token of the plurality of tokens.
However, Xu teaches a transformer-based neural network for document understanding that generates predictions and classifications for document tokens, stating that “these input embeddings are passed through a multi-layer bidirectional Transformer that can generate contextualized representations with an adaptive attention mechanism” (2.1 The BERT Model) and “LayoutLM predicts {B, I, E, S, O} tags for each token and uses sequential labeling to detect each type of entity in the dataset” (2.5 Fine-tuning LayoutLM). Xu further teaches that incorporating layout (spatial) information together with textual content improves document classification and entity classification. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Xu’s transformer with Zuo’s GCN. Xu identifies limitations in prior document understanding systems that rely on either textual modeling alone or isolated structural techniques, noting that effective document understanding requires joint modeling of textual content and layout relationships. Xu further teaches that transformer-based contextual modeling benefits from richer token representations that encode structural relationships between tokens, which improves downstream classification and entity recognition performance. In view of these teachings, one of ordinary skill in the art would have been motivated to combine the token-relationship modeling of Zuo with the transformer-based prediction framework of Xu in order to leverage both graph-based structural context and transformer-based contextual reasoning, thereby improving document classification and entity extraction accuracy. 
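To make the examiner's mapping concrete, the GCN step Zuo is cited for, updating each token's vector from its neighbors' vectors (which the rejection reads onto the claimed "supertokens"), can be sketched as follows. This is an illustrative toy, not code from Zuo or Xu; the `gcn_layer` name, array sizes, and random inputs are assumptions for demonstration only.

```python
import numpy as np

def gcn_layer(node_vecs, adj, weight):
    """One GCN layer: each node's output mixes its own vector with its
    neighbors' vectors via the adjacency matrix (cf. Zuo's matrix A)."""
    a_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    norm = a_hat / a_hat.sum(axis=1, keepdims=True)    # row-normalize
    return np.maximum(norm @ node_vecs @ weight, 0.0)  # ReLU activation

# Toy graph: 4 token nodes, 3-dim vectors, symmetric edge set.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 3))
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
w = rng.normal(size=(3, 3))
supertokens = gcn_layer(vecs, adj, w)  # one aggregated vector per token
```

Each output row depends on the corresponding token's vector and those of its graph neighbors, which is the aggregation property the rejection equates with supertoken generation.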
Regarding Claim 3, Zuo in view of Xu teaches the processing system of claim 1, but Zuo does not teach that the beta-skeleton graph further comprises, for each given token: a given edge embedding corresponding to each given edge of the one or more edges, the given edge embedding being based on a spatial relationship in the given document between the given token and a token corresponding to the neighboring node to which the given edge is linked. Zuo teaches edges with weights representing relationships, providing edge representations per token-neighbor relationship, but does not teach spatial relationship information. However, Xu teaches that each token is associated with explicit spatial layout information, including 2-D positional coordinates (bounding boxes), and that such spatial relationships are critical for document understanding, stating that “2-D position embedding aims to model the relative spatial position in a document…the bounding box can be precisely defined by (x0, y0, x1, y1), where (x0, y0) corresponds to the position of the upper left in the bounding box, and (x1, y1) represents the position of the lower right” (2.3 Model Architecture). Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Xu’s spatial relationship information into Zuo’s GCN system. Xu teaches that document semantics depend not only on textual content but also on spatial arrangement, and that modeling spatial proximity between tokens improves structural understanding of documents. Accordingly, a person of ordinary skill in the art would have been motivated to incorporate spatially defined token-token edges, such as proximity-based graphs including beta-skeleton graphs, into the graph structure of Zuo, and to represent those edges using edge embeddings based on spatial relationships, in order to leverage Xu’s spatial layout signals within a graph-based document representation.
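The kind of spatially based edge embedding the rejection contemplates, an edge feature derived from two tokens' bounding boxes in Xu's (x0, y0, x1, y1) convention, could look roughly like the following. The specific feature layout (normalized center offsets, distance, angle) is a hypothetical choice for illustration and is not taken from Xu.

```python
import math

def box_center(box):
    """Center of an (x0, y0, x1, y1) bounding box, per Xu's convention."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def edge_embedding(box_a, box_b, page_w, page_h):
    """Toy spatial edge feature for a token pair: page-normalized center
    offset, Euclidean distance, and angle. Purely illustrative."""
    (ax, ay), (bx, by) = box_center(box_a), box_center(box_b)
    dx, dy = (bx - ax) / page_w, (by - ay) / page_h
    return [dx, dy, math.hypot(dx, dy), math.atan2(dy, dx)]

# Two side-by-side tokens on a 612x792-point page.
feat = edge_embedding((10, 10, 50, 30), (60, 10, 100, 30), 612, 792)
```

A graph layer could consume such a feature per edge alongside the node vectors, which is the claimed "edge embedding based on a spatial relationship."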
Regarding Claim 4, Zuo in view of Xu teaches the processing system of claim 1, but Zuo does not teach that the transformer is configured to use a sparse global-local attention paradigm. However, Xu discloses local and global attention behavior, stating that their model uses a “self-attention mechanism within the Transformer” (2.2 The LayoutLM Model) and that “the [CLS] token uses the Faster R-CNN model to produce embeddings using the whole scanned document image as the Region of Interest (ROI) to benefit the downstream tasks which need the representation of the [CLS] token” (2.3 Model Architecture). The local attention operates token-to-token within neighborhoods, and the global attention is provided by the [CLS] token aggregating document-level context. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate a sparse global-local attention paradigm into Zuo’s system. Xu explains that document understanding benefits from selectively attending to relevant tokens based on layout and structure, rather than uniformly attending to all tokens. Accordingly, a person of ordinary skill in the art would have been motivated to configure the transformer to use a sparse global-local attention paradigm, where local attention captures nearby layout-related tokens and global attention captures document-level context, consistent with Xu’s emphasis on efficient, structure-aware attention for document-scale inputs. Regarding Claim 5, Zuo in view of Xu teaches the processing system of claim 4, but Zuo does not teach that the transformer is based on an Extended Transformer Construction architecture.
However, Xu uses an explicit extension of BERT, stating that “inspired by the BERT model [4], where input textual information is mainly represented by text embeddings and position embeddings, LayoutLM further adds two types of input embeddings: (1) a 2-D position embedding that denotes the relative position of a token within a document; (2) an image embedding for scanned token images within a document” (Introduction). Adding new embedding modalities to a transformer encoder constitutes an Extended Transformer Construction architecture. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate this Extended Transformer Construction architecture into Zuo’s system. Xu states that standard transformer architectures must be extended beyond pure text processing to incorporate document layout and structure. Accordingly, a person of ordinary skill in the art would have been motivated to implement Zuo’s system using an Extended Transformer Construction architecture, consistent with Xu’s teaching that transformers should be augmented to support document-specific representations and relational modeling. Regarding Claim 6, Zuo in view of Xu teaches the processing system of claim 1, but Zuo does not teach that the given document comprises an image of a document, and wherein the one or more processors are further configured to identify, for each given token of the plurality of tokens, the content and location of the given string of text in the given document to which the given token corresponds.
However, Xu teaches this limitation, stating that “like the original pre-processing in IIT-CDIP Test Collection, we similarly process the dataset by applying OCR to document images…the difference is that we obtain both the recognized words and their corresponding locations in the document image” (3.3 Document Pre-processing) and that “the bounding box can be precisely defined by (x0, y0, x1, y1), where (x0, y0) corresponds to the position of the upper left in the bounding box, and (x1, y1) represents the position of the lower right” (2.3 Model Architecture). Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Xu’s teachings into Zuo’s system. Xu emphasizes that document understanding requires joint modeling of text and layout extracted from document images, and that token location information is essential for capturing document structure. Accordingly, a person of ordinary skill in the art would have been motivated to apply Xu’s document-image token extraction (content and location) as input into Zuo’s graph-based framework, enabling Zuo to operate on layout-aware token representations as taught by Xu. Regarding Claim 7, Zuo in view of Xu teaches the processing system of claim 6, but Zuo does not teach that identifying the content and location of the given string of text in the given document comprises using optical character recognition. However, Xu teaches this limitation (see 3.3 Document Pre-processing (quoted above in claim 6)). Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate OCR into Zuo’s system. Xu identifies OCR as the mechanism for extracting token text and spatial layout from document images.
Accordingly, a person of ordinary skill in the art would have been motivated to use OCR when supplying document-image tokens to Zuo, because OCR provides the token-level content and layout information that Xu teaches is necessary for effective document understanding. Regarding Claim 8, Zuo in view of Xu teaches the processing system of claim 1, but Zuo does not teach that generating the set of classifications based on the plurality of predictions comprises performing dynamic processing based on the plurality of predictions. However, Xu performs sequence labeling with contextual dependency, stating that “LayoutLM predicts {B, I, E, S, O} tags for each token and uses sequential labeling to detect each type of entity in the dataset” (2.5 Fine-tuning LayoutLM). Sequential labeling involves dynamic processing over multiple predictions, especially when using transformer outputs across tokens. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate dynamic processing into Zuo’s system. Xu teaches that downstream document understanding tasks often require aggregating multiple token-level predictions to produce final outputs. Accordingly, a person of ordinary skill in the art would have been motivated to perform dynamic processing based on a plurality of predictions to combine and refine predictions in a manner consistent with Xu’s document-level inference approach. Regarding Claim 9, Zuo in view of Xu teaches the processing system of claim 1, but Zuo does not teach that the set of classifications based on the plurality of predictions are BIOES classifications. However, Xu teaches this (see 2.5 Fine-tuning LayoutLM (shown in claim 8 above)). Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate BIOES classifications into Zuo’s system. 
Xu teaches applying structured labeling schemes to token-level predictions for document understanding tasks. Accordingly, a person of ordinary skill in the art would have been motivated to use BIOES classifications to label token predictions produced by the combined Zuo-Xu system, as BIOES is a known structured labeling scheme compatible with Xu’s token-level outputs. Claims 2 and 11-19 are rejected under 35 U.S.C. 103 as being unpatentable over Zuo et al. in view of Xu et al., further in view of Shaw et al. (“Self-Attention with Relative Position Representations”). Regarding Claim 2, Zuo in view of Xu teaches the processing system of claim 1, and Xu further teaches that generating the plurality of predictions based on the plurality of supertokens using the transformer comprises, for a given attender supertoken and a given attendee supertoken (Xu: 2.1 The BERT model and 2.2 The LayoutLM Model); 2.1 The BERT model: “It accepts a sequence of tokens and stacks multiple layers to produce final representations.” 2.2 The LayoutLM Model: “Based on the self-attention mechanism within the Transformer, embedding 2-D position features into the language representation will better align the layout information with the semantic representation.” Explanation: Xu uses a Transformer architecture operating on token-level (word-level) representations that may be grouped into higher-level units (supertokens). generating a query vector based on the given attender supertoken (2.2 The LayoutLM Model (quoted above)); Explanation: All standard self-attention transformers fundamentally rely on the Query (Q), Key (K), and Value (V) mechanism. generating a key vector based on the given attendee supertoken (2.2 The LayoutLM Model (quoted above)); Explanation: All standard self-attention transformers fundamentally rely on the Query (Q), Key (K), and Value (V) mechanism.
generating a first attention score based on the query vector and the key vector (2.2 The LayoutLM Model (quoted above)); Explanation: All standard self-attention transformers compute attention scores from query/key dot products. Zuo and Xu fail to teach the remaining limitations of claim 2. However, Shaw teaches them, as shown below: generating a first prediction regarding how the given attender supertoken and given attendee supertoken should be ordered if the given attender supertoken and given attendee supertoken are related to one another (Shaw: Abstract and Section 3.2); Abstract: “In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements.” [Figure: media_image3.png, excerpt of Shaw, Section 3.2] Explanation: Shaw models the signed relative position (j-i) between token pairs, which encodes ordering (before/after). The sign of (j-i) distinguishes whether token j comes before or after token i, which is an ordering prediction. generating a second prediction regarding how far the given attender supertoken should be from the given attendee supertoken if the given attender supertoken and given attendee supertoken are related to one another (Shaw: Section 3.2 (cited above) and Fig. 1); [Figure: media_image4.png, reproduction of Shaw, Fig. 1] Explanation: Shaw explicitly models relative distance magnitude between tokens. The absolute value of (j-i) (after clipping) directly corresponds to how far one token is from another.
generating a first error value based on the first prediction and a value based on how text corresponding to the given attender supertoken and given attendee supertoken is actually ordered in the given document (Shaw: Relation-aware Self-Attention and Section 3.2); [Figures: media_image5.png and media_image6.png, equations from Shaw, Section 3] Relation-aware Self-Attention: “We propose an extension to self-attention to consider the pairwise relationships between input elements.” Explanation: Shaw teaches computing learned relative position representations that are optimized during training, which involves computing discrepancies between predicted relative positions and ground-truth sequence order during parameter optimization. generating a second error value based on the second prediction and a value based on how far text corresponding to the given attender supertoken actually is from text corresponding to the given attendee supertoken in the given document (Shaw: Fig. 1 (cited above)); Explanation: Shaw explicitly models distance via the clipped (j-i) and learns parameters indexed by that distance. Distance-indexed relative position representations are trained parameters that show computation of an optimization error associated with predicted versus actual relative distances. generating a second attention score based on the first attention score, the first error value, and the second error value (Shaw: 3.1 Relation-aware Self-Attention). [Figures: media_image7.png and media_image6.png, attention equations from Shaw, Section 3.1] Explanation: Shaw directly modifies the attention score equation to incorporate relative position representations. The second attention score is computed using the original query-key interaction and learned relative position representations derived from ordering and distance.
Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate Shaw’s self-attention with relative positioning into Zuo and Xu’s system. Shaw teaches that incorporating relative positional relationships into attention improves transformer performance and is particularly relevant where models must understand structural relationships between elements. Since Zuo and Xu are concerned with modeling relationships among tokens in visually rich documents, applying Shaw’s relative ordering and distance-aware attention mechanism to Zuo and Xu would have been a predictable and logical modification to improve modeling of token relationships within documents. Regarding Claim 11, it is rejected for the same reasons set forth in claims 1 and 2, as claim 11 recites substantially the same steps. Regarding Claim 12, Zuo in view of Xu further in view of Shaw teaches the processing system of claim 11, and additional limitations are met as in the consideration of claims 1 and 2 above. Regarding Claim 13, Zuo in view of Xu further in view of Shaw teaches the processing system of claim 12, and additional limitations are met as in the consideration of claim 3 above. Regarding Claim 14, Zuo in view of Xu further in view of Shaw teaches the processing system of claim 11, and additional limitations are met as in the consideration of claim 4 above. Regarding Claim 15, Zuo in view of Xu further in view of Shaw teaches the processing system of claim 14, and additional limitations are met as in the consideration of claim 5 above. Regarding Claim 16, Zuo in view of Xu further in view of Shaw teaches the processing system of claim 11, and additional limitations are met as in the consideration of claim 6 above. Regarding Claim 17, Zuo in view of Xu further in view of Shaw teaches the processing system of claim 16, and additional limitations are met as in the consideration of claim 7 above. 
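For reference, Shaw's relation-aware attention, which the rejection maps to the claimed ordering- and distance-aware attention scores, modifies the query-key logit with a learned embedding of the clipped signed offset j - i. A minimal sketch follows, assuming toy dimensions and randomly initialized parameters; nothing here is from Shaw's released code.

```python
import numpy as np

def relative_attention_scores(x, wq, wk, rel_k, max_dist):
    """Attention logits in the style of Shaw et al.:
    e_ij = q_i . (k_j + a_ij) / sqrt(d), where a_ij is a learned
    embedding indexed by the clipped signed offset j - i."""
    n, _ = x.shape
    d = wq.shape[1]
    q, k = x @ wq, x @ wk
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            idx = np.clip(j - i, -max_dist, max_dist) + max_dist
            scores[i, j] = q[i] @ (k[j] + rel_k[idx]) / np.sqrt(d)
    return scores

# Toy setup: 5 tokens, 4-dim model, offsets clipped to [-2, 2].
rng = np.random.default_rng(0)
n, d, max_dist = 5, 4, 2
x = rng.normal(size=(n, d))
wq, wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
rel_k = rng.normal(size=(2 * max_dist + 1, d))  # one vector per clipped offset
scores = relative_attention_scores(x, wq, wk, rel_k, max_dist)
```

The sign of the offset carries ordering and its magnitude carries distance, which is the behavior the rejection cites for the first and second predictions.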
Regarding Claim 18, Zuo in view of Xu further in view of Shaw teaches the processing system of claim 11, and additional limitations are met as in the consideration of claim 8 above. Regarding Claim 19, Zuo in view of Xu further in view of Shaw teaches the processing system of claim 11, and additional limitations are met as in the consideration of claim 9 above. Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Zuo et al. in view of Xu et al., further in view of Wei et al. (“Masked Conditional Random Fields for Sequence Labeling”). Regarding Claim 10, Zuo in view of Xu teaches the processing system of claim 9, but they fail to teach that generating the set of classifications based on the plurality of predictions comprises performing dynamic processing based on the plurality of predictions to determine a Viterbi path representing an optimal combination of BIOES types and entity classes that generates the highest overall probability based on the plurality of predictions. However, Wei discloses a neural sequence labeling architecture in which token-level predictions produced by a neural encoder are decoded using a conditional random field (CRF) and explicitly teaches that “the decoding problem can be efficiently solved by the Viterbi algorithm,” which determines “the path having the highest score” among candidate tag sequences (3.1 Neural CRF Models). Wei further explains that CRF decoding assigns a global sequence score based on emission scores and transition scores, and that restricting decoding to legal BIO or BIOES paths improves accuracy by enforcing valid label transitions and maximizing sequence-level probability rather than independent token probabilities. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the system of Zuo and Xu to generate the set of classifications by performing dynamic processing using a CRF with Viterbi decoding.
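As background for this rationale, Viterbi decoding over emission and transition scores, with illegal tag transitions masked out as in constrained BIO/BIOES decoding, can be sketched as below. The tag set and score values are toy numbers chosen for illustration, not taken from Wei.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Highest-scoring tag path by dynamic programming (cf. Wei's CRF
    decoding): emissions[t, y] scores tag y at step t, and
    transitions[y, y2] scores moving from y to y2 (-inf = illegal)."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)   # best predecessor per tag
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):       # walk backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 3-step, 3-tag example with one forbidden transition (0 -> 2),
# standing in for a BIOES-illegal move such as O -> I.
em = np.array([[2.0, 0.0, 0.0],
               [0.0, 0.0, 1.5],
               [1.0, 0.0, 0.0]])
tr = np.zeros((3, 3))
tr[0, 2] = -np.inf
best = viterbi(em, tr)
```

Without the mask the greedy per-step choice would pass through tag 2 at step 1; the constrained Viterbi path instead stays on legal transitions while maximizing the total sequence score, which is the behavior the rejection attributes to Wei.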
Wei motivates the use of CRF-based Viterbi dynamic programming by explaining that independent token-wise decoding can produce illegal BIO/BIOES tag sequences, whereas Viterbi decoding over a constrained CRF path space yields an optimal label sequence and thoroughly resolves such errors, resulting in improved sequence labeling performance. Accordingly, a person of ordinary skill in the art would have been motivated to apply Viterbi decoding in order to determine an optimal BIOES label sequence having the highest overall probability. Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Zuo et al. in view of Xu et al. and Shaw et al., and further in view of Wei et al. Regarding Claim 20, Zuo in view of Xu further in view of Shaw teaches the processing system of claim 19, and additional limitations are met as in the consideration of claim 10 above. Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Torres (US 20230129874 A1) teaches a pre-trained transformer (e.g., BERT) that is used to generate contextual token embeddings for named-entity recognition, with multiple decoder heads that jointly perform token classification and confidence prediction and fine-tune the encoder based on downstream task losses. Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM ADU-JAMFI whose telephone number is (571)272-9298. The examiner can normally be reached M-T 8:00-6:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Bee, can be reached at (571) 270-5183.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /WILLIAM ADU-JAMFI/Examiner, Art Unit 2677 /ANDREW W BEE/Supervisory Patent Examiner, Art Unit 2677

Prosecution Timeline

Feb 16, 2024 — Application Filed
Jan 08, 2026 — Non-Final Rejection — §103
Feb 12, 2026 — Examiner Interview Summary
Feb 12, 2026 — Applicant Interview (Telephonic)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: Favorable
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 0 resolved cases by this examiner. Grant probability derived from career allow rate.
