Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 08/16/2023, 08/01/2024, and 01/03/2026 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Examiner's Remarks
Claims 9 – 13 are directed toward a computer-readable storage medium, which the originally filed specification defines to exclude transitory signals (See e.g. [00101] In contrast to computer-readable storage media, communication media can embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer storage media does not include communication media. That is, computer-readable storage media does not include communications media consisting solely of a modulated data signal, a carrier wave, or a propagated signal, per se.). Therefore, these claims are directed to statutory subject matter.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1 – 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The analysis below of the claims’ subject matter eligibility follows the 2019 Revised
Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50-57 (January 7, 2019) (“2019 PEG”)
and the 2024 Guidance Update on Patent Subject Matter Eligibility, Including on Artificial
Intelligence, 89 Fed. Reg. 58128-58138 (July 17, 2024) (“2024 AI SME Update”).
When considering subject matter eligibility under 35 U.S.C. 101, it must be determined
whether the claim is directed to one of the four statutory categories of invention, i.e., process,
machine, manufacture, or composition of matter (Step 1). If the claim does fall within one of the statutory categories, the second step in the analysis is to determine whether the claim is directed
to a judicial exception (Step 2A). The Step 2A analysis is broken into two prongs. In the first
prong (Step 2A, Prong 1), it is determined whether or not the claims recite a judicial exception
(e.g., mathematical concepts, mental processes, certain methods of organizing human activity). If
it is determined in Step 2A, Prong 1 that the claims recite a judicial exception, the analysis
proceeds to the second prong (Step 2A, Prong 2), where it is determined whether or not the
claims integrate the judicial exception into a practical application. If it is determined at step 2A,
Prong 2 that the claims do not integrate the judicial exception into a practical application, the
analysis proceeds to determining whether the claim is a patent-eligible application of the
exception (Step 2B). If an abstract idea is present in the claim, any element or combination of
elements in the claim must be sufficient to ensure that the claim integrates the judicial exception
into a practical application, or else amounts to significantly more than the abstract idea itself.
Claim 1
Step 1: The claim recites a method, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1:
tokenizing the structured text into a plurality of tokens; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
determining embedding vectors for the plurality of tokens; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
augmenting the embedding vectors with location vectors that represent locations of the plurality of tokens within the structured text; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
computing a structure-aware attention weight for a pair of the plurality of tokens based on the computed attention weights; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
receiving structured text; (Mere data gathering, Insignificant extra solution activity in MPEP § 2106.05(g))
computing attention weights for pairs of the plurality of tokens using a self-attention mechanism; (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
using the structure-aware attention weight to compute a hidden representation of an individual input token. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
receiving structured text; (Receiving or transmitting data using components and functions claimed at a high level of generality has been determined by the courts to be a well-understood, routine, and conventional activity in the field of computer functions; see MPEP § 2106.05(d)(II)(i).)
computing attention weights for pairs of the plurality of tokens using a self-attention mechanism; (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
using the structure-aware attention weight to compute a hidden representation of an individual input token. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
Considered as an ordered whole, the claim is directed to a method of tokenizing input data, which is nothing more than using machine learning models to group the provided data. Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
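For the reader's convenience only, the following is a minimal illustrative sketch, in Python, of the kind of processing pipeline that the limitations addressed above describe. All names, data, and values are hypothetical; the sketch is not drawn from applicant's specification or from the cited art and carries no evidentiary weight.

# Illustrative sketch only (hypothetical names and toy data): tokenize structured
# text, determine embeddings, augment with location vectors, compute self-attention
# weights, derive a structure-aware weight, and use it to form a hidden
# representation (context vector) for a token.
import numpy as np

rng = np.random.default_rng(0)
structured_text = '{"name": "Ada", "age": "36"}'

# Tokenize the structured text into a plurality of tokens.
tokens = (structured_text.replace('{', ' { ').replace('}', ' } ')
          .replace(':', ' : ').replace(',', ' , ').split())

d = 8                                                      # embedding dimension
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
embedding_table = rng.normal(size=(len(vocab), d))

# Determine embedding vectors for the plurality of tokens.
E = np.stack([embedding_table[vocab[t]] for t in tokens])

# Augment the embedding vectors with location vectors that represent each
# token's location (here simply its index) within the structured text.
location_vectors = rng.normal(size=(len(tokens), d))
E = E + location_vectors

# Compute attention weights for pairs of tokens using a self-attention mechanism.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = E @ Wq, E @ Wk, E @ Wv
scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # row softmax

# Compute a structure-aware attention weight for a pair of tokens based on the
# computed attention weights (here, averaged over two hypothetical structural groups).
branch_i, branch_j = [1, 2], [5, 6]
structure_aware_weight = attn[np.ix_(branch_i, branch_j)].mean()

# Use the structure-aware attention weight to compute a hidden representation
# (context vector) of an individual input token.
hidden = structure_aware_weight * V[branch_j].mean(axis=0)
print(hidden.shape)                                        # (8,)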
Claim 2 incorporates the rejections of claim 1.
Step 1: The claim recites a method, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 1 are incorporated.
augmenting the embedding vectors with metadata vectors that indicate a type of metadata associated with individual tokens. (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
The claim does not recite any additional limitations. Therefore, there are no additional elements to integrate the abstract ideas into a practical application. (Merely asserting that a judicial exception is to be carried out on a generic computer (i.e., “modifying data” of base claim 1) cannot meaningfully integrate the judicial exceptions into a practical application. See MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
Mere instructions to apply an exception (i.e., “modifying data” of base claim 1) cannot provide an inventive concept. The claim is not patent eligible.
Claim 3 incorporates the rejections of claim 2.
Step 1: The claim recites a method, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 2 are incorporated. Please see the analysis of
claim 2 above. Regarding the method steps recited in claim 2, these steps cover mental
processes based on comparing data.
Therefore, claim 3 is directed to an abstract idea – Mental Processes (i.e., can be performed in
the human mind, or by a human using a pen and paper, making observations, evaluations and
judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein the metadata vectors indicate that a token is a key of a key-value pair, a value of a key-value pair, a keyword, a column name, or a data type. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein the metadata vectors indicate that a token is a key of a key-value pair, a value of a key-value pair, a keyword, a column name, or a data type. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
Claim 4 incorporates the rejections of claim 1.
Step 1: The claim recites a method, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 1 are incorporated.
wherein the structured text comprises hierarchical text. (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
The claim does not recite any additional limitations. Therefore, there are no additional elements to integrate the abstract ideas into a practical application. (Merely asserting that a judicial exception is to be carried out on a generic computer (i.e., “identifying data” of base claim 1) cannot meaningfully integrate the judicial exceptions into a practical application. See MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
Mere instructions to apply an exception (i.e., “identifying data” of base claim 1) cannot provide an inventive concept. The claim is not patent eligible.
Claim 5 incorporates the rejections of claim 1.
Step 1: The claim recites a method, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 1 are incorporated.
wherein the structured text comprises a data table, the method further comprising: (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
generating a hierarchical representation of the data-table by converting a row of the data table to an entry in the hierarchical representation (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
wherein the row of the data table comprises a plurality of values (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
wherein the entry in the hierarchical representation comprises key-value pairs that represent the plurality of values. (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
The claim does not recite any additional limitations. Therefore, there are no additional elements to integrate the abstract ideas into a practical application. (Merely asserting that a judicial exception is to be carried out on a generic computer (i.e., “identifying data on a table” of base claim 1) cannot meaningfully integrate the judicial exceptions into a practical application. See MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
Mere instructions to apply an exception (i.e., “identifying data on a table” of base claim 1) cannot provide an inventive concept. The claim is not patent eligible.
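As a reading aid for the claim 5 limitations, the following short sketch (hypothetical column names and values, not taken from the claims or the cited art) shows a data-table row converted to an entry of a hierarchical representation whose key-value pairs represent the row's values.

# Illustrative sketch only: convert a row of a data table into an entry of a
# hierarchical representation comprising key-value pairs for the row's values.
columns = ["name", "age", "city"]          # hypothetical column names
row = ["Ada", 36, "London"]                # a row comprising a plurality of values

entry = {column: value for column, value in zip(columns, row)}
hierarchical_representation = {"rows": [entry]}
print(hierarchical_representation)
# {'rows': [{'name': 'Ada', 'age': 36, 'city': 'London'}]}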
Claim 6 incorporates the rejections of claim 1.
Step 1: The claim recites a method, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 1 are incorporated.
wherein the structured text comprises flat text (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
and wherein the flat text is converted to hierarchical text that includes a single branch of tokens. (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
The claim does not recite any additional limitations. Therefore, there are no additional elements to integrate the abstract ideas into a practical application. (Merely asserting that a judicial exception is to be carried out on a generic computer (i.e., “grouping input data” of base claim 1) cannot meaningfully integrate the judicial exceptions into a practical application. See MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
Mere instructions to apply an exception (i.e., “grouping input data” of base claim 1) cannot provide an inventive concept. The claim is not patent eligible.
Claim 7 incorporates the rejections of claim 1.
Step 1: The claim recites a method, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 1 are incorporated.
wherein the structured text comprises hierarchical text, (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
wherein the structure-aware attention weight is computed based on attention weights of tokens along branches from the root of the hierarchical text to the pair of the plurality of tokens. (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
The claim does not recite any additional limitations. Therefore, there are no additional elements to integrate the abstract ideas into a practical application. (Merely asserting that a judicial exception is to be carried out on a generic computer (i.e., “modifying weights” of base claim 1) cannot meaningfully integrate the judicial exceptions into a practical application. See MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
Mere instructions to apply an exception (i.e., “modifying weights” of base claim 1) cannot provide an inventive concept. The claim is not patent eligible.
Claim 8 incorporates the rejections of claim 7.
Step 1: The claim recites a method, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 7 are incorporated.
identifying a cartesian product of tokens in a first of the branches from the root of the hierarchical text and tokens in a second of the branches from the root of the hierarchical text (Mathematical Concepts: are defined as mathematical relationships, mathematical formulas or equations, or mathematical calculations.)
wherein the structure-aware attention weight is computed by averaging attention weights of pairs of tokens in the cartesian product. (Mathematical Concepts: are defined as mathematical relationships, mathematical formulas or equations, or mathematical calculations.)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
The claim does not recite any additional limitations. Therefore, there are no additional elements to integrate the abstract ideas into a practical application. (Merely asserting that a judicial exception is to be carried out on a generic computer (i.e., “calculating weights” of base claim 1) cannot meaningfully integrate the judicial exceptions into a practical application. See MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
Mere instructions to apply an exception (i.e., “calculating weights” of base claim 1) cannot provide an inventive concept. The claim is not patent eligible.
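As a reading aid for the claim 8 limitations, the following short sketch (hypothetical indices and values) shows a Cartesian product of token indices from two branches and the averaging of the corresponding attention weights into a single structure-aware weight.

# Illustrative sketch only: average attention weights over the Cartesian product
# of tokens from a first branch and a second branch of the hierarchical text.
from itertools import product
import numpy as np

attn = np.arange(16, dtype=float).reshape(4, 4) / 16.0   # hypothetical attention weights
first_branch = [0, 1]                                     # token indices in the first branch
second_branch = [2, 3]                                    # token indices in the second branch

pairs = list(product(first_branch, second_branch))        # Cartesian product of tokens
structure_aware_weight = np.mean([attn[i, j] for i, j in pairs])
print(structure_aware_weight)                             # 0.28125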
Claim 9
Step 1: The claim recites a computer-readable storage medium, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1:
tokenize the structured text into a plurality of tokens (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
determine embedding vectors for the plurality of tokens; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
augment the embedding vectors with location vectors that represent locations of the plurality of tokens within the structured text; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
augment the embedding vectors with metadata vectors that indicate a type of metadata associated with individual tokens; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
compute a structure-aware attention weight for a pair of the plurality of tokens based on the computed attention weights; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
receive structured text; (Mere data gathering, Insignificant extra solution activity in MPEP § 2106.05(g))
compute attention weights for pairs of the plurality of tokens using a self-attention mechanism; (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
use the structure-aware attention weight to compute a hidden representation of an individual input token. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
receive structured text; (Receiving or transmitting data using components and functions claimed at a high level of generality has been determined by the courts to be a well-understood, routine, and conventional activity in the field of computer functions; see MPEP § 2106.05(d)(II)(i).)
compute attention weights for pairs of the plurality of tokens using a self-attention mechanism; (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
use the structure-aware attention weight to compute a hidden representation of an individual input token. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
Considered as an ordered whole, the claim is directed to tokenizing input data, which is nothing more than using machine learning models to group the provided data. Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 10 incorporates the rejections of claim 9.
Step 1: The claim recites a computer-readable storage medium, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 9 are incorporated.
wherein the structured text comprises hierarchical text (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein the embedding vectors were trained in part based on structured masked-language modeling. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein the embedding vectors were trained in part based on structured masked-language modeling. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
Claim 11 incorporates the rejections of claim 10.
Step 1: The claim recites a computer-readable storage medium, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 10 are incorporated. Please see the analysis of
claim 10 above. Regarding the method steps recited in claim 10, these steps cover mental processes based on data prediction and selection.
Therefore, claim 11 is directed to an abstract idea – Mental Processes (i.e., can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed).
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein structured masked-language modeling masks tokens based on structure information derived from the hierarchical text. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein structured masked-language modeling masks tokens based on structure information derived from the hierarchical text. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
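As a reading aid for the claim 10 and 11 limitations, the following short sketch (hypothetical data and names) illustrates masking tokens based on structure information derived from hierarchical text, as a structured masked-language-modeling objective might do.

# Illustrative sketch only: mask tokens using structure information derived from
# hierarchical text (here, every token that is the value of a key-value pair).
hierarchical_text = {"user": {"name": "Ada", "city": "London"}}

def flatten(node):
    """Flatten hierarchical text into (token, structural-role) pairs."""
    items = []
    if isinstance(node, dict):
        for key, value in node.items():
            items.append((key, "key"))
            items.extend(flatten(value))
    else:
        items.append((str(node), "value"))
    return items

tokens_with_roles = flatten(hierarchical_text)
masked = [("[MASK]" if role == "value" else tok) for tok, role in tokens_with_roles]
print(masked)   # ['user', 'name', '[MASK]', 'city', '[MASK]']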
Claim 12 incorporates the rejections of claim 9.
Step 1: The claim recites a computer-readable storage medium, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 9 are incorporated. Please see the analysis of
claim 9 above. Regarding the method steps recited in claim 9, these steps cover mental processes based on data prediction and selection.
Therefore, claim 12 is directed to an abstract idea – Mental Processes (i.e., can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed).
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein location vectors encode a series of offsets from tokens in the hierarchy. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein location vectors encode a series of offsets from tokens in the hierarchy. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
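As a reading aid for the claim 12 limitation, the following short sketch (hypothetical tokens and a hypothetical encoding scheme) illustrates location vectors that encode a series of offsets from tokens in a hierarchy.

# Illustrative sketch only: encode each token's location as a fixed-length series
# of offsets from its ancestor tokens in the hierarchy.
paths = {"name": [0], "Ada": [0, 0], "city": [1], "London": [1, 0]}   # root-relative offsets
max_depth = 3
location_vectors = {tok: path + [0] * (max_depth - len(path)) for tok, path in paths.items()}
print(location_vectors["London"])   # [1, 0, 0]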
Claim 13 incorporates the rejections of claim 9.
Step 1: The claim recites a computer-readable storage medium, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 9 are incorporated. Please see the analysis of
claim 9 above. Regarding the method steps recited in claim 9, these steps cover mental processes based on data prediction and selection.
Therefore, claim 13 is directed to an abstract idea – Mental Processes (i.e., can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed).
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein metadata vectors encode a series of indications whether a token is a key of a key-value pair or a value of a key-value pair. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein metadata vectors encode a series of indications whether a token is a key of a key-value pair or a value of a key-value pair. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
Claim 14
Step 1: The claim recites a processing system, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1:
tokenize the structured text into a plurality of tokens; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
determine embedding vectors for the plurality of tokens; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
augment the embedding vectors with location vectors that represent locations of the plurality of tokens within the structured text; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
augment the embedding vectors with metadata vectors that indicate a type of metadata associated with individual tokens; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
compute a matrix of structure-aware attention weights for every pair of the plurality of tokens based on the computed attention weights; (Mental Processes: Can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed)
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
receive structured text; (Mere data gathering, Insignificant extra solution activity in MPEP § 2106.05(g))
compute attention weights for pairs of the plurality of tokens using a self-attention mechanism; (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
use the matrix of structure-aware attention weights to compute a hidden representation of an individual input token. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
receive structured text; (Receiving or transmitting data using components and functions claimed at a high level of generality has been determined by the courts to be a well-understood, routine, and conventional activity in the field of computer functions; see MPEP § 2106.05(d)(II)(i).)
compute attention weights for pairs of the plurality of tokens using a self-attention mechanism; (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
use the matrix of structure-aware attention weights to compute a hidden representation of an individual input token. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
Considered as an ordered whole, the claim is directed to tokenizing input data, which is nothing more than using machine learning models to group the provided data. Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 15 incorporates the rejections of claim 14.
Step 1: The claim recites a processing system, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 14 are incorporated. Please see the analysis of
claim 14 above. Regarding the method steps recited in claim 14, these steps cover mental processes based on data prediction and selection.
Therefore, claim 15 is directed to an abstract idea – Mental Processes (i.e., can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed).
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein the location vectors and metadata vectors are added to the token embeddings with position vectors that represent locations within the plurality of tokens. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein the location vectors and metadata vectors are added to the token embeddings with position vectors that represent locations within the plurality of tokens. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
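As a reading aid for the claim 15 limitation, the following short sketch (hypothetical shapes) illustrates adding location vectors and metadata vectors to token embeddings together with position vectors.

# Illustrative sketch only: token embeddings augmented by element-wise addition
# of position vectors, location vectors, and metadata vectors.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 6, 8
token_embeddings = rng.normal(size=(n_tokens, d))
position_vectors = rng.normal(size=(n_tokens, d))   # locations within the plurality of tokens
location_vectors = rng.normal(size=(n_tokens, d))   # locations within the structured text
metadata_vectors = rng.normal(size=(n_tokens, d))   # type of metadata per token

augmented = token_embeddings + position_vectors + location_vectors + metadata_vectors
print(augmented.shape)   # (6, 8)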
Claim 16 incorporates the rejections of claim 14.
Step 1: The claim recites a processing system, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 14 are incorporated. Please see the analysis of
claim 14 above. Regarding the method steps recited in claim 14, these steps cover mental processes based on data prediction and selection.
Therefore, claim 16 is directed to an abstract idea – Mental Processes (i.e., can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed).
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein the hidden representation is generated by performing a matrix multiplication of the matrix of structure-aware attention weights and a value vector. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein the hidden representation is generated by performing a matrix multiplication of the matrix of structure-aware attention weights and a value vector. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
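As a reading aid for the claim 16 limitation, the following short sketch (hypothetical shapes and values) shows hidden representations generated by matrix multiplication of a matrix of structure-aware attention weights and value vectors.

# Illustrative sketch only: hidden representations as the matrix product of a
# row-normalized matrix of structure-aware attention weights and value vectors.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8
structure_aware_weights = rng.random((n_tokens, n_tokens))
structure_aware_weights /= structure_aware_weights.sum(axis=-1, keepdims=True)   # normalize rows
values = rng.normal(size=(n_tokens, d))                                           # value vectors

hidden_representations = structure_aware_weights @ values   # one representation per input token
print(hidden_representations.shape)                          # (4, 8)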
Claim 17 incorporates the rejections of claim 16.
Step 1: The claim recites a processing system, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 16 are incorporated. Please see the analysis of
claim 16 above. Regarding the method steps recited in claim 16, these steps cover mental processes based on data prediction and selection.
Therefore, claim 17 is directed to an abstract idea – Mental Processes (i.e., can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed).
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein the value vector is trained using a feed-forward network (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
wherein the input of the feed-forward network is token embeddings that have been augmented to include location information. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein the value vector is trained using a feed-forward network (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
wherein the input of the feed-forward network is token embeddings that have been augmented to include location information. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
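As a reading aid for the claim 17 limitations, the following short sketch (hypothetical shapes) illustrates value vectors produced by a feed-forward (linear) layer whose input is token embeddings that have been augmented with location information.

# Illustrative sketch only: value vectors from a feed-forward projection of
# location-augmented token embeddings; W_V would be trained in practice.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8
augmented_embeddings = rng.normal(size=(n_tokens, d))   # embeddings + location information
W_V = rng.normal(size=(d, d))                            # feed-forward weights (trainable)
value_vectors = augmented_embeddings @ W_V
print(value_vectors.shape)   # (4, 8)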
Claim 18 incorporates the rejections of claim 14.
Step 1: The claim recites a processing system, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 14 are incorporated. Please see the analysis of
claim 14 above. Regarding the method steps recited in claim 14, these steps cover mental processes based on data prediction and selection.
Therefore, claim 18 is directed to an abstract idea – Mental Processes (i.e., can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed).
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein the hidden representation is used to train a machine learning model. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein the hidden representation is used to train a machine learning model. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
Claim 19 incorporates the rejections of claim 18.
Step 1: The claim recites a processing system, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 18 are incorporated. Please see the analysis of
claim 18 above. Regarding the method steps recited in claim 18, these steps cover mental processes based on data prediction and selection.
Therefore, claim 19 is directed to an abstract idea – Mental Processes (i.e., can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed).
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein the machine learning model is trained with different types of structured text. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein the machine learning model is trained with different types of structured text. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
Claim 20 incorporates the rejections of claim 14.
Step 1: The claim recites a processing system, which is one of the four statutory categories of eligible matter.
Step 2A Prong 1: The judicial exceptions of claim 14 are incorporated. Please see the analysis of
claim 14 above. Regarding the method steps recited in claim 14, these steps cover mental processes based on data prediction and selection.
Therefore, claim 20 is directed to an abstract idea – Mental Processes (i.e., can be performed in the human mind, or by a human using a pen and paper, making observations, evaluations and judgments as claimed).
Step 2A Prong 2: The judicial exceptions are not integrated into a practical application. In
particular, the claim recites these additional elements:
wherein attention weights are computed by a self-attention mechanism of a transformer architecture (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
wherein the structure-aware weights are computed by a structure-aware attention mechanism that consumes attention weights computed by the self-attention mechanism. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
Step 2B: The claim does not include additional elements that are sufficient to amount to
significantly more than the judicial exception.
wherein attention weights are computed by a self-attention mechanism of a transformer architecture (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
wherein the structure-aware weights are computed by a structure-aware attention mechanism that consumes attention weights computed by the self-attention mechanism. (Mere instructions to apply an exception as it recites only the idea of a solution or outcome as discussed in MPEP § 2106.05(f))
The courts have found that adding the words "apply it" (or an equivalent) with the
judicial exception, or mere instructions to implement an abstract idea on a computer does not
qualify as “significantly more”. (See MPEP § 2106.05(I)(A))
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 – 3, 9, and 13 – 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hashimoto (US 20210397799 A1) in view of Lee (US 20240354504 A1), and further in view of Naumov (US 11410015 B1).
Regarding claim 1, Hashimoto teaches receiving structured text; (See e.g. [0038], At a process 410, structured source text is received.)
tokenizing the structured text into a plurality of tokens; (See e.g. [0040], which breaks the structured source text into tokens x.sub.i, where each of the tokens x.sub.i may correspond to a word, a number, a tag, and/or the like.)
determining embedding vectors for the plurality of tokens; (See e.g. [0041], The embeddings of each of the tokens x.sub.i are then combined in a vector as h.sub.0.sup.x=[h.sub.0.sup.x(x.sub.i), h.sub.0.sup.x(x.sub.2), . . . , h.sub.0.sup.x(x.sub.N)] where N is the number of tokens in the structured source data.)
computing attention weights for pairs of the plurality of tokens using a self-attention mechanism; (See e.g. [0040], As shown in FIG. 5, structured text translator 500 receives structured source text x, such as the structured source text 140 and/or the structured source text received during process 410. The structured source text x is passed to an embedding module 510, which breaks the structured source text into tokens x.sub.i, where each of the tokens x.sub.i may correspond to a word, a number, a tag, and/or the like.) (See e.g. [0042], The output, h.sub.0.sup.x of embedding module 510 is then passed to a multi-stage encoder 520 of a multi-layer attention-based transformer.) (See e.g. [0047], In some examples, attention network 600 is a multi-layer neural network. As shown in FIG. 6…Each of the q, k, and v are subject to respective weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 according to Equation 5. The weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 are altered during training using back propagation.) (See e.g. [0064], As shown by token pairs 1051a-1051b, 1052a-1052b, and 1053a-1053b, the tokens [pairs of the plurality of tokens] “200”, “<uicontrol>”, and “</uicontrol>” represent three of the tokens copied from structured source text 1010 to structured translated text 1040.)
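For readability, the passages quoted above describe a standard scaled dot-product self-attention computation; a minimal generic sketch of that mechanism (hypothetical names and values, not drawn from Hashimoto's disclosure) is reproduced below as a reading aid only.

# Illustrative sketch only: generic scaled dot-product self-attention in which
# q, k, and v projections (W_Q, W_K, W_V) produce Q, K, V, and a softmaxed dot
# product of Q and K is applied to V.
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                                   # number of tokens, model dimension
h0 = rng.normal(size=(N, d))                  # token embeddings

W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = h0 @ W_Q, h0 @ W_K, h0 @ W_V

scores = Q @ K.T / np.sqrt(d)
attention_weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = attention_weights @ V                # attention weights applied to values
print(attention_weights.shape, output.shape)  # (5, 5) (5, 8)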
Hashimoto does not teach augmenting the embedding vectors with location vectors that represent locations of the plurality of tokens within the structured text; computing a structure-aware attention weight for a pair of the plurality of tokens based on the computed attention weights; and using the structure-aware attention weight to compute a hidden representation of an individual input token.
Lee teaches augmenting the embedding vectors with location vectors that represent locations of the plurality of tokens within the structured text; (See e.g. [0003], (a) generating a beta-skeleton graph based on a plurality of tokens, each given token of the plurality of tokens corresponding to a given string of text in the given document, and wherein the beta-skeleton graph comprises, for each given token: (i) a node comprising a vector based on content and location of the given string of text within the given document;… the beta-skeleton graph further comprises, for each given token: a given edge embedding corresponding to each given edge of the one or more edges, the given edge embedding being based on a spatial relationship in the given document between the given token and a token corresponding to the neighboring node to which the given edge is linked.)
computing a structure-aware attention weight for a pair of the plurality of tokens based on the computed attention weights; (See e.g. [0002], In some aspects of the technology, the model uses a graph convolutional network (“GCN”) to generate contextualized “supertoken” embeddings for each token, and feeds them to a transformer that employs a sparse attention paradigm in which attention weights for at least some supertokens are modified based on differences between predicted and actual values of the order and distance between the attender and attendee [pair of the plurality of tokens] supertokens…Through the incorporation of GCN-generated supertokens, the structure-aware sequence models of the present technology can explicitly preserve local syntactic information that may otherwise be missed in the local attention calculations (e.g., for “long-long” pairings in ETC and BigBird) for a sequence that has not been properly serialized.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto and Lee before them, to include Lee’s structure-aware location vectors, which would allow Hashimoto’s model to enhance semantic understanding and dimensionality reduction. One would have been motivated to make such a combination in order to improve language parsing and understanding across different types of documents, as suggested by Lee (US 20240354504 A1) at [0001].
Hashimoto and Lee do not teach and using the [structure-aware] attention weight to compute a hidden representation of an individual input token.
Naumov teaches and using the [structure-aware] attention weight to compute a hidden representation of an individual input token. (See e.g. [C12:L46 – 48], iteratively updating a previous version of the attention matrix with the context vector generated [hidden representation] (Examiner’s Notes: This mapping is made based upon applicant’s specification (05/04/2023) of this application, See e.g. [0027], “Hidden representation 140 may also be referred to as a context vector.”) from each excess input token yielding a final attention matrix at the last excess input token)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee and Naumov before them, to include Naumov’s hidden representation, which would allow Hashimoto and Lee’s model to have improved performance and interpretability. One would have been motivated to make such a combination in order to improve translations and correlation of tokens in training, as suggested by Naumov (US 11410015 B1) at C8:L38 – 44.
Regarding claim 2, Hashimoto, Lee and Naumov teach the method of claim 1. Hashimoto teaches augmenting the embedding vectors with metadata vectors that indicate a type of metadata associated with individual tokens. (See e.g. [0045], The embeddings of each of the tokens [individual tokens] y.sub.j are then combined in a vector [embedding vectors]) (See e.g. [0047 – 0048], As shown in FIG. 6, attention network 600 receives a query q∈custom-character.sup.d.sup.q, a key k∈custom-character.sup.d.sup.k, and a value v∈custom-character.sup.d.sup.v. Each of the q, k, and v are subject to respective weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 according to Equation 5…The resulting Q, K, and V vectors [metadata vectors] are passed through an attention transfer function 640, which generates a dot product of Q and K, which is then applied to V according to Equation 6.) (See e.g. [0052], Decoder 720 receives layer input (e.g., from an input network for a first layer in a decoding stack, such as embedding module 530, or from layer output of a next lowest layer, such as any of the attention decoders 541-549 except for attention decoder 549, for all other layers of the decoding stack) and provides it to all three (q, k, and v) inputs of a multi-head attention network 721, thus multi-head attention network 721 is configured as a self-attention network. Each head of multi-head attention network 721 is consistent with attention network 600.)
Regarding claim 3, Hashimoto, Lee and Naumov teach the method of claim 2. Hashimoto teaches wherein the metadata vectors indicate that a token is a key of a key-value pair, a value of a key-value pair, a keyword, a column name, or a data type. (See e.g. [0047], As shown in FIG. 6, attention network 600 receives a query q∈custom-character.sup.d.sup.q, a key k∈custom-character.sup.d.sup.k, [a key of a key-value pair] and a value v∈custom-character.sup.d.sup.v. [a value of a key-value pair] Each of the q, k, and v are subject to respective weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 according to Equation 5…The resulting Q, K, and V vectors [metadata vectors] are passed through an attention transfer function 640, which generates a dot product of Q and K, which is then applied to V according to Equation 6.)
Regarding claim 9, Hashimoto teaches A computer-readable storage medium having computer-executable instructions stored thereupon that, when executed by a processing system, cause the processing system to: (See e.g. [Claim 8], a memory storing a plurality of processor-executable instructions; a processor executing the plurality of processor-executable instructions to perform operations comprising)
receive structured text; (See e.g. [0038], At a process 410, structured source text is received.)
tokenize the structured text into a plurality of tokens; (See e.g. [0040], which breaks the structured source text into tokens x.sub.i, where each of the tokens x.sub.i may correspond to a word, a number, a tag, and/or the like.)
determine embedding vectors for the plurality of tokens; (See e.g. [0041], The embeddings of each of the tokens x.sub.i are then combined in a vector as h.sub.0.sup.x=[h.sub.0.sup.x(x.sub.i), h.sub.0.sup.x(x.sub.2), . . . , h.sub.0.sup.x(x.sub.N)] where N is the number of tokens in the structured source data.)
augment the embedding vectors with metadata vectors that indicate a type of metadata associated with individual tokens; (See e.g. [0045], The embeddings of each of the tokens [individual tokens] y.sub.j are then combined in a vector [embedding vectors]) (See e.g. [0047 – 0048], As shown in FIG. 6, attention network 600 receives a query q∈custom-character.sup.d.sup.q, a key k∈custom-character.sup.d.sup.k, and a value v∈custom-character.sup.d.sup.v. Each of the q, k, and v are subject to respective weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 according to Equation 5…The resulting Q, K, and V vectors [metadata vectors] are passed through an attention transfer function 640, which generates a dot product of Q and K, which is then applied to V according to Equation 6.) (See e.g. [0052], Decoder 720 receives layer input (e.g., from an input network for a first layer in a decoding stack, such as embedding module 530, or from layer output of a next lowest layer, such as any of the attention decoders 541-549 except for attention decoder 549, for all other layers of the decoding stack) and provides it to all three (q, k, and v) inputs of a multi-head attention network 721, thus multi-head attention network 721 is configured as a self-attention network. Each head of multi-head attention network 721 is consistent with attention network 600.)
compute attention weights for pairs of the plurality of tokens using a self-attention mechanism; (See e.g. [0040], As shown in FIG. 5, structured text translator 500 receives structured source text x, such as the structured source text 140 and/or the structured source text received during process 410. The structured source text x is passed to an embedding module 510, which breaks the structured source text into tokens x.sub.i, where each of the tokens x.sub.i may correspond to a word, a number, a tag, and/or the like.)(See e.g. [0042], The output, h.sub.0.sup.x of embedding module 510 is then passed to a multi-stage encoder 520 of a multi-layer attention-based transformer.) (See e.g. [0047], In some examples, attention network 600 is a multi-layer neural network As shown in FIG. 6…Each of the q, k, and v are subject to respective weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 according to Equation 5. The weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 are altered during training using back propagation.) (See e.g. [0064], As shown by token pairs 1051a-1051b, 1052a-1052b, and 1053a-1053b, the tokens [pairs of the plurality of tokens] “200”, “<uicontrol>”, and “</uicontrol>” represent three of the tokens copied from structured source text 1010 to structured translated text 1040.)
Hashimoto does not teach augment the embedding vectors with location vectors that represent locations of the plurality of tokens within the structured text; compute a structure-aware attention weight for a pair of the plurality of tokens based on the computed attention weights; and use the structure-aware attention weight to compute a hidden representation of an individual input token.
Lee teaches augment the embedding vectors with location vectors that represent locations of the plurality of tokens within the structured text; (See e.g. [0003], (a) generating a beta-skeleton graph based on a plurality of tokens, each given token of the plurality of tokens corresponding to a given string of text in the given document, and wherein the beta-skeleton graph comprises, for each given token: (i) a node comprising a vector based on content and location of the given string of text within the given document;… the beta-skeleton graph further comprises, for each given token: a given edge embedding corresponding to each given edge of the one or more edges, the given edge embedding being based on a spatial relationship in the given document between the given token and a token corresponding to the neighboring node to which the given edge is linked.)
compute a structure-aware attention weight for a pair of the plurality of tokens based on the computed attention weights (See e.g. [0002], In some aspects of the technology, the model uses a graph convolutional network (“GCN”) to generate contextualized “supertoken” embeddings for each token, and feeds them to a transformer that employs a sparse attention paradigm in which attention weights for at least some supertokens are modified based on differences between predicted and actual values of the order and distance between the attender and attendee [pair of the plurality of tokens] supertokens…Through the incorporation of GCN-generated supertokens, the structure-aware sequence models of the present technology can explicitly preserve local syntactic information that may otherwise be missed in the local attention calculations (e.g., for “long-long” pairings in ETC and BigBird) for a sequence that has not been properly serialized.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto and Lee before them, to include Lee’s structure-aware location vectors which would allow Hashimoto’s model to enhance semantic understanding and dimensionality reduction. One would have been motivated to make such a combination in order to improve language parsing and understanding across different types of documents, as suggested by Lee (US 20240354504 A1) (0001)
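Examiner’s note (illustrative only; not relied upon for the rejection): one way token embeddings could be augmented with metadata vectors and location vectors, as mapped above, is by element-wise addition of the separate vectors. The Python sketch below is an assumption for illustration only; the table names and shapes are hypothetical.

import numpy as np

def augment_embeddings(token_emb, metadata_ids, location_ids, meta_table, loc_table):
    # token_emb: (N, d) embedding vectors for the tokenized structured text
    # meta_table[m]: metadata vector indicating the type of metadata of a token (e.g., key vs. value)
    # loc_table[p]: location vector representing the token's location within the structured text
    meta_vecs = meta_table[metadata_ids]     # (N, d)
    loc_vecs = loc_table[location_ids]       # (N, d)
    return token_emb + meta_vecs + loc_vecs  # augmented embeddings fed to the attention layers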
Hashimoto and Lee do not teach and use the [structure-aware] attention weight to compute a hidden representation of an individual input token.
Naumov teaches and use the [structure-aware] attention weight to compute a hidden representation of an individual input token. (See e.g. [C12:L46 – 48], iteratively updating a previous version of the attention matrix with the context vector generated [hidden representation] (Examiner’s Notes: This mapping is made based upon applicant’s specification (05/04/2023) of this application, See e.g. [0027], “Hidden representation 140 may also be referred to as a context vector.”) from each excess input token yielding a final attention matrix at the last excess input token)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee and Naumov before them, to include Naumov’s hidden representation which would allow Hashimoto and Lee’s model to have improved performance and interpretability. One would have been motivated to make such a combination in order to improve translations and correlation of tokens in training, as suggested by Naumov (US 11410015 B1) (C8:L38 – 44)
Regarding claim 13, Hashimoto, Lee and Naumov teach the processing computer-readable storage medium of claim 9. Hashimoto teaches wherein metadata vectors encode a series of indications whether a token is a key of a key-value pair or a value of a key-value pair. (See e.g. [0047], As shown in FIG. 6, attention network 600 receives a query q∈custom-character.sup.d.sup.q, a key k∈custom-character.sup.d.sup.k, [a key-value pair] and a value v∈custom-character.sup.d.sup.v. [value of a key-value pair] Each of the q, k, and v are subject to respective weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 according to Equation 5…The resulting Q, K, and V vectors [metadata vectors] are passed through an attention transfer function 640, which generates a dot product of Q and K, which is then applied to V according to Equation 6.)
Regarding claim 14, Hashimoto teaches A processing system, comprising: a processor; and a computer-readable storage medium having computer-executable instructions stored thereupon that, when executed by the processor, cause the processing system to: (See e.g. [Claim 8], a memory storing a plurality of processor-executable instructions; a processor executing the plurality of processor-executable instructions to perform operations comprising)
receive structured text; (See e.g. [0038], At a process 410, structured source text is received.)
tokenize the structured text into a plurality of tokens; (See e.g. [0040], which breaks the structured source text into tokens x.sub.i, where each of the tokens x.sub.i may correspond to a word, a number, a tag, and/or the like.)
determine embedding vectors for the plurality of tokens; (See e.g. [0041], The embeddings of each of the tokens x.sub.i are then combined in a vector as h.sub.0.sup.x=[h.sub.0.sup.x(x.sub.i), h.sub.0.sup.x(x.sub.2), . . . , h.sub.0.sup.x(x.sub.N)] where N is the number of tokens in the structured source data.)
augment the embedding vectors with metadata vectors that indicate a type of metadata associated with individual tokens; (See e.g. [0045], The embeddings of each of the tokens [individual tokens] y.sub.j are then combined in a vector [embedding vectors]) (See e.g. [0047 – 0048], As shown in FIG. 6, attention network 600 receives a query q∈custom-character.sup.d.sup.q, a key k∈custom-character.sup.d.sup.k, and a value v∈custom-character.sup.d.sup.v. Each of the q, k, and v are subject to respective weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 according to Equation 5…The resulting Q, K, and V vectors [metadata vectors] are passed through an attention transfer function 640, which generates a dot product of Q and K, which is then applied to V according to Equation 6.) (See e.g. [0052], Decoder 720 receives layer input (e.g., from an input network for a first layer in a decoding stack, such as embedding module 530, or from layer output of a next lowest layer, such as any of the attention decoders 541-549 except for attention decoder 549, for all other layers of the decoding stack) and provides it to all three (q, k, and v) inputs of a multi-head attention network 721, thus multi-head attention network 721 is configured as a self-attention network. Each head of multi-head attention network 721 is consistent with attention network 600.)
compute attention weights for pairs of the plurality of tokens using a self-attention mechanism; (See e.g. [0040], As shown in FIG. 5, structured text translator 500 receives structured source text x, such as the structured source text 140 and/or the structured source text received during process 410. The structured source text x is passed to an embedding module 510, which breaks the structured source text into tokens x.sub.i, where each of the tokens x.sub.i may correspond to a word, a number, a tag, and/or the like.)(See e.g. [0042], The output, h.sub.0.sup.x of embedding module 510 is then passed to a multi-stage encoder 520 of a multi-layer attention-based transformer.) (See e.g. [0047], In some examples, attention network 600 is a multi-layer neural network As shown in FIG. 6…Each of the q, k, and v are subject to respective weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 according to Equation 5. The weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 are altered during training using back propagation.) (See e.g. [0064], As shown by token pairs 1051a-1051b, 1052a-1052b, and 1053a-1053b, the tokens [pairs of the plurality of tokens] “200”, “<uicontrol>”, and “</uicontrol>” represent three of the tokens copied from structured source text 1010 to structured translated text 1040.)
Hashimoto does not teach augment the embedding vectors with location vectors that represent locations of the plurality of tokens within the structured text; compute a matrix of structure-aware attention weights for every pair of the plurality of tokens based on the computed attention weights; and use the matrix of structure-aware attention weights to compute a hidden representation of an individual input token.
Lee teaches augment the embedding vectors with location vectors that represent locations of the plurality of tokens within the structured text; (See e.g. [0003], (a) generating a beta-skeleton graph based on a plurality of tokens, each given token of the plurality of tokens corresponding to a given string of text in the given document, and wherein the beta-skeleton graph comprises, for each given token: (i) a node comprising a vector based on content and location of the given string of text within the given document;… the beta-skeleton graph further comprises, for each given token: a given edge embedding corresponding to each given edge of the one or more edges, the given edge embedding being based on a spatial relationship in the given document between the given token and a token corresponding to the neighboring node to which the given edge is linked.)
compute [a matrix of] structure-aware attention weights for every pair of the plurality of tokens based on the computed attention weights; (See e.g. [0002], In some aspects of the technology, the model uses a graph convolutional network (“GCN”) to generate contextualized “supertoken” embeddings for each token, and feeds them to a transformer that employs a sparse attention paradigm in which attention weights for at least some supertokens are modified based on differences between predicted and actual values of the order and distance between the attender and attendee [pair of the plurality of tokens] supertokens…Through the incorporation of GCN-generated supertokens, the structure-aware sequence models of the present technology can explicitly preserve local syntactic information that may otherwise be missed in the local attention calculations (e.g., for “long-long” pairings in ETC and BigBird) for a sequence that has not been properly serialized.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto and Lee before them, to include Lee’s structure-aware location vectors which would allow Hashimoto’s model to enhance semantic understanding and dimensionality reduction. One would have been motivated to make such a combination in order to improve language parsing and understanding across different types of documents, as suggested by Lee (US 20240354504 A1) (0001)
Hashimoto and Lee do not teach and use the matrix of [structure-aware] attention weights to compute a hidden representation of an individual input token.
Naumov teaches and use the matrix of [structure-aware] attention weights to compute a hidden representation of an individual input token. (See e.g. [C12:L46 – 48], iteratively updating a previous version of the attention matrix with the context vector generated [hidden representation] (Examiner’s Notes: This mapping is made based upon applicant’s specification (05/04/2023) of this application, See e.g. [0027], “Hidden representation 140 may also be referred to as a context vector.”) from each excess input token yielding a final attention matrix at the last excess input token)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee and Naumov before them, to include Naumov’s hidden representation which would allow Hashimoto and Lee’s model to have improved performance and interpretability. One would have been motivated to make such a combination in order to improve translations and correlation of tokens in training, as suggested by Naumov (US 11410015 B1) (C8:L38 – 44)
Regarding claim 15, Hashimoto, Lee and Naumov teach the processing system of claim 14.
Hashimoto and Lee do not teach wherein the location vectors and metadata vectors are added to the token embeddings with position vectors that represent locations within the plurality of tokens.
Naumov teaches wherein the location vectors and metadata vectors are added to the token embeddings with position vectors that represent locations within the plurality of tokens. (See e.g. [C9:L65 – 4], The mathematical formalism of attention may be exemplarily parameterized by a set of fixed-sized context vectors, c.sub.i=(c.sub.1, c.sub.2, . . . , c.sub.L), where L may be dependent upon the input sequence, such as the number of input tokens. Each c.sub.i may be localized to a certain spatial, temporal, locational position [location vectors] of the input.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee and Naumov before them, to include Naumov’s position vectors which would allow Hashimoto and Lee’s model to have improved performance and interpretability. One would have been motivated to make such a combination in order to improve translations and correlation of tokens in training, as suggested by Naumov (US 11410015 B1) (C8:L38 – 44)
Regarding claim 16, Hashimoto, Lee and Naumov teach the processing system of claim 14.
Hashimoto and Naumov do not teach matrix multiplication or structure-aware attention.
Lee teaches matrix multiplication (See e.g. [0076], The resulting the query and key vectors q.sub.i and k.sub.j, will then be subjected to matrix multiplication (function 734) to generate an initial pre-SoftMax attention score.)
Lee also teaches structure-aware attention (See e.g. [0002], Through the incorporation of GCN-generated supertokens, the structure-aware sequence models of the present technology can explicitly preserve local syntactic information that may otherwise be missed in the local attention calculations (e.g., for “long-long” pairings in ETC and BigBird) for a sequence that has not been properly serialized.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee and Naumov before them, to include Lee’s matrix multiplication and structure-aware attention which would allow Hashimoto and Naumov’s model to enhance semantic understanding and dimensionality reduction. One would have been motivated to make such a combination in order to improve language parsing and understanding across different types of documents, as suggested by Lee (US 20240354504 A1) (0001)
Hashimoto and Lee do not teach wherein the hidden representation is generated [by performing a matrix multiplication of the] matrix of [structure-aware] attention weights and a value vector.
Naumov teaches wherein the hidden representation is generated [by performing a matrix multiplication of the] matrix of [structure-aware] attention weights and a value vector. (See e.g. [C12:L46 – 48], iteratively updating a previous version of the attention matrix with the context vector generated [hidden representation] (Examiner’s Notes: This mapping is made based upon applicant’s specification (05/04/2023) of this application, See e.g. [0027], “Hidden representation 140 may also be referred to as a context vector.”) from each excess input token yielding a final attention matrix at the last excess input token) (See e.g. [C11:L43 – 47], With the calculation of the attention weights, these may be used to compute an updated or current context vector using a function that returns the weighted context vector summarizing the whole context set c according to the attention weights [matrix of attention weights and a value vector.] for a particular decoder RNN stage 108.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee and Naumov before them, to include Naumov’s hidden representation which would allow Hashimoto and Lee’s model to have improved performance and interpretability. One would have been motivated to make such a combination in order to improve translations and correlation of tokens in training, as suggested by Naumov (US 11410015 B1) (C8:L38 – 44)
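Examiner’s note (illustrative only; not relied upon for the rejection): the limitation mapped above, generating the hidden representation by a matrix multiplication of the matrix of attention weights and the value vectors, can be sketched in Python as follows; the sketch is the examiner’s illustration and not a characterization of the applicant’s disclosure or of the cited references.

import numpy as np

def hidden_representation(structure_aware_weights, V):
    # structure_aware_weights: (N, N) matrix of attention weights, one row per input token
    # V: (N, d) value vectors
    H = structure_aware_weights @ V  # matrix multiplication; row i is the hidden representation
    return H                         # (context vector) for the i-th individual input token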
Regarding claim 17, Hashimoto, Lee and Naumov teach the processing system of claim 16. Hashimoto teaches wherein the value vector is trained using a feed-forward network, and (See e.g. [0052], The output of multi-head attention network 711 is provided to a feed forward network 712 with both the input and output of feed forward network 712 being provided to an addition and normalization module 713, which generates the layer output for encoder 710.)
wherein the input of the feed-forward network is token embeddings that have been augmented [to include location information.] (See e.g. [0052], The output of multi-head attention network 711 is provided to a feed forward network 712 with both the input and output of feed forward network 712 being provided to an addition and normalization module 713, which generates the layer output for encoder 710.) (See e.g. [0052], Encoder 710 receives layer input (e.g., from an input network for a first layer in an encoding stack, such as embedding module 510, or from layer output of a next lowest layer, such as any of the attention encoders 521-529 except for attention encoder 529, for all other layers of the encoding stack) and provides it to all three (q, k, and v) inputs of a multi-head attention network 711, thus multi-head attention network 711 is configured as a self-attention network.) (See e.g. [0042], The output, h.sub.0.sup.x [token] of embedding module 510 is then passed to a multi-stage encoder 520 of a multi-layer attention-based transformer.)
Hashimoto and Lee do not teach to include location information.
Naumov teaches to include location information. (See e.g. [C9:L65 – 4], The mathematical formalism of attention may be exemplarily parameterized by a set of fixed-sized context vectors, c.sub.i=(c.sub.1, c.sub.2, . . . , c.sub.L), where L may be dependent upon the input sequence, such as the number of input tokens. Each c.sub.i may be localized to a certain spatial, temporal, locational position [location vectors] of the input.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee and Naumov before them, to include Naumov’s location information which would allow Hashimoto and Lee’s model to have improved performance and interpretability. One would have been motivated to make such a combination in order to improve translations and correlation of tokens in training, as suggested by Naumov (US 11410015 B1) (C8:L38 – 44)
Regarding claim 18, Hashimoto, Lee and Naumov teach the processing system of claim 14.
Hashimoto and Lee do not teach wherein the hidden representation is used to train a machine learning model.
Naumov teaches wherein the hidden representation is used to train a machine learning model. (See e.g. [C12:L46 – 48], iteratively updating a previous version of the attention matrix with the context vector generated [hidden representation] (Examiner’s Notes: This mapping is made based upon applicant’s specification (05/04/2023) of this application, See e.g. [0027], “Hidden representation 140 may also be referred to as a context vector.”) from each excess input token yielding a final attention matrix at the last excess input token) (See e.g. [C1:L11 – 13], Neural machine translation attempts to build and train a single, large neural network system that inputs a sentence and outputs a correct translation.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee and Naumov before them, to include Naumov’s hidden representation which would allow Hashimoto and Lee’s model to have improved performance and interpretability. One would have been motivated to make such a combination in order to improve translations and correlation of tokens in training, as suggested by Naumov (US 11410015 B1) (C8:L38 – 44)
Regarding claim 19, Hashimoto, Lee and Naumov teach the processing system of claim 18. Hashimoto teaches wherein the machine learning model is trained with different types of structured text. (See e.g. [0066], A text only translator (“OT”) is shown as a baseline for the displayed metrics. The text only translator is a natural language translator trained and tested on the same training and testing pairs, but without using additional structures or knowledge to address the embedded XML tags in the source and translated text. A first structured text translator (“X”) is based on structured text translator 500. A second structured text translator (“Xrs”) is based on structured text translator 900 with support for both copying from the structured source text and retrieved from the structured reference text.) (See e.g. [0040], In some examples, structured text translator 500 is a multi-layer neural network.)
Regarding claim 20, Hashimoto, Lee and Naumov teach the processing system of claim 14. Hashimoto teaches wherein attention weights are computed by a self-attention mechanism of a transformer architecture (See e.g. [0053], Referring back to FIG. 5, in addition to the multi-layer attention-based transformer, structured text translator 500 further includes a beam module 550 for processing the output of decoder 540.) (See e.g. [0052], Decoder 720 receives layer input (e.g., from an input network for a first layer in a decoding stack, such as embedding module 530, or from layer output of a next lowest layer, such as any of the attention decoders 541-549 except for attention decoder 549, for all other layers of the decoding stack) and provides it to all three (q, k, and v) inputs of a multi-head attention network 721, thus multi-head attention network 721 is configured as a self-attention network.) (See e.g. [0047], The weights W.sup.Q 610, W.sup.K 620, and W.sup.V 630 are altered during training using back propagation.)
wherein the structure-aware weights are computed by a structure-aware attention mechanism that consumes attention weights computed by the self-attention mechanism. (See e.g. [0050], The second variant form is a self-attention network that is a multi-head attention network where the q, k, and v inputs are the same for each head of the attention network.) (See e.g. [0048], The resulting Q, K, and V vectors are passed through an attention transfer function 640, which generates a dot product of Q and K, which is then applied to V according to Equation 6.)
Claims 4 – 7 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Hashimoto (US 20210397799 A1) in view of Lee (US 20240354504 A1) further in view of Naumov (US 11410015 B1) further in view of Sawarkar (US 20240193191 A1)
Regarding claim 4, Hashimoto, Lee and Naumov teach the method of claim 1.
Hashimoto, Lee and Naumov do not teach wherein the structured text comprises hierarchical text.
Sawarkar teaches wherein the structured text comprises hierarchical text. (See e.g. [0108], Multidimensional inferential system maps (1202) each of the at least one structured table and associated hierarchical structure using an agglomerative clustering technique… Multidimensional inferential system outputs (1206) the generated inferential natural language text.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Sawarkar’s hierarchical text which would allow Hashimoto, Lee and Naumov’s model to improve classification and reduce dependency on annotation. One would have been motivated to make such a combination in order to improve relationship identification, as suggested by Sawarkar (US 20240193191 A1) (0002)
Regarding claim 5, Hashimoto, Lee and Naumov teach the method of claim 1. Hashimoto teaches wherein the structured text comprises a data table, the method further comprising: (See e.g. [0005], Natural language processing and the ability of a system to translate natural language that is in a structured form that includes embedded tags (e.g., XML [a data table], HTML, and/or the like) is an important machine translation task.)
key-value pairs (See e.g. [0047], As shown in FIG. 6, attention network 600 receives a query q∈custom-character.sup.d.sup.q, a key k∈custom-character.sup.d.sup.k, [a key of a key-value pair] and a value v∈custom-character.sup.d.sup.v.)
Hashimoto, Lee and Naumov do not teach generating a hierarchical representation of the data table by converting a row of the data table to an entry in the hierarchical representation, wherein the row of the data table comprises a plurality of values, and wherein the entry in the hierarchical representation comprises key[-value pairs] that represent the plurality of values.
Sawarkar teaches generating a hierarchical representation of the data-table by converting a row of the data table to an entry in the hierarchical representation (See e.g. [0074], The textification module 204 may be operative to convert one or more rows of the multidimensional table into a structure comprising full sentences [converting a row of the data table] and assign a row identifier and unique token for each row. Further, textification module creates indexes for each row identifier and unique token. As discussed in more detail with reference to FIG. 5, textification module 204 may map each table (e.g., a unidimensional table or a multidimensional table) and its associated hierarchical structure [generating a hierarchical representation] using an agglomerative clustering method)
wherein the row of the data table comprises at plurality of values, and wherein the entry in the hierarchical representation comprises key[-value pairs] that represent the plurality of values. (See e.g. [0080], Further, textification module 303 may generate a row identifier (e.g., ‘Trx1’) [key] for each row and then assign a unique token 602.) (See e.g. [0051], The table structure (e.g., a multidimensional table) for a clinical trial report is a complex structure with multiple rows and multiple columns [plurality of values] and requires a multidimensional inferential system) (Examiner’s notes: Fig. 6 also depicts the plurality of values per row with a key/ID/unique identifier)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Sawarkar’s hierarchical text which would allow Hashimoto, Lee and Naumov’s model to improve classification and reduce dependency on annotation. One would have been motivated to make such a combination in order to improve relationship identification, as suggested by Sawarkar (US 20240193191 A1) (0002)
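Examiner’s note (illustrative only; not relied upon for the rejection): the claimed conversion of a row of a data table to an entry in a hierarchical representation of key-value pairs could, for example, proceed as in the Python sketch below; the example column names and values are hypothetical.

def table_to_hierarchy(header, rows):
    # header: list of column names; rows: list of rows, each a list of values
    hierarchy = {"table": []}
    for i, row in enumerate(rows):
        # each row becomes one entry whose keys represent the row's plurality of values
        entry = {key: value for key, value in zip(header, row)}
        hierarchy["table"].append({"row_id": i, "fields": entry})
    return hierarchy

example = table_to_hierarchy(["name", "dose"], [["drug_a", "10mg"], ["drug_b", "25mg"]])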
Regarding claim 6, Hashimoto, Lee and Naumov teach the method of claim 1.
Hashimoto, Lee and Naumov do not teach wherein the structured text comprises flat text, and wherein the flat text is converted to hierarchical text that includes a single branch of tokens.
Sawarkar teaches wherein the structured text comprises flat text, (See e.g. [0108], a plurality of rows of at least one structured table (e.g., table 400, 601) into structures of full sentences with assigned rows and unique tokens. Multidimensional inferential system maps (1202) each of the at least one structured table and associated hierarchical structure using an agglomerative clustering technique) (Examiner’s notes: Per the applicant’s specification (05/04/2023) definition of “flat text” at [0076], “A flat text file may be interpreted as a hierarchical text file with a single branch of all keys.”, the “at least one structured table” is being treated as a “hierarchical text file”, since XML may also be considered to have a hierarchical structure per the specification at [0079], “With reference to FIG. 8, routine 800 begins at operation 802, where structured text is received (e.g., structured text 602). Structured text may be hierarchical, such as XML or JSON.” Further, the table being a single table, all keys would be present in a single branch.)
and wherein the flat text is converted to hierarchical text that includes a single branch of tokens. (See e.g. [0108], a plurality of rows of at least one structured table (e.g., table 400, 601) into structures of full sentences with assigned rows and unique tokens [a single branch of tokens.]… Multidimensional inferential system generates (1205) inferential natural language test in a sequence format based on the identified relationships. Multidimensional inferential system outputs (1206) the generated inferential natural language text. [converted to hierarchical text])
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Sawarkar’s hierarchical text which would allow Hashimoto, Lee and Naumov’s model to improve classification and reduce dependency on annotation. One would have been motivated to make such a combination in order to improve relationship identification, as suggested by Sawarkar (US 20240193191 A1) (0002)
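Examiner’s note (illustrative only; not relied upon for the rejection): consistent with the applicant’s statement at [0076] that a flat text file may be interpreted as a hierarchical text file with a single branch of all keys, such a conversion could be sketched in Python as below; the chosen data structure is an assumption for illustration.

def flat_to_hierarchical(flat_tokens):
    # wrap all tokens of the flat text under a single root branch
    return {"root": [{"key": i, "token": tok} for i, tok in enumerate(flat_tokens)]}

single_branch = flat_to_hierarchical(["The", "status", "code", "is", "200"])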
Regarding claim 7, Hashimoto, Lee and Naumov teach the method of claim 1.
Hashimoto, Lee and Naumov do not teach wherein the structured text comprises hierarchical text.
Sawarkar teaches wherein the structured text comprises hierarchical text. (See e.g. [0108], Multidimensional inferential system maps (1202) each of the at least one structured table and associated hierarchical structure using an agglomerative clustering technique… Multidimensional inferential system outputs (1206) the generated inferential natural language text.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Sawarkar’s hierarchical text which would allow Hashimoto, Lee and Naumov’s model to improve classification and reduce dependency on annotation. One would have been motivated to make such a combination in order to improve relationship identification, as suggested by Sawarkar (US 20240193191 A1) (0002)
Hashimoto, Naumov and Sawarkar do not teach wherein the structure-aware attention weight is computed based on attention weights of tokens along branches from the root of the hierarchical text to the pair of the plurality of tokens.
Lee teaches wherein the structure-aware attention weight is computed based on attention weights of tokens along branches from the root of the hierarchical text to the pair of the plurality of tokens. (See e.g. [0002], The present technology concerns systems and methods for providing a structure-aware sequence model that can interpret a document's text without first inferring the proper reading order of the document [from the root of the hierarchical text]…In some aspects of the technology, the model uses a graph convolutional network (“GCN”) to generate contextualized “supertoken” embeddings for each token, and feeds them to a transformer that employs a sparse attention paradigm in which attention weights for at least some supertokens are modified based on differences between predicted and actual values of the order and distance between the attender and attendee [pair of the plurality of tokens] supertokens…Through the incorporation of GCN-generated supertokens, the structure-aware sequence models of the present technology can explicitly preserve local syntactic information that may otherwise be missed in the local attention calculations (e.g., for “long-long” pairings in ETC and BigBird) for a sequence that has not been properly serialized.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Lee’s structure-aware attention weights which would allow Hashimoto, Naumov and Sawarkar’s model to enhance semantic understanding and dimensionality reduction. One would have been motivated to make such a combination in order to improve language parsing and understanding across different types of documents, as suggested by Lee (US 20240354504 A1) (0001)
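Examiner’s note (illustrative only; not relied upon for the rejection): one possible reading of computing a structure-aware attention weight from the attention weights of tokens along the branches from the root of the hierarchical text to the pair of tokens is an aggregation over the two root-to-token paths, sketched in Python below; the averaging step is purely the examiner’s assumption and is not asserted to be the applicant’s or Lee’s actual computation.

import numpy as np

def structure_aware_weight(attn, path_i, path_j):
    # attn: (N, N) self-attention weights; path_i, path_j: token indices on the
    # branches from the root of the hierarchy down to tokens i and j, respectively
    pair_weights = [attn[a, b] for a in path_i for b in path_j]
    return float(np.mean(pair_weights))  # assumed aggregation over the two branches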
Regarding claim 10, Hashimoto, Lee and Naumov teach the computer-readable storage medium of claim 9.
Hashimoto, Lee and Naumov do not teach wherein the structured text comprises hierarchical text,
Sawarkar teaches wherein the structured text comprises hierarchical text. (See e.g. [0108], Multidimensional inferential system maps (1202) each of the at least one structured table and associated hierarchical structure using an agglomerative clustering technique… Multidimensional inferential system outputs (1206) the generated inferential natural language text.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Sawarkar’s hierarchical text which would allow Hashimoto, Lee and Naumov’s model to improve classification and reduce dependency on annotation. One would have been motivated to make such a combination in order to improve relationship identification, as suggested by Sawarkar (US 20240193191 A1) (0002)
Hashimoto, Naumov and Sawarkar do not teach wherein the embedding vectors were trained in part based on structured masked-language modeling.
Lee teaches wherein the embedding vectors were trained in part based on structured masked-language modeling. (See e.g. [0026], For example, the beta-skeleton vector [embedding vectors] for each token may generated by concatenating a text embedding based on the text of the token) (See e.g. [0078], The structure-aware sequence models of the present technology may be trained in any suitable way. In that regard, in some aspects of the technology, a structure-aware sequence model may be pretrained using one more sets of masked-language modeling tasks)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Lee’s structured masked-language modeling which would allow Hashimoto, Naumov and Sawarkar’s model to improve contextual word representations and model versatility. One would have been motivated to make such a combination in order to improve language parsing and understanding across different types of documents, as suggested by Lee (US 20240354504 A1) (0001)
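Examiner’s note (illustrative only; not relied upon for the rejection): structured masked-language modeling, i.e., selecting tokens to mask using structure information from the hierarchical text, could be sketched in Python as follows; the masking policy (masking the value span of randomly chosen keys) is purely an assumption for illustration.

import random

MASK = "[MASK]"

def structured_mlm_masking(entries, mask_prob=0.15, seed=0):
    # entries: list of (key, value_tokens) pairs derived from the hierarchical text
    rng = random.Random(seed)
    masked, targets = [], []
    for key, value_tokens in entries:
        if rng.random() < mask_prob:
            # mask the whole value span of this key, using the hierarchy as structure information
            masked.append((key, [MASK] * len(value_tokens)))
            targets.append((key, value_tokens))
        else:
            masked.append((key, value_tokens))
    return masked, targets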
Claims 11 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Hashimoto (US 20210397799 A1) in view of Lee (US 20240354504 A1) further in view of Naumov (US 11410015 B1) further in view of Sawarkar (US 20240193191 A1) further in view of Saleh (US 10885436 B1)
Regarding claim 11, Hashimoto, Lee, Naumov and Sawarkar teach the computer-readable storage medium of claim 10.
Hashimoto, Naumov and Sawarkar do not teach wherein structured masked-language modeling [masks tokens] based on structure information [derived from the hierarchical text.]
Lee teaches wherein structured masked-language modeling [masks tokens] based on structure information [derived from the hierarchical text.] (See e.g. [0078], The structure-aware sequence models of the present technology may be trained in any suitable way. In that regard, in some aspects of the technology, a structure-aware sequence model may be pretrained using one more sets of masked-language modeling tasks)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Lee’s structured masked-language modeling which would allow Hashimoto, Naumov and Sawarkar’s model to improve contextual word representations and model versatility. One would have been motivated to make such a combination in order to improve language parsing and understanding across different types of documents, as suggested by Lee (US 20240354504 A1) (0001)
Hashimoto, Lee and Naumov do not teach derived from the hierarchical text.
Sawarkar teaches derived from the hierarchical text. (See e.g. [0108], Multidimensional inferential system maps (1202) each of the at least one structured table and associated hierarchical structure using an agglomerative clustering technique… Multidimensional inferential system outputs (1206) the generated inferential natural language text.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Sawarkar’s hierarchical text which would allow Hashimoto, Lee and Naumov’s model to improve classification and reduce dependency on annotation. One would have been motivated to make such a combination in order to improve relationship identification, as suggested by Sawarkar (US 20240193191 A1) (0002)
Hashimoto, Lee, Naumov and Sawarkar do not teach masks tokens
Saleh teaches masks tokens (See e.g. [C4:L60 – 62], The system can generate the masked first text document by replacing the one or more selected segments in the unlabeled first text document with first mask tokens.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov, Sawarkar and Saleh before them, to include Saleh’s mask tokens and offset encoding which would allow Hashimoto, Lee, Naumov and Sawarkar’s model to improve contextual understanding and performance. One would have been motivated to make such a combination in order to improve the effectiveness of training the model, as suggested by Saleh (US 10885436 B1) (C3:L21 – 41)
Regarding claim 12, Hashimoto, Lee, Naumov and Sawarkar teach the computer-readable storage medium of claim 9.
Hashimoto, Lee, and Sawarkar do not teach wherein location vectors [encode a series of offsets from tokens in the hierarchy.]
Naumov teaches wherein location vectors (See e.g. [C9:L65 – 4], The mathematical formalism of attention may be exemplarily parameterized by a set of fixed-sized context vectors, c.sub.i=(c.sub.1, c.sub.2, . . . , c.sub.L), where L may be dependent upon the input sequence, such as the number of input tokens. Each c.sub.i may be localized to a certain spatial, temporal, locational position [location vectors] of the input.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Naumov’s location vectors which would allow Hashimoto, Lee and Sawarkar’s model to have improved performance and interpretability. One would have been motivated to make such a combination in order to improve translations and correlation of tokens in training, as suggested by Naumov (US 11410015 B1) (C8:L38 – 44)
Hashimoto, Lee and Naumov do not teach tokens in the hierarchy
Sawarkar teaches tokens in the hierarchy (See e.g. [0080], Textification module 303 of natural language processing generation module 302 may generate a unique token for each row of the multidimensional table) (See e.g. [0081], Textification module 303 may generate a vector hierarchy based in part on the generated agglomerative structure.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov and Sawarkar before them, to include Sawarkar’s hierarchical text which would allow Hashimoto, Lee and Naumov’s model to improve classification and reduce dependency on annotation. One would have been motivated to make such a combination in order to improve relationship identification, as suggested by Sawarkar (US 20240193191 A1) (0002)
Hashimoto, Lee, Naumov and Sawarkar do not teach encode a series of offsets
Saleh teaches encode a series of offsets (See e.g. [C7:L22 – 32], Specifically, the system can shift the data representing the selected sentences right by one decoder input order position, e.g., by introducing a one position offset, so that the decoder network cannot “see” the actual content that it is currently predicting. The system then processes (i) the right shifted data representing the selected segments and (ii) the already generated encoder network output using the decoder network and in accordance with current values of the plurality of decoder network parameters to generate a decoder network output that specifies a decoder prediction of the one or more selected segments.)
Accordingly, it would have been obvious to a person having ordinary skill in the art
before the effective filing date of the claimed invention, having the teaching of Hashimoto, Lee, Naumov, Sawarkar and Saleh before them, to include Saleh’s mask tokens and offset encoding which would allow Hashimoto, Lee, Naumov and Sawarkar’s model to improve contextual understanding and performance. One would have been motivated to make such a combination in order to improve the effectiveness of training the model, as suggested by Saleh (US 10885436 B1) (C3:L21 – 41)
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KYLE ALLMAN THOMPSON whose telephone number is (571)272-3671. The examiner can normally be reached Monday - Thursday, 6 a.m. - 3 p.m. ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.A.T./Examiner, Art Unit 2125
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125