Prosecution Insights
Last updated: April 19, 2026
Application No. 18/301,514

GENERATING A QUESTION ANSWERING SYSTEM FOR FLOWCHARTS

Non-Final OA: §101, §102, §103, §112
Filed: Apr 17, 2023
Examiner: LU, HWEI-MIN
Art Unit: 2142
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 1 (Non-Final)

Grant Probability: 62% (Moderate)
Predicted OA Rounds: 1-2
Time to Grant: 3y 1m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 62% (134 granted / 217 resolved; +6.8% vs TC avg)
Interview Lift: +39.5% (strong), comparing allowance among resolved cases with vs. without an examiner interview
Avg Prosecution: 3y 1m (typical timeline); 37 applications currently pending
Total Applications: 254 across all art units (career history)

Statute-Specific Performance

§101: 11.2% (-28.8% vs TC avg)
§103: 43.8% (+3.8% vs TC avg)
§102: 9.4% (-30.6% vs TC avg)
§112: 33.0% (-7.0% vs TC avg)
Tech Center averages are estimates • Based on career data from 217 resolved cases
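As a quick sanity check on the figures above, the short Python sketch below reproduces the arithmetic behind the two headline numbers. The granted/resolved counts and the 99% with-interview figure come from this page; the without-interview allowance rate is an assumed placeholder (it is not reported above) chosen only to illustrate how a +39.5% lift would be computed.

    # Sketch of the arithmetic behind the examiner statistics above.
    # granted/resolved and the with-interview rate are taken from this page;
    # the without-interview rate is a hypothetical placeholder, since the
    # dashboard does not report it directly.
    granted, resolved = 134, 217
    career_allow_rate = granted / resolved        # ~0.618, displayed as 62%

    with_interview_rate = 0.99                    # "Grant Probability with Interview: 99%"
    without_interview_rate = 0.595                # assumed for illustration only
    interview_lift = with_interview_rate - without_interview_rate  # ~0.395 -> +39.5%

    print(f"Career allow rate: {career_allow_rate:.1%}")
    print(f"Interview lift:    {interview_lift:+.1%}")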

Office Action

§101 §102 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This Office action is responsive to communication(s): original application filed on 04/17/2023. Claims 1-20 are pending. Claims 1 and 18-19 are independent.

Drawings

The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference sign(s) mentioned in the description: (1) 340 in ¶ [0059]; (2) 670 in ¶ [0069]; and (3) 680 in ¶ [0069]. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Specification

The disclosure is objected to because it contains an embedded hyperlink and/or other form of browser-executable code (e.g., in ¶ [0028]). Applicant is required to delete the embedded hyperlink and/or other form of browser-executable code; references to websites should be limited to the top-level domain name without any prefix such as http:// or other browser-executable code. See MPEP § 608.01.

The use of the terms "Bluetooth" in ¶ [0045] and "Wi-Fi" in ¶¶ [0046]-[0047], which are trade names or marks used in commerce, has been noted in this application. Each term should be accompanied by the generic terminology; furthermore, each term should be capitalized wherever it appears or, where appropriate, include a proper symbol indicating use in commerce such as ™, SM, or ® following the term. Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) is permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.

The disclosure is objected to because of the following informalities: in ¶ [0070], "... generated a benchmark dataset of 5,964,647 questions and 992,057 images for training, 610,309 questions and 99,284 images for validation and 585,179 questions and 99,139 images." appears to be "... generated a benchmark dataset of 5,964,647 questions and 992,057 images for training, 610,309 questions and 99,284 images for validation and 585,179 questions and 99,139 images for testing.". Appropriate correction is required.

Claim Objections

Claims 3-5, 7, 9-10, 13-16, 18, and 20 are objected to because of the following informalities: in Claim 3, lines 1-2, "... wherein question-answer pairs for each of the graph-like charts include ..." appears to be "... wherein the plurality of question-answer pairs for each of the graph-like chart images include ..."
according to Claim 1; in Claim 4, lines 1-2, "… wherein the plurality of question-answer pairs for each of the plurality of graph-like charts include …" appears to be "… wherein the plurality of question-answer pairs for each of the graph-like chart images include …" according to Claim 1; in Claim 5, lines 1-2, "… wherein the plurality of question-answer pairs for each of the plurality of graph-like charts include …" appears to be "… wherein the plurality of question-answer pairs for each of the graph-like chart images include …" according to Claim 1; in Claim 7, lines 1-2, "… wherein the training of the vision-language architecture on the synthetic dataset to answer questions about the graph-like chart images comprises …" appears to be "… wherein the training of the vision-language architecture on the synthetic dataset to answer the questions about the graph-like chart images comprises …"; in Claim 9, line 1, "… wherein the rendering of the plurality of graph-like charts from a plurality of associated input files comprises …" appears to be "… wherein the rendering of the graph-like chart images from the plurality of associated graph data comprises …" according to Claim 1 (see also 112(b) rejection to Claim 1); in Claim 10, line 2, "… receiving, from an end user via a user interface, a question about the graph-like charts …" appears to be "… receiving, from an end user via a user interface, a question about the graph-like chart images …" (see also 112(b) rejection to Claim 2); in Claim 13, lines 1-2, "… wherein generating a synthetic dataset of graph-like chart images further comprises …" appears to be "… wherein generating the synthetic dataset of the graph-like chart images further comprises …" according to Claim 1; in Claim 14, lines 1-2, "… wherein generating the plurality of question-answer pairs for each of the plurality of graph-like charts comprises …" appears to be "… wherein generating the plurality of question-answer pairs for each of the graph-like chart images comprises …" according to Claim 1; in Claim 15, lines 1-2, "… wherein the generating of the synthetic dataset of graph-like chart images comprise …" appears to be "… wherein the generating of the synthetic dataset of the graph-like chart images comprise …" according to Claim 1; in Claim 15, lines 3-8, "… receiving a real world graph-like chart dataset, wherein the real world graph-like chart dataset comprises textual labels having a semantic distribution … generating, using a pretrained language model, a plurality of labels matching the semantic distribution of provided labels …" appears to be "… receiving a real world graph-like chart dataset, wherein the real world graph-like chart dataset comprises textual labels having a semantic distribution … generating, using a pretrained language model, a plurality of labels matching the semantic distribution of the textual labels …"; in Claim 15, lines 11-13, "… rendering the plurality of graph-like chart images … filtering of the graph-like chart images …" appears to be "… rendering the graph-like chart images … filtering of the graph-like chart images …" (see 112(b) rejection to Claim 1); in Claim 16, lines 1-2, "… wherein the training of the vision-language architecture on the synthetic dataset to answer questions about the graph-like charts comprises …" appears to be "… wherein the training of the vision-language architecture on the synthetic dataset to answer the questions about the graph-like chart images comprises …" according to Claim 1; in Claim 18, lines 8-9, "… 
generating a plurality of question-answer pairs for … wherein question-answer pairs for …" appears to be "… generating a plurality of question-answer pairs for … wherein the plurality of question-answer pairs for …"; in Claim 18, lines 16-20, "… train …to answer questions about the graph-like chart images … wherein the training of … to answer questions about the graph-like chart images comprises …" appears to be "… train …to answer questions about the graph-like chart images … wherein the training of … to answer the questions about the graph-like chart images comprises …"; in Claim 20, line 2-3, "… adjust vision-language machine learning model …" appears to be "… adjust the vision-language machine learning model …" according to Claim 1. Appropriate correction is required. Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: "a synthetic dataset generation module" in Claim 19 and "an adaptation module" in Claim 20. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 112 The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention. Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 1 recites the limitation "... 
generating a synthetic dataset of graph-like chart images, the generating comprising: rendering a plurality of graph-like chart images from … generating a plurality of question-answer pairs for each of the graph-like chart images … answer questions about the graph-like chart images" in lines 2-, which renders the claim indefinite because (1) it is unclear whether the first two instances of "graph-like chart images" are the same or different; and (2) if they are different, it is unclear which instance is referred to by the later recitations of "the graph-like chart images". Clarification is required.

Claim 1 recites the limitation "... rendering a plurality of graph-like chart images from a plurality of associated graph data; generating a plurality of question-answer pairs for each of the graph-like chart images; and calculating a plurality of ground truth annotations for each of the plurality of question-answer pairs and associated graph-like chart images from the plurality of associated graph data ..." in lines 3-9, which renders the claim indefinite because (1) it is unclear whether "associated graph-like chart images" is associated with "each of the plurality of question-answer pairs" or with "the plurality of associated graph data"; (2) if it is associated with "each of the plurality of question-answer pairs", it is unclear how multiple "graph-like chart images" can be associated with "each of the plurality of question-answer pairs" when "a plurality of question-answer pairs" is generated for "each of the graph-like chart images"; and (3) if it is associated with "the plurality of associated graph data", it is unclear whether "associated graph-like chart images" and "a plurality of graph-like chart images" are the same or different. Clarification is required.

Claims 2-17 are rejected for fully incorporating the deficiency of their respective base claims.

Claim 2 recites the limitation "the graph-like charts" in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, "the graph-like chart images" is considered. Claims 3-17 are rejected for fully incorporating the deficiency of their respective base claims.

Claim 4 recites the limitation "the associated graph-like chart" in lines 2-3. There is insufficient antecedent basis for this limitation in the claim. Clarification is required.

Claim 5 recites the limitation "the associated graph-like chart" in lines 2-3. There is insufficient antecedent basis for this limitation in the claim. Clarification is required.

Claim 11 recites the limitation "... wherein the graph data comprises ..." in line 1, which renders the claim indefinite because "rendering a plurality of graph-like chart images from a plurality of associated graph data" is recited in its base claim and it is unclear whether "the graph data" recited here is the same as or different from "the plurality of associated graph data" recited in its base claim. Clarification is required.

Claim 11 recites the limitation "the graph-like chart" in line 2. There is insufficient antecedent basis for this limitation in the claim. Clarification is required.

Claim 12 recites the limitation "... wherein each set of questions and answers comprises ..." in line 1, which renders the claim indefinite because "generating a plurality of question-answer pairs for each of the graph-like chart images" is recited in its base claim and it is unclear whether "each set of questions and answers" refers to the "plurality of question-answer pairs" recited in its base claim. Clarification is required.

Claim 13 is rejected for fully incorporating the deficiency of its base claim. Claim 13 recites the limitation "... balancing the set of questions to remove trivial question and answer pairs" in lines 2-3, which renders the claim indefinite because "... wherein each set of questions and answers comprises a set of possible answers and one correct answer" is also recited in its base claim (i.e., indicating "each of multiple sets of questions and answers") and it is unclear which "set of questions" is referred to by "the set of questions". Clarification is required.

Claim 14 recites the limitation "the graph-like chart" in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. Clarification is required.

Claim 14 recites the limitation "... generating one or more topological questions … producing one or more geometrical questions … producing answers for the one or more questions using ..." in lines 3-7, which renders the claim indefinite because "... answer questions about the graph-like chart images" is also recited in its base claim and it is unclear whether "the one or more questions" refers to the "one or more topological questions" recited here, the "one or more geometrical questions" recited here, or the "questions" recited in its base claim. Clarification is required.
Claim 17 recites the limitation "… augmenting the real world graph-like chart dataset with synthetic data …" in lines 1-2, which renders the claim indefinite because "… wherein the generating of the synthetic dataset of graph-like chart images comprises …" is also recited in its base claim and it is unclear whether "synthetic data" recited here is the same as or different from "synthetic dataset" recited in its base claim. Clarification is required.

Claim 18 recites the limitation "... generate a synthetic dataset of flowcharts images, the generating comprising: rendering a plurality of flowcharts images from … generating a plurality of question-answer pairs for each of the flowchart chart images, wherein question-answer pairs for each of the graph-like charts include … train … to answer questions about the graph-like chart images … wherein the training of … to answer questions about the graph-like chart images comprises generating a representation of the graph-like chart images using …" in lines 5-21, which renders the claim indefinite because (1) it is unclear whether the first two instances of "flowcharts images" are the same or different; (2) if they are different, it is unclear which instance of "flowcharts images" is referred to by "the flowchart chart images" of the third instance; and (3) there is insufficient antecedent basis for "the graph-like charts" and "the graph-like chart images" in the claim. For examination purposes, "... generate a synthetic dataset of flowcharts images, the generating comprising: rendering the flowcharts images from … generating a plurality of question-answer pairs for each of the flowcharts images, wherein the plurality of question-answer pairs for each of the flowcharts images include … train … to answer questions about the flowcharts images … wherein the training of … to answer the questions about the flowcharts images comprises generating a representation of the flowcharts images using …" is considered (see also claim objections to Claim 18).

Claim 18 recites the limitation "... rendering a plurality of flowcharts images from a plurality of associated graph data; generating a plurality of question-answer pairs for each of the flowchart chart images … calculating a plurality of ground truth annotations for each of the plurality of question-answer pairs and associated flowcharts images from the plurality of associated graph data ..." in lines 6-15, which renders the claim indefinite because (1) it is unclear whether "associated flowcharts images" is associated with "each of the plurality of question-answer pairs" or with "the plurality of associated graph data"; (2) if it is associated with "each of the plurality of question-answer pairs", it is unclear how multiple "flowcharts images" can be associated with "each of the plurality of question-answer pairs" when "a plurality of question-answer pairs" is generated for "each of the flowcharts images"; and (3) if it is associated with "the plurality of associated graph data", it is unclear whether "associated flowcharts images" and "a plurality of flowcharts images" are the same or different. Clarification is required.

Claim 19 recites the limitation "the synthetic dataset" in line 6. There is insufficient antecedent basis for this limitation in the claim.
Clarification is required.

Claim 19 recites the limitation "… providing answers to questions posed about flowcharts, wherein the flowcharts … generate a plurality of synthetic flowchart images and a plurality of questions … a vision-language machine learning model trained … to answer questions about input flowcharts" in lines 1-7, which renders the claim indefinite because it is unclear (1) whether the two instances of "questions" and the one instance of "a plurality of questions" are the same or different and (2) whether the instance of "input flowcharts" is the same as or different from the two instances of "flowcharts" recited earlier. Clarification is required.

Claim 20 is rejected for fully incorporating the deficiency of its base claim. Claim 20 recites the limitation "... answer questions about similar flowcharts" in lines 2-3, which renders the claim indefinite because "… providing answers to questions posed about flowcharts, wherein the flowcharts … generate a plurality of synthetic flowchart images and a plurality of questions … a vision-language machine learning model trained … to answer questions about input flowcharts" is also recited in its base claim and it is unclear (1) whether "questions" recited here is the same as or different from "questions" and "a plurality of questions" recited in its base claim; and (2) whether "flowcharts" recited here is the same as, different from, or related to "flowcharts" and "input flowcharts" recited in its base claim. Clarification is required.

The claim limitations “a synthetic dataset generation module” in Claim 19 and “an adaptation module” in Claim 20 invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. Although the "synthetic dataset generation module" is recited in ¶ [0008] and the "unsupervised training data generation module 310" is recited in ¶ [0057] for performing the same function, no association between the structure and the function can be found in the specification. Also, the specification is silent on the "adaptation module", and no association between a structure and the similar function cited in ¶ [0059] can be found in the specification. Therefore, the claims are indefinite and are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.

Applicant may: (a) Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph; (b) Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or (c) Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: (a) Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or (b) Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 19-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. Step 1: The claim(s) does/do not fall within at least one of the four categories of patent eligible subject matter because "A system … comprising: a synthetic dataset generation module … and a vision-language machine learning model …" is recited in Claim 19 and "The system of claim 19, further comprising an adaptation module …" is recited in Claim 20, and there is no structure associated with "a synthetic dataset generation module" and "an adaptation module" described in the specification; therefore, "a synthetic dataset generation module", "a vision-language machine learning model", and "an adaptation module" can be considered software per se for performing the corresponding functions, which do not fall within at least one of the four categories of patent eligible subject matter.

Claims 1-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Independent Claim 1 Step 1: Claim 1 is a process claim which falls within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) recite(s) "generating a synthetic dataset of graph-like chart images" (i.e., flow charts can be generated in the human mind with pen and paper as images), "generating a plurality of question-answer pairs for each of the graph-like chart images", and "calculating a plurality of ground truth annotations for each of the plurality of question-answer pairs and associated graph-like chart images from the plurality of associated graph data" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms.
Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) recite(s) additional elements/limitations of "a Question Answering (QA) system", "rendering a plurality of graph-like chart images from a plurality of associated graph data", and "training a vision-language architecture on the synthetic dataset to answer questions about the graph-like chart images" which only amount to "apply it" with the use of generic computer components or insignificant extra solution activities. None of the additional elements/limitations, taken alone or in combination, integrate the abstract idea into a practical application. Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because (a) the additional limitation/element of "rendering a plurality of graph-like chart images from a plurality of associated graph data" is well-understood, routine and conventional (WURC) activity similar to "presenting offers and gathering statistics" (see MPEP 2106.05(d), "Presenting offers and gathering statistics, OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93"); and (b) the additional limitation/element of "training a vision-language architecture on the synthetic dataset to answer questions about the graph-like chart images" is also well-understood, routine and conventional (WURC) activity similar to "performing repetitive calculation" (see MPEP 2106.05(d), "Performing repetitive calculations, Flook, 437 U.S. at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values)"). Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 2 Step 1: Claim 2 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "generating a synthetic dataset of graph-like chart images, wherein the graph-like charts are flowcharts" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper"). Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 3 Step 1: Claim 3 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "generating a plurality of question-answer pairs for each of the graph-like chart images, wherein question-answer pairs for each of the graph-like charts include topological questions about an associated underlying graph" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper"). Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. 
Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 4 Step 1: Claim 4 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "generating a plurality of question-answer pairs for each of the graph-like chart images, wherein the plurality of question-answer pairs for each of the plurality of graph-like charts include geometric questions about spatial relations in the associated graph-like chart" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper"). Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 5 Step 1: Claim 5 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "generating a plurality of question-answer pairs for each of the graph-like chart images, wherein the plurality of question-answer pairs for each of the plurality of graph-like charts include semantic questions about a content of an element in the associated graph-like chart" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper"). Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 6 Step 1: Claim 6 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) do/does not further recite(s) elements/limitations which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) further recite(s) additional element/limitation of "training a vision-language architecture … wherein the vision-language architecture comprises a Bidirectional Encoder Representations from Transformers (BERT) model and a Vision Transformer (ViT)" which only amount to "apply it" with the use of generic computer components or insignificant extra solution activity. None of the additional elements/limitations, taken alone or in combination, integrate the abstract idea into a practical application. 
Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the additional limitation/element of "training a vision-language architecture … … wherein the vision-language architecture comprises a Bidirectional Encoder Representations from Transformers (BERT) model and a Vision Transformer (ViT)" is also well-understood, routine and conventional (WURC) activity similar to "performing repetitive calculation" (see MPEP 2106.05(d), "Performing repetitive calculations, Flook, 437 U.S. at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values)"). Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 7 Step 1: Claim 7 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) do/does not further recite(s) elements/limitations which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) further recite(s) additional element/limitation of "generating a representation of the graph-like chart images using the ViT" which only amount to "apply it" with the use of generic computer components or insignificant extra solution activity. None of the additional elements/limitations, taken alone or in combination, integrate the abstract idea into a practical application. Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the additional limitation/element of "generating a representation of the graph-like chart images using the ViT" is also well-understood, routine and conventional (WURC) activity similar to "performing repetitive calculation" (see MPEP 2106.05(d), "Performing repetitive calculations, Flook, 437 U.S. at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values)"). Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 8 Step 1: Claim 8 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "generating edge annotations using heat maps" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 9 Step 1: Claim 9 is a process claim which fall within at least one of the four categories of patent eligible subject matter. 
Step 2A Prong 1: The claim(s) further recite(s) "generating one or more bounding box annotations for each of the random graph-like charts" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper"). Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) further recite(s) additional element/limitation of "rendering a plurality of images of random graph-like charts" which only amount to "apply it" with the use of generic computer components or insignificant extra solution activity. None of the additional elements/limitations, taken alone or in combination, integrate the abstract idea into a practical application. Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the additional limitation/element of "rendering a plurality of images of random graph-like charts" is well-understood, routine and conventional (WURC) activity similar to "presenting offers and gathering statistics" (see MPEP 2106.05(d), "Presenting offers and gathering statistics, OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93"). Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 10 Step 1: Claim 10 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "generating an answer to the question" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper"). Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) further recite(s) additional elements/limitations of "receiving, from an end user via a user interface, a question about the graph-like charts", "trained vision-language architecture", and "presenting the generated answer to the end user via the user interface" which only amount to "apply it" with the use of generic computer components or insignificant extra solution activities. None of the additional elements/limitations, taken alone or in combination, integrate the abstract idea into a practical application. Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because (a) the additional limitation/element of "receiving, from an end user via a user interface, a question about the graph-like charts" is also well-understood, routine and conventional (WURC) activity similar to "receiving or transmitting data over a network" (see MPEP 2106.05(d), "Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)"); (b) the additional limitation/element of "trained vision-language architecture" is also well-understood, routine and conventional (WURC) activity similar to "performing repetitive calculation" (see MPEP 2106.05(d), "Performing repetitive calculations, Flook, 437 U.S. 
at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values)"); and (c) the additional limitation/element of "presenting the generated answer to the end user via the user interface" is also well-understood, routine and conventional (WURC) activity similar to "presenting offers and gathering statistics" (see MPEP 2106.05(d), "Presenting offers and gathering statistics, OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93"). Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 11 Step 1: Claim 11 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) do/does not further recite(s) elements/limitations which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) further recite(s) additional elements/limitations of "rendering a plurality of graph-like chart images from a plurality of associated graph data, wherein the graph data comprises nodes, edges, labels, and style settings for the graph-like chart" which only amount to "apply it" with the use of generic computer components or insignificant extra solution activity. None of the additional elements/limitations, taken alone or in combination, integrate the abstract idea into a practical application. Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the additional limitation/element of "rendering a plurality of graph-like chart images from a plurality of associated graph data, wherein the graph data comprises nodes, edges, labels, and style settings for the graph-like chart" is also well-understood, routine and conventional (WURC) activity similar to "presenting offers and gathering statistics" (see MPEP 2106.05(d), "Presenting offers and gathering statistics, OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93"). Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 12 Step 1: Claim 12 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "generating a plurality of question-answer pairs for each of the graph-like chart images, wherein each set of questions and answers comprises a set of possible answers and one correct answer" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper"). Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 13 Step 1: Claim 13 is a process claim which fall within at least one of the four categories of patent eligible subject matter. 
Step 2A Prong 1: The claim(s) further recite(s) "balancing the set of questions to remove trivial question and answer pairs" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper"). Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 14 Step 1: Claim 14 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "generating one or more topological questions pertaining to a graph structure of the graph-like chart by value assignment in a predefined structure template", "producing one or more geometrical questions pertaining to a graphical rendering of the graph-like chart by value assignment in a predefined graphical template", "producing answers for the one or more questions using ground truth data for the graph-like chart by analyzing underlying graph and spatial locations using a graphing algorithm" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 15 Step 1: Claim 15 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "computing statistics of the real-world graph-like chart dataset, including a distribution of nodes and edges characteristics and a distribution of graphical styles", " generating a plurality of labels matching the semantic distribution of provided labels", and " generating graph data matching the computed distribution of nodes and edge characteristics and the computed distribution of graphical styles" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) further recite(s) additional elements/limitations of "receiving a real world graph-like chart dataset, wherein the real world graph-like chart dataset comprises textual labels having a semantic distribution", "a pretrained language model", and "rendering the plurality of graph-like chart images and the question-answer pairs using the graph data" which only amount to "apply it" with the use of generic computer components or insignificant extra solution activities. None of the additional elements/limitations, taken alone or in combination, integrate the abstract idea into a practical application. 
Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because (a) the additional limitation/element of "receiving a real world graph-like chart dataset, wherein the real world graph-like chart dataset comprises textual labels having a semantic distribution" is also well-understood, routine and conventional (WURC) activity similar to "receiving or transmitting data over a network" (see MPEP 2106.05(d), "Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information); buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network)"); (b) the additional limitation/element of "a pretrained language model" is also well-understood, routine and conventional (WURC) activity similar to "performing repetitive calculation" (see MPEP 2106.05(d), "Performing repetitive calculations, Flook, 437 U.S. at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values)"); and (c) the additional limitation/element of "rendering the plurality of graph-like chart images and the question-answer pairs using the graph data" is also well-understood, routine and conventional (WURC) activity similar to "presenting offers and gathering statistics" (see MPEP 2106.05(d), "Presenting offers and gathering statistics, OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93"). Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 16 Step 1: Claim 16 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) "iteratively adapting the vision-language architecture using the synthetic dataset and adapting the synthetic dataset using the current vision-language architecture and the real world graph-like chart data" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim 17 Step 1: Claim 17 is a process claim which fall within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) further recite(s) " augmenting the real world graph-like chart dataset with synthetic data" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) does/do not further recite(s) additional elements/limitations. Step 2B: The claim(s) does/do not further include additional elements that are sufficient to amount to significantly more than the judicial exception. 
Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea.

Claim 18 Step 1: Claim 18 recites "a computer program product … comprising a computer readable storage medium …", wherein "a computer readable storage medium" is not to be construed as storage in the form of transitory signals per se, as described in ¶ [0035] of the specification. Therefore, Claim 18 is a claim for a non-transitory computer readable storage medium, which falls within at least one of the four categories of patent eligible subject matter. Step 2A Prong 1: The claim(s) recite(s) "generate a synthetic dataset of flowcharts images" (i.e., flow charts can be generated in the human mind with pen and paper as images), "generating a plurality of question-answer pairs for each of the flowchart chart images, wherein question-answer pairs for each of the graph-like charts include topological questions about an associated flowchart, geometric questions about spatial relations in the associated flowchart, and semantic questions about a content of an element in the associated flowchart", and "calculating a plurality of ground truth annotations for each of the plurality of question-answer pairs and associated flowcharts images from the plurality of associated graph data" which can be reasonably considered as mental processes (i.e., which "can be performed in the human mind, or by a human using a pen and paper") or mathematical concepts/calculations/algorithms. Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim(s) recite(s) additional elements/limitations of "a Question Answering (QA) system", "a computer readable storage medium", "a processor", "rendering a plurality of flowcharts images from a plurality of associated graph data", "train a vision-language architecture on the synthetic dataset to answer questions about the graph-like chart images, wherein the vision-language architecture comprises a Bidirectional Encoder Representations from Transformers (BERT) model and a Vision Transformer (ViT)", and "generating a representation of the graph-like chart images using the ViT" which only amount to "apply it" with the use of generic computer components or insignificant extra solution activities. None of the additional elements/limitations, taken alone or in combination, integrate the abstract idea into a practical application.
Step 2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception because (a) the additional limitation/element of "rendering a plurality of flowcharts images from a plurality of associated graph data" is well-understood, routine and conventional (WURC) activity similar to "presenting offers and gathering statistics" (see MPEP 2106.05(d), "Presenting offers and gathering statistics, OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93"); and (b) the additional limitations/elements of "train a vision-language architecture on the synthetic dataset to answer questions about the graph-like chart images, wherein the vision-language architecture comprises a Bidirectional Encoder Representations from Transformers (BERT) model and a Vision Transformer (ViT)", and "generating a representation of the graph-like chart images using the ViT " are also well-understood, routine and conventional (WURC) activity similar to "performing repetitive calculation" (see MPEP 2106.05(d), "Performing repetitive calculations, Flook, 437 U.S. at 594, 198 USPQ2d at 199 (recomputing or readjusting alarm limit values)"). Thus, none of the additional limitations, taken either alone or combined, amount to significantly more than the abstract idea. Claim Rejections - 35 USC § 102 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. Claim 1 is rejected under 35 U.S.C. 102(a)(1) as being anticipated by Masry et al. ("ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning", ARXIV ID: 2203.10244, Mar. 19, 2022, pp. 1-17), hereinafter Masry. Independent Claim 1 Masry discloses a method of implementing a Question Answering (QA) system (Masry, 2nd paragraph of Section 1 in Page 1: the goal of a Chart Question Answering (ChartQA) system is to help users by taking a chart and a natural language question as input and predicting the answer; Section 4.1 with FIG. 2 in Page 5: the overall process of the ChartQA system is shown in Fig. 
2), comprising: generating a synthetic dataset of graph-like chart images (Masry, Section 1 in Pages 1-2: the charts are created automatically using a programming tool like Matplotlib (Singh and Shekhar, 2020); benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Chart Data Extraction" of Section 2 in pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. (2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.1 in Page 3: crawled charts from four different sources: (i) Statista (statista.com) is an online platform that presents charts covering a variety of topics including economy, politics, and industry; (ii) the Pew research (pewresearch.org) publishes report about social and economic issues, demographic trends and public opinion with a wide variety of charts; (iii) Our World In Data or OWID (ourworldindata.org) is another platform that contains thousands of charts about different global issues such as economy, finance, and society; (iv) Organisation for Economic Co-operation and Development or OECD (oecd.org) is a global organization which shares reports and data analysis for policymaking; for the Pew dataset, only crawled chart images since the underlying data tables are not available; for the other three, extracted the underlying data tables, metadata (e.g., title, chart type), SVG file and associate text description; finally, extracted the bounding boxes information of the different chart elements (e.g., x-axis labels) from the SVG files to train our data extraction models; Section 3.3 with Table 3 in Pages 4-5: dataset has three commonly used chart types: bar, line, and pie charts (Table 3); categorize the bar and line charts into simple vs complex where data tables of simple charts have only two columns where complex charts involve multiple columns (e.g., stacked or grouped bars and multi-line charts); "Model" of Section A.3 in Page 12 with FIGS. 7-8 in Page 14: extend ChartOCR (Luo et al., 2021) which relies on both deep-learning models and rule-based techniques to parse the chart image into the underlying data table; key-point detection networks, adapted from (Law and Deng, 2019), locates the chart visual marks (e.g. bars, plot area, line points); ideally, the network locates the top-left point and bottom-right points for the rectangular objects (e.g. bar, plot area); in line charts, the detection network locates the coordinates of the points connecting the line segments; in pie charts, the network locates the intersection points between the pie segments along the pie perimeter; extend their detection networks to also locate the chart textual elements (e.g. 
x-axis-label, legend-label ) as shown in Figure 7a and utilize the CRAFT model (Baek et al., 2019) to read their underlying texts; the chart scale is estimated using the y-axis-labels value for line and bar charts, Figure 7b; for pie charts, the value of each segment is estimated by calculating the angle between its borderlines; finally, the model aggregates the extracted data values (using color and proximity heuristics) to output the final raw data values.; extend their approach to extract the fully-structured data table with the textual labels (e.g. column headers); as shown in Figure 7, associate the estimated bars data values (e.g., ‘17.13’, ‘40.14’) with their closest x-axis-label (’Snapchat’); moreover, if the chart has more than one data series (dark bars or blue bars values), each data series is matched with its legend-label (e.g., ‘2016’, ‘2014’) based on the color of the legend mark and data-encoding marks (e.g., bars); if cannot match data values with legends by colors (e.g., when all legend marks have the same color or there are no legend marks), use other criteria that associate data-encoding marks with legend marks (e.g., proximity, alignment); e.g., in Figure 8b, ’More’ is matched with ’17’ and ’29’ since they are vertically aligned; similarly, for line charts if there is no explicit legend mark for a line series associate the legend labels with the points of their closest lines as shown in Figure 8a), the generating comprising: rendering a plurality of graph-like chart images from a plurality of associated graph data (Masry. 1st paragraph of Section 1 with FIG. 1 in Page 1: Figure 1 displays a line chart with complex reasoning questions about the line chart involving arithmetic and logical operations; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: FigureQA and DVQA use synthetically-generated data to plot the charts; LEAF-QA and LEAFQA++ use real-world data to plot the charts; the charts are plotted using a software in PlotQA (Methani et al., 2020); Section A.1 in Page 12 with FIG. 5 in Page 13: the data collection interface is shown in Figure 5; while presenting the chart in the user interface for annotation task, ensure that the data labels of chart elements are visible to workers so that they can accurately perform the necessary arithmetic and logical operations to provide and answer the questions successfully); generating a plurality of question-answer pairs for each of the graph-like chart images (Masry, Abstract in Page 1: present a large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries; Section 1 with FIG. 1 in Pages 1-2: Figure 1 illustrates sample questions in the benchmark; e.g., the question Q1 in Fig. 1 requires the user to compute the differences between the two lines for each year and find the year with the highest difference; Q2 in Fig. 
1 refers to the color of a mark (‘line’) and its attribute (‘peak’) in the chart; the questions are generated automatically using pre-defined templates (Kahou et al., 2017; Kafle et al., 2018; Chaudhry et al., 2020; Singh and Shekhar, 2020); present a largescale benchmark covering 9,608 human-written questions focusing on logical and visual reasoning questions; generate another 23,111 questions automatically from human-written chart summaries using a T5 model (Raffel et al., 2020) and manually validated a subset of it for quality assurance; collect a large number of questions automatically while maintaining rich variations in language as they were generated from human-written summaries; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: ChartQA differs from previous datasets in two main aspects: the questions’ types (human-authored vs. template-based) and the chart source (real-world vs. generated using a tool); a detailed comparison is shown in Table 1; earlier datasets such as FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), LEAF-QA (Chaudhry et al., 2020) and LEAF-QA++ (Singh and Shekhar, 2020) are mostly synthetic where the questions are generated using a small number of templates and the answers come from a fixed set of vocabulary (e.g. ‘yes’, ‘no’); PlotQA (Methani et al., 2020) is the only dataset with open-vocabulary questions that require applying aggregation operations on the underlying chart data; Section 3.2 with Table 2 in Pages 3-4: two main annotations procedures: (i) collect human-authored QA pairs using Amazon Mechanical Turk (AMT) and (ii) generate QA pairs from the Statista human-written summaries; to create human-authored QA pairs, designed an AMT task (see A.1 for details) in which the crowdworkers are asked to focus on two types of questions for each chart image: compositional and visual questions; compositional questions contain at least two mathematical/logical operations like sum, difference and average, while visual questions refer to the visual attributes such as color, height, and length of graphical marks (e.g., bars) in the chart; prior work on QA has performed data augmentation by either creating template-based or machine generated questions; fine-tune a pre-trained T5 model on the SQuAD QA dataset (Rajpurkar et al., 2016) and apply to the human-written chart summaries that come with the charts from Statista to automatically generate questions that are human-like with sufficient lexical and syntactic variations; applying two T5 models: one for answer extraction and the other for answer-aware question generation; for question generation, the proposed answer is first concatenated with the summary in the format: Answer: Answer Context: Chart Summary; then, generate a question from the given question using the chart summary; since the summaries are human-written, the generated questions are similar to the human-authored questions (see example questions in A.7); to filter out invalid questions, developed a simple heuristic where filter out the question if the answer cannot be found in the chart data table; randomly split both of the human-written (ChartQA-H) and machine generated (ChartQA-M) QA pairs into train, validation, and test sets as shown in Table 2; Section 3.3 with Table 4 in Pages 4-5: analyzed the basic linguistic statistics about our benchmark (see A.2) which has more unique tokens on both types of QA pair; observe that questions cover a variety of syntactic structure and sometimes exhibit informal languages and typos; the 
topic distribution in our data is quite diverse as it is constructed from four different sources; to analyze the nature of questions, randomly selected 300 QA pairs from the benchmark and categorized them into four types (Table 4); the vast majority of questions (76.33% in total) are either compositional or both visual and compositional, which reflects the real-world scenarios where people ask complex reasoning questions; people make visual references to a variety of visual attributes of marks (see A.2), most commonly to color (e.g., ‘orange line’) and length (e.g., ‘tallest bar’) followed by size (e.g., ‘largest slice’) and position (e.g., ‘leftmost bar’); Section A.2 with Tables 7-8 in Page 12 and FIG. 6 in Page 14: Table 7 shows some linguistic statistics about the benchmark; Figure 6 shows the distribution of topics in the dataset for each of the four sources; analyzed how people make visual references to charts in their questions; Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs; Section A.7 with Table 12 in Pages 16-17: sample machine-generated questions with the human-written summaries are shown in Table 12); and calculating a plurality of ground truth annotations for each of the plurality of question-answer pairs and associated graph-like chart images from the plurality of associated graph data (Masry, Section 1 with FIG. 1 in Pages 1-2: data visualizations such as bar charts and line charts have become popular in analyzing data and making informed decisions; to analyze data, often people ask complex reasoning questions about charts involving arithmetic and logical operations (Kim et al., 2020); answering such questions requires a significant amount of perceptual and cognitive efforts as people need to combine multiple operations such as retrieving values, comparing values, finding maximum, calculating sums and differences of values; the answer for a complex reasoning question is derived through various mathematical operations such as aggregation and comparison for complex reasoning questions; Figure 1 illustrates a line chart with the correct answer to each question; Section 3.2 in Pages 3-4: for each chart, the workers provide two questions with the answers; the same questions are then answered by another annotator; if both workers’ answers exactly match, consider the answer to be correct; otherwise, manually check the answers to select the final correct answer; Section 5.1 in Page 7: following Methani et al. 
(2020), use a relaxed accuracy measure for the numeric answers to allow a minor inaccuracy that may result from the automatic data extraction process; consider an answer to be correct if it is within 5% of the gold answer; for non-numeric answers, still need an exact match to consider an answer to be correct; Section A.1 in Page 12: in each HIT (Human Intelligent Task), the workers verify two previously asked questions by other workers and also provide two new QA pairs; to ensure quality, selected workers with an acceptance rate of 95% and total accomplished HITs of 5000; filtered the workers by giving them a pretest to select the best qualified workers for this task; "Evaluation Metric" of Section A.3 with Table 9 in Pages 12 and 14-15: evaluation metric is adapted from ChartOCR (Luo et al., 2021); the distance between any two data values is estimated as follows: D(gt, pr) = min(1, |gt - pr|/gt), where gt is the ground truth value and pr is the predicted value; for each chart, the cost matrix C, where C_{n,m} = D(gt_n, pr_m), is computed and the total minimum cost is calculated by solving the following linear sum assignment problem: Cost = Σ_{i=1}^{K} Σ_{j=1}^{K} C_{i,j} X_{i,j}, where K = max(N, M) and X is a binary assignment matrix; the final overall score is then estimated as follows: Overall Score = (1/L) Σ_{i=1}^{L} (1 - cost_i/K_i), where L is the total number of charts; evaluation results are shown in Table 9; Section A.7 with FIG. 10 in Pages 16-17: sample predictions from the model VisionTaPas on ChartQA test set are shown in Figure 10); and training a vision-language architecture on the synthetic dataset to answer questions about the graph-like chart images (Masry, Abstract in Page 1: to address the unique challenges in the benchmark involving visual and logical reasoning over charts, present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions; achieve the state-of-the-art results on the previous datasets as well as on the benchmark, the evaluation also reveals several challenges in answering complex reasoning questions; Section 1 in Pages 1-2: generate a large-scale ChartQA dataset with real-world charts and human-authored question-answer pairs for training; a pipeline approach that combines visual features and automatically extracted data from charts to utilize in transformer-based QA models that provide state-of-the-art results; implement an extensive analysis and evaluation of the performance of our models; Section 3.1 in Page 3: extracted the bounding boxes information of the different chart elements (e.g., x-axis labels) from the SVG files to train data extraction models; Section 3.2 in Pages 3-4: large-scale language models like T5 (Raffel et al., 2020) which are trained on very large data from various web sources can learn general linguistic properties and variations (Brown et al., 2020); the process involves training and applying two T5 models: one for answer extraction and the other for answer-aware question generation; for answer extraction, the T5 model is trained to generate possible answers separated by [SEP] token given the textual summary as input (i.e., trained on SQuAD’s passage → answer pairs); for question generation, the proposed answer is first concatenated with the summary in the format: Answer: Answer Context: Chart Summary; then, 
the T5 model is trained to generate a question from the given answer using the chart summary; this model is trained on SQuAD’s (passage, answer) → question pairs; Section 4.1 with FIG. 2 in Page 5: consider two problem settings for ChartQA; the first setting assumes that the underlying data table of the chart image is available; formally, given a dataset with N examples D = {(c_i, t_i, q_i, a_i)}_{i=1}^{N}, where ci represents a chart image, ti represents the underlying data table, qi represents a question over ci, and ai represents the answer to the question; the ChartQA models learn to predict the answer ai given ci, ti and qi; the gold data tables are not generally accessible in most real-world scenarios; thus consider the second setup where the underlying data table ti for chart image ci is extracted by adapting a state-of-the-art ChartOCR (Luo et al., 2021); ChartOCR first locates the main elements of the chart image (e.g., plot area, title) as well as data-encoding marks (e.g., bars) using key-point detection networks; it then uses the detected keypoints of each mark along with axis-labels to estimate the data value of that mark; however, it does not associate the predicted data values with corresponding text labels (e.g., x-axis-label); hence, extend their approach to output the fully-structured data tables; utilize the CRAFT (Baek et al., 2019) model to recognize the texts in the chart elements; then, associate the data values with their text labels using positional and color information (see A.3 for details); Section 4.2 with FIG. 3 in Pages 5-6: the approach to ChartQA builds on two of the state-of-the-art TableQA models: T5 (Raffel et al., 2020; Nan et al., 2021) and TaPas (Herzig et al., 2020); the input to these models consists of the question qi and the data table ti; different from TableQA, ChartQA often involves extracting visual information from chart images; for this, also experiment with the visual counterparts of the TableQA models that also take the chart image features into account; while T5 has a visual variant, VL-T5 (Cho et al., 2021), TaPas does not; in this work, extend TaPas to consider the image features and call it VisionTaPas; more details on models are provided in A.5; T5 (Raffel et al., 2020) is an encoder-decoder model which unifies the NLP tasks as text-to-text generation using the same architecture and loss function; it has been pre-trained on massive amount of unlabelled data with a self-supervised denoising objective; to fine-tune T5 on the ChartQA task, flatten the data table and feed it along with the question as: "Question: Question tokens Table: Flattened table tokens", and the model is trained to generate the answer directly; VL-T5 (Cho et al., 2021) is an extension of T5 that unifies the Vision-Language (VL) tasks as text generation conditioned on multimodal inputs; the input consists of both textual tokens and visual features of the objects extracted from the image using Faster R-CNN (Ren et al., 2015); the model is pre-trained on multiple multimodal tasks such as language modeling, visual QA, and visual grounding; utilize VL-T5 for the ChartQA task in the following manner: (a) for the textual input, do the same as T5 where flatten the data table of the chart image and concatenate it with the question text; (b) for the visual input, extract the visual features of different marks in the chart image (e.g., bars, lines) using Mask R-CNN (He et al., 2017) with Resnet-101 as its backbone (see A.4 for details); unlike the original VL-T5 where 
a fixed number of objects is provided (36), the number of elements varies from one chart to another; to account for this, pad the extracted visual features with zeros to have a fixed length of 36; TaPas (Herzig et al., 2020) extends a BERT (Devlin et al., 2019) architecture with additional positional embeddings for rows and columns to encode a table; as shown in Fig. 3a, the input to the model has the following format: [CLS] Question tokens [SEP] Flattened table tokens; the tokens are encoded with the table-specific positional embeddings in addition to BERT’s segment and positional embeddings; the model has two output heads: aggregation operation head and cell selection head; the aggregation operation head predicts an operation (e.g., COUNT, SUM, AVERAGE, NONE) which is then applied to the cell values selected by the cell selection head; depending on the operation type, the selected cells can constitute the final answer or the input used to infer the final answer; TaPas is first pre-trained on masked language modeling objective using table-text pairs crawled from Wikipedia where table cells are randomly masked and the model is trained to predict them; it is then fine-tuned in a weakly-supervised manner (using answers as the only supervision) with end-to-end differentiable objectives; VisionTaPas is our extension of TaPas for QA over charts; it consists of three main components: a vision transformer encoder for encoding the chart image, a TaPas encoder for encoding the question and data table and a cross-modal encoder (Fig. 3b); Vision Transformer or ViT (Dosovitskiy et al., 2021) utilizes the transformer encoder architecture (Vaswani et al., 2017) in vision tasks; given a 2D chart image, the image is divided into a sequence of 2D patches {p1, …, pn}; each patch is then flattened and linearly projected into a d-dimensional embedding vector; to incorporate the positional information of the patches, 1D learnable positional embeddings are added to the image features; an L-layer ViT encoder produces a sequence of embeddings H = [h_cls^L, h_1^L, …, h_n^L] representing the special [CLS] token and the image patches; initialize the ViT module with the pre-trained weights from (Dosovitskiy et al., 2021); the Cross-modality Encoder takes the output of ViT and TaPas encoders (H and Z) and computes multimodal encodings; it has four blocks, each containing a visual branch and a textual-tabular branch; the input first passes through the multiheaded cross attention layers in parallel, where in the visual branch the query vectors are the visual features, and the key and context vectors are the textual-tabular features and vice versa in the textual-tabular branch; the cross-attended features are then passed through a self-attention layer followed by a fully connected layer; similar to the transformer model, each layer applies layer normalization (Ba et al., 2016) and is wrapped with a residual connection; finally, append the aggregation operation and the cell selection heads of TaPas to the final layer at the textual-tabular branch; many questions in the ChartQA dataset require performing a subtraction or ratio operation, which the original TaPas model does not support; thus extend the operation head to add those two operations (Fig. 
3b); however, instead of training them in a weakly-supervised manner based on the final answer (as done in TaPas), find it more effective when provided with more direct but potentially noisy supervision on the cells to consider; rely on some heuristics to generate such supervision in the training data; to handle the fixed vocabulary answers (e.g. ‘Yes’, ‘No’), further extend the operation head to include those classes; "ChartQA Dataset" of Section 5.2 with Table 6 in Pages 7-8: evaluate the transferability of the models and the datasets, where first pretrain the two top performing models (VisionTaPas and VL-T5) on the PlotQA dataset and then fine-tune them on ChartQA; large datasets like PlotQA can be useful for pretraining the model even if the questions are generated from a small number of templates; Section A.4 in Page 15: train the model to detect the following 15 objects: ’Legend’, ’yAxisTitle’, ’ChartTitle’, ’xAxisTitle’, ’LegendPreview’, ’PlotArea’, ’yAxisLabel’, ’xAxisLabel’, ’LegendLabel’, ’PieLabel’, ’bar’, ’pie’, ’pieSlice’, ’line’, and ’dotLine’; for the bounding boxes annotations, use the available bboxes; for the masks, generate them easily using the bounding boxes for all the rectangular objects; for ’pieSlice’ and ’pie’, follow a similar approach to (Singh and Shekhar, 2020) where generate the masks by projecting the radius along the pie perimeter from the starting to the ending points of each slice; use the detectron2 library (Wu et al., 2019) and initialize the model with pre-trained weights on the COCO dataset (Lin et al., 2014); fine-tune the model with a batch size of 8 and an initial learning rate of 0.00025 for 50K iterations; Section A.5 with FIG. 9 in Pages 15-16: T5 and VL-T5 fine-tuning process setup is shown in Figure 9). Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 2-20 are rejected under 35 U.S.C. 103 as being unpatentable over Masry in view of SUN et al. ("FR-DETR: End-to-End Flowchart Recognition With Precision and Robustness", IEEE Access, June 14, 2022, pp. 64292-64301), hereinafter SUN. Claim 2 Masry discloses all the elements as stated in Claim 1, but fails to explicitly disclose wherein the graph-like charts are flowcharts. 
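To make the evaluation conventions cited from Masry in the Claim 1 analysis above easier to follow, the sketch below illustrates, in Python, the 5% relaxed-accuracy check and the ChartOCR-style data-extraction score built from the distance D(gt, pr) = min(1, |gt - pr|/gt), the K x K cost matrix, and a linear sum assignment. This is a minimal illustration added by the editor, not part of the cited references or the prosecution record; the function and variable names are assumptions chosen for readability.

```python
# Illustrative sketch only (editor's example, not from the cited references):
# the relaxed-accuracy check and the ChartOCR-style data-extraction score
# described in the Masry citations above.
import numpy as np
from scipy.optimize import linear_sum_assignment


def relaxed_match(pred, gold, tolerance=0.05):
    """Numeric answers may deviate up to 5% from the gold answer;
    non-numeric answers require an exact match."""
    try:
        pred_val, gold_val = float(pred), float(gold)
    except (TypeError, ValueError):
        return str(pred).strip().lower() == str(gold).strip().lower()
    if gold_val == 0:
        return pred_val == gold_val
    return abs(pred_val - gold_val) / abs(gold_val) <= tolerance


def chart_extraction_score(gt_values, pred_values):
    """Score one chart: D(gt, pr) = min(1, |gt - pr| / gt), build the K x K
    cost matrix (padded with the maximal cost of 1), solve the linear sum
    assignment, and return 1 - cost / K."""
    n, m = len(gt_values), len(pred_values)
    k = max(n, m)
    cost = np.ones((k, k))  # missing or extra values keep the maximal cost of 1
    for i, gt in enumerate(gt_values):
        for j, pr in enumerate(pred_values):
            cost[i, j] = min(1.0, abs(gt - pr) / abs(gt)) if gt != 0 else float(pr != 0)
    rows, cols = linear_sum_assignment(cost)   # binary assignment X
    total_cost = cost[rows, cols].sum()        # Cost = sum of C[i, j] * X[i, j]
    return 1.0 - total_cost / k


def overall_score(charts):
    """Overall Score = (1/L) * sum over charts of (1 - cost_i / K_i)."""
    return sum(chart_extraction_score(gt, pr) for gt, pr in charts) / len(charts)


if __name__ == "__main__":
    print(relaxed_match("40.5", "40.14"))                      # within 5% of gold
    print(round(overall_score([([17.13, 40.14], [17.0, 41.0])]), 3))
```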
SUN teaches a system and a method relating to recognize charts using transformer (SUN, ABSTRACT), wherein the graph-like charts are flowcharts (SUN, ABSTRACT in Page 64292: propose an end-to-end multi-task network FR-DETR (Flowchart Recognition DEtection TRansformer) and a new dataset for precise and robust flowchart recognition; FR-DETR comprises a CNN backbone and a shared multi-scale Transformer structure to perform symbol detection and edge detection using shared feature maps and respective prediction heads in a coarse-to-fine refinement process; the coarse stage analyzes features with low resolution and suggests candidate regions that contain potential targets for the fine stage to produce accurate predictions using features with high resolution; meanwhile, a new dataset is constructed to provide more symbol types and complex backgrounds for network training and evaluation, which contains more than 1000 machine-generated flowchart images, 25K+ symbol instances with nine categories, and 20K+ line segments; Section I with FIG. 1 in Pages 64292-64293: flowchart recognition is an essential sub-task in research on document analysis and recognition; the critical problem of flowchart recognition is to recognize and refine the structural semantics of flowcharts; there are two study areas for flowchart recognition: handwritten flowchart recognition and machine-generated flowchart recognition; understanding the structural semantics of machine-generated flowchart images is crucial for many structural-semantic-based tasks, such as patent retrieval, automatic code generation, and task-oriented dialogue systems; existing methods for recognizing machine-generated flowchart mainly extract the entire structure by analyzing the connected components in images and then identify specific structures using manually chosen features; flowchart structures now have more varied and colorful symbols and more complex backgrounds, leading to problems such as decrements in recognition accuracy as well as incompatibility and inflexibility of manually chosen features; deep-learning-based computer vision technologies are capable of focusing on desired targets, which can be summarized into symbols and edges in flowcharts; the flowchart recognition task can be divided into recognizing symbols using object detection and detecting edges using line segment detection; as shown in Fig. 
1, symbols and arrows are indicated by bounding boxes and edges are indicated by line segments; in flowchart recognition, information processed by different tasks is inherently correlated, such as symbols are connected by edges and edges are connecting symbols; believe a multi-task model is more suitable for flowchart recognition because (a) it enables information to be shared between two tasks of object detection and line segment detection, which can simplify the network structure; and (b) it reduces the training and analysis process by managing two tasks simultaneously; proposes an end-to-end multi-task network architecture named FR-DETR, the first deep learning system for machine-generated flowchart recognition to the best of our knowledge; by fusing DETR (DEtection TRansformer) and LETR (LinE segment TRansformers), FR-DETR has a CNN backbone and a multi-scale Transformer structure to perform fine-grained symbol detection and edge detection respectively; to satisfy the requirements of data volume and complexity for model training, a new flowchart dataset is also constructed; the machine-generated flowchart recognition can be accomplished using deep-learning-based object detection and line segment detection to handle the increasing symbol diversity and background complexity of flowchart images; propose an end-to-end multi-task learning network that combines object detection and line segment detection to reduce the costs caused by separate models; the model jointly detects symbols and edges in flowcharts and employs a multi-scale Transformer structure to improve the recognition accuracy of both tasks; the proposed method achieves a better recognition accuracy that outperforms the prior machine-generated flowchart recognition methods; a new machine-generated flowchart dataset is established to address data shortages for deep learning model training; Section II.A in Pages 64293-64294: Rusiñol et al. [7], [8] summarized flowcharts as structure layer and text layer, and then performed recognition after layer separation; Zhang [23] proposed a corner-based structural model (CBSM) based on the analysis of different corners and symbol shapes; the CBSM recognizes symbols by defining corner classification and corner-based spatial constraints for each kind of graphic shapes; Carton et al. [27] fused structural and statistical information to compute grammatical descriptions for each type of symbol; Lemaitre et al. [28] analyzed flowchart structure based on the description and modification of segment (DMOS) and structural information; Bresler et al. [29] solved a max-sum model optimization task to obtain the best symbol description; Schäfer et al. [30] proposed Arrow R-CNN to detect symbols and connecting edges by adding a keypoint head to Faster R-CNN; the keypoint head is designed to detect the arrow and tail belonging to a connecting edge as two keypoints; arrow R-CNN was the first deep-learning-based flowchart recognition approach; Section II.B in Page 64294: Carion et al. [14] introduced a Transformer based end-to-end object detection network DETR that removed pre-designed anchor and non-maximum suppression (NMS); it detects objects using interactions between a fixed number of queries and encoded image features; following the basic query concept of DETR, Xu et al. 
[19] proposed a Transformer based line segment detector LETR; unlike the prior line segment detection approaches consisting of heuristics-guided steps, the LETR detector directly regressed the endpoints of a line segment and achieved state-of-the-art performance on relative line segment detection datasets; Section II.C in Page 64294: deep-learning-based MTL (multitask learning) models aim to improve network generalization and the capability of jointly learning shared information; compared with single-task models, multi-task networks have advantages such as a reasonable reduction in model size and increment in inference speeds by sharing inherent parts of network structure; Section III.A-D with FIGS. 2-3 in Pages 64294-64296: as shown in Fig. 2, although the representation of connecting edges is adjusted to an entire connection between two symbols, errors in detecting edges and keypoints still occur in Arrow R-CNN; based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck; recently, a Transformer-based line segment detector LETR [19] suggested a model that directly regresses the endpoints coordinates of each line segment; the attention mechanism introduced by Transformer perfectly meets the need to distinguish line segments between wanted and unwanted ones; inspired by the aforementioned methods, with the aim of improving the flowchart recognition accuracy and reducing the costs of using isolated models, modify LETR into a multi-task model to jointly accomplish the two detection tasks; the model selected to perform symbol detection is DETR; the reasons can be concluded as follows: (a) DETR has a similar Transformer-based structure to LETR, which maximizes structure sharing between the two models and the overall cost reduction; (b) Other CNN-based object detection models and LETR have few shareable parts, which unavoidably results in insufficient structure sharing and difficulties in jointly analyzing features; the overall architecture of FR-DETR is designed based on a multi-scale Transformer encoder-decoder structure as depicted in Fig. 
3; the proposed flowchart recognition process can be divided into four sub-tasks: feature extraction, feature encoding, feature decoding and target prediction; FR-DETR uses a CNN backbone to extract a feature map from an input image; the channel dimension of the feature map f0 is then reduced from C to a smaller dimension d to obtain a new feature map f by 1 × 1 convolution; to meet the encoder's requirement, which expects input in the sequence format, the feature map f is flattened to create another feature map z; the Transformer encoder is stacked with multiple encoding layers; each layer consists of a multi-head self-attention module and a feed-forward network (FFN); the encoding layers receive processed features from their predecessor layer and deliver the output features to the corresponding FFN after learning the pairwise relations between the input and output; in general, the flattened feature map z is encoded into a new feature map z'; the positional encoding of f is added to guarantee the flattened feature map z not to lose the spatial relations; following the standard architecture of Transformer, the decoder transforms each N embeddings of symbol and line segment using the multi-head self-attention and cross-attention module; like the positional encoding of the encoder, the input embeddings are learnable positional information that is added to the input of each layer and named as target queries; each decoding layer receives z' from the last encoding layer and two types of target queries b and l, namely symbol queries and line queries, from its predecessor decoding layer; both types of queries are first processed by the self-attention model, and then, each entity in the queries is assigned to different regions of positional encoding by the cross-attention module; the output of the decoder is then used to predict the final results using an FFN; the final results of symbols and line segments are predicted by an FFN; specifically, the coordinates of symbols and line segments are computed by a multi-layer perceptron (MLP) with three layers, and the confidence of the predicted targets is produced by a linear projection layer; in contrast to object detection, which mainly focuses on local and neighborhood regions, line segment detection needs to consider the fine-grained local features and the global information; an efficient way to tackle the problem is a sequential two-stage structure, whose former component produces suggested regions for the other component to perform exact detection; following this idea, FR-DETR performs both the desired tasks in a refinement process using a coarse-to-fine strategy; this strategy enables FR-DETR to learn from multi-scale image features and produce precise predictions; in general, the model first analyzes the global information to locate possible targets coarsely and then uses the location information to examine local features and perform fine-grained recognition; in the coarse stage, FR-DETR studies a low resolution feature map to identify potential regions containing symbols and line segments; the low resolution feature map sent into the coarse encoder is the output of the ResNet [33] C5 layer, and its size is 1/32 of the original image resolution; after the encoding process, the encoded features and init-target queries are then passed into each decoding layer's cross-attention module and self-attention module; the predictions produced by the coarse stage are considered as potential target regions and received by fine decoding layers as mid-target 
queries; the coarse stage is important for improving the accuracy of the fine stage and can reduce the computation cost compared with directly processing high resolution features; in the fine stage, based on the suggested potential regions, FR-DETR makes detailed predictions using a feature map with 1/16 of the original image resolution, which is the output of the ResNet C4 layer and is twice the size of the feature map used in the coarse stage; in general, fine decoding is similar to that in the coarse stage; the main difference is that it processes information with more details and focuses on the suggested regions to conclude predictions, making the fine stage crucial for accomplishing precise fine-grained detection; in the prediction process, each type of final queries is fed into its corresponding FFN that consists of a classifier and a regressor to predict the category confidence p and the coordinates of every target; if a final query belongs to symbol detection, the coordinates prediction is in the format of bounding box b = (cx, cy, w, h), which denotes the center point, width and height of the box; otherwise, a line segment prediction l is represented by two endpoints (p1, p2), where p1 = (x1, y1), p2 = (x2, y2); the set of predictions t̂ has N targets, and the set of ground truth t has M elements, normally N > M; to assign a bipartite matching between the predictions and ground truth, t is assumed as a set that is padded with no object (Ø) to meet the size of t̂; in this case, an optimization for the bipartite matching is used to find the permutation with the lowest cost using eqns. (1)-(4); for the two detection tasks, each task loss must also evaluate the results of classification; by adding a cross-entropy loss term, each task loss can be represented as eqns. (5)-(6); the total loss of FR-DETR is formulated as eqn. (7); Section IV.A with Table 1 in Page 64297: the original CLEF-IP dataset released in 2011 contains machine-generated flowchart images and other structural diagrams; after removing non-flowcharts, approximately 60 images remain, most of which are provided by the European Patent Office (EPO) for the patent retrieval study; these flowcharts have a simple white background, and their structures are drawn in black; the dataset has three main symbols: rectangle for processing action, diamond for decision, and oval for terminator; to enrich the symbol category and structural complexity, public flowchart images are collected through the Internet by using image search engines, such as Google Image, Bing Image and Baidu Image; after filtering low quality images and removing duplicates, a dataset containing more than 1,000 images is constructed; the new dataset includes 25K+ symbol instances and 20K+ line segments; statistical details are shown in Table 1; the FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images; data augmentation methods including random resize, random flip, and random crop, are taken through all experiments). Masry and SUN are analogous art because they are from the same field of endeavor, a system and a method relating to recognizing charts using a transformer. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of SUN to Masry. Motivation for doing so would be to extend the capability of document analysis and recognition for different types of charts (SUN, 1st para. of Section I in Page 64292). 
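To clarify the bipartite matching step cited from SUN above (predictions matched against ground truth padded with "no object" to find the lowest-cost permutation, eqns. (1)-(4)), the following Python sketch is offered by the editor as an illustration only. The exact cost terms of SUN's equations are not reproduced in this Office action, so a generic class-probability plus L1-coordinate cost is assumed here, and all names are the editor's own rather than SUN's.

```python
# Illustrative sketch only (editor's example): DETR/FR-DETR-style bipartite
# matching between N predicted targets and M ground-truth targets, with the
# ground truth padded with "no object" entries so both sets have size N.
# A generic class-probability + L1-coordinate cost is assumed; SUN's exact
# cost terms (eqns. (1)-(7)) are not reproduced in this Office action.
import numpy as np
from scipy.optimize import linear_sum_assignment

NO_OBJECT = -1  # class id used for the padded "no object" (empty) entries


def match_predictions(pred_probs, pred_coords, gt_classes, gt_coords, coord_weight=1.0):
    """Return (prediction_index, ground_truth_index_or_None) pairs.

    pred_probs  : (N, C) class probabilities per predicted query
    pred_coords : (N, 4) boxes (cx, cy, w, h) or line endpoints (x1, y1, x2, y2)
    gt_classes  : (M,)   ground-truth class ids, with M <= N
    gt_coords   : (M, 4) ground-truth coordinates
    """
    n, m = len(pred_probs), len(gt_classes)
    # Pad the ground truth with "no object" so the matching is over an N x N matrix.
    padded_classes = np.concatenate([gt_classes, np.full(n - m, NO_OBJECT)])
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if padded_classes[j] == NO_OBJECT:
                continue  # matching a prediction to "no object" costs nothing here
            class_cost = -pred_probs[i, padded_classes[j]]            # prefer confident class
            coord_cost = np.abs(pred_coords[i] - gt_coords[j]).sum()  # L1 on coordinates
            cost[i, j] = class_cost + coord_weight * coord_cost
    rows, cols = linear_sum_assignment(cost)  # permutation with the lowest total cost
    return [(int(i), int(j) if j < m else None) for i, j in zip(rows, cols)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    preds_p = rng.dirichlet(np.ones(3), size=5)   # 5 queries, 3 symbol classes
    preds_c = rng.random((5, 4))
    gts_cls = np.array([0, 2])                    # 2 ground-truth symbols
    gts_crd = rng.random((2, 4))
    print(match_predictions(preds_p, preds_c, gts_cls, gts_crd))
```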
Claim 3 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses wherein question-answer pairs for each of the graph-like charts include topological questions about an associated underlying graph (Masry, Section 1 with FIG. 1 in Pages 1-2: Figure 1 illustrates sample questions in the benchmark; e.g., the question Q1 in Fig. 1 requires the user to compute the differences between the two lines for each year and find the year with the highest difference; Q2 in Fig. 1 refers to the color of a mark (‘line’) and its attribute (‘peak’) in the chart; present a largescale benchmark covering 9,608 human-written questions focusing on logical and visual reasoning questions; generate another 23,111 questions automatically from human-written chart summaries using a T5 model (Raffel et al., 2020) and manually validated a subset of it for quality assurance; collect a large number of questions automatically while maintaining rich variations in language as they were generated from human-written summaries; benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: ChartQA differs from previous datasets in two main aspects: the questions’ types (human-authored vs. template-based) and the chart source (real-world vs. generated using a tool); a detailed comparison is shown in Table 1; earlier datasets such as FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), LEAF-QA (Chaudhry et al., 2020) and LEAF-QA++ (Singh and Shekhar, 2020) are mostly synthetic where the questions are generated using a small number of templates and the answers come from a fixed set of vocabulary (e.g. ‘yes’, ‘no’); PlotQA (Methani et al., 2020) is the only dataset with open-vocabulary questions that require applying aggregation operations on the underlying chart data; however, they do not have visual reasoning questions while their questions are still template-based; build a new Chart QA dataset involving visual and logical reasoning questions written by humans on real-worlds charts; "Chart Data Extraction" of Section 2 in pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. 
(2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.2 in Pages 3-4: focus on two types of questions for each chart image: compositional and visual questions; compositional questions contain at least two mathematical/logical operations like sum, difference and average, while visual questions refer to the visual attributes such as color, height, and length of graphical marks (e.g., bars) in the chart; Section 3.3 with Table 4 in Pages 4-5: analyzed the basic linguistic statistics about our benchmark (see A.2) which has more unique tokens on both types of QA pair; observe that questions cover a variety of syntactic structure and sometimes exhibit informal languages and typos; the topic distribution in our data is quite diverse as it is constructed from four different sources; to analyze the nature of questions, randomly selected 300 QA pairs from the benchmark and categorized them into four types (Table 4); the vast majority of questions (76.33% in total) are either compositional or both visual and compositional, which reflects the real-world scenarios where people ask complex reasoning questions; people make visual references to a variety of visual attributes of marks (see A.2), most commonly to color (e.g., ‘orange line’) and length (e.g., ‘tallest bar’) followed by size (e.g., ‘largest slice’) and position (e.g., ‘leftmost bar’); Section A.2 with Tables 7-8 in Page 12 and FIG. 6 in Page 14: Table 7 shows some linguistic statistics about the benchmark; Figure 6 shows the distribution of topics in the dataset for each of the four sources; analyzed how people make visual references to charts in their questions; Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs; Section A.7 with Table 12 and FIG.10 in Pages 16-17: sample machine-generated questions with the human-written summaries are shown in Table 12; sample predictions from the model VisionTaPas on ChartQA test set are shown in Figure 10) (SUN, 1st para. of Section I with FIG. 1 in Pages 64292-64293: the critical problem of flowchart recognition is to recognize and refine the structural semantics of flowcharts; understanding the structural semantics of machine-generated flowchart images is crucial for many structural-semantic-based tasks, such as patent retrieval, automatic code generation, and task-oriented dialogue systems; Section III.A with FIG. 2 in Page 64294: based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck). Claim 4 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses wherein the plurality of question-answer pairs for each of the plurality of graph-like charts include geometric questions about spatial relations in the associated graph-like chart (Masry, Section 1 with FIG. 1 in Pages 1-2: Figure 1 illustrates sample questions in the benchmark; e.g., the question Q1 in Fig. 1 requires the user to compute the differences between the two lines for each year and find the year with the highest difference; Q2 in Fig. 
1 refers to the color of a mark (‘line’) and its attribute (‘peak’) in the chart; present a largescale benchmark covering 9,608 human-written questions focusing on logical and visual reasoning questions; generate another 23,111 questions automatically from human-written chart summaries using a T5 model (Raffel et al., 2020) and manually validated a subset of it for quality assurance; collect a large number of questions automatically while maintaining rich variations in language as they were generated from human-written summaries; benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: ChartQA differs from previous datasets in two main aspects: the questions’ types (human-authored vs. template-based) and the chart source (real-world vs. generated using a tool); a detailed comparison is shown in Table 1; earlier datasets such as FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), LEAF-QA (Chaudhry et al., 2020) and LEAF-QA++ (Singh and Shekhar, 2020) are mostly synthetic where the questions are generated using a small number of templates and the answers come from a fixed set of vocabulary (e.g. ‘yes’, ‘no’); PlotQA (Methani et al., 2020) is the only dataset with open-vocabulary questions that require applying aggregation operations on the underlying chart data; however, they do not have visual reasoning questions while their questions are still template-based; build a new Chart QA dataset involving visual and logical reasoning questions written by humans on real-worlds charts; "Chart Data Extraction" of Section 2 in pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. 
(2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.2 in Pages 3-4: focus on two types of questions for each chart image: compositional and visual questions; compositional questions contain at least two mathematical/logical operations like sum, difference and average, while visual questions refer to the visual attributes such as color, height, and length of graphical marks (e.g., bars) in the chart; Section 3.3 with Table 4 in Pages 4-5: analyzed the basic linguistic statistics about our benchmark (see A.2) which has more unique tokens on both types of QA pair; observe that questions cover a variety of syntactic structure and sometimes exhibit informal languages and typos; the topic distribution in our data is quite diverse as it is constructed from four different sources; to analyze the nature of questions, randomly selected 300 QA pairs from the benchmark and categorized them into four types (Table 4); the vast majority of questions (76.33% in total) are either compositional or both visual and compositional, which reflects the real-world scenarios where people ask complex reasoning questions; people make visual references to a variety of visual attributes of marks (see A.2), most commonly to color (e.g., ‘orange line’) and length (e.g., ‘tallest bar’) followed by size (e.g., ‘largest slice’) and position (e.g., ‘leftmost bar’); Section A.2 with Tables 7-8 in Page 12 and FIG. 6 in Page 14: Table 7 shows some linguistic statistics about the benchmark; Figure 6 shows the distribution of topics in the dataset for each of the four sources; analyzed how people make visual references to charts in their questions; Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs; Section A.7 with Table 12 and FIG.10 in Pages 16-17: sample machine-generated questions with the human-written summaries are shown in Table 12; sample predictions from the model VisionTaPas on ChartQA test set are shown in Figure 10) (SUN, 1st para. of Section I with FIG. 1 in Pages 64292-64293: the critical problem of flowchart recognition is to recognize and refine the structural semantics of flowcharts; understanding the structural semantics of machine-generated flowchart images is crucial for many structural-semantic-based tasks, such as patent retrieval, automatic code generation, and task-oriented dialogue systems; Section III.A with FIG. 2 in Page 64294: based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck). Claim 5 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses wherein the plurality of question-answer pairs for each of the plurality of graph-like charts include semantic questions about a content of an element in the associated graph-like chart (Masry, Section 1 with FIG. 1 in Pages 1-2: Figure 1 illustrates sample questions in the benchmark; e.g., the question Q1 in Fig. 1 requires the user to compute the differences between the two lines for each year and find the year with the highest difference; Q2 in Fig. 
1 refers to the color of a mark (‘line’) and its attribute (‘peak’) in the chart; present a largescale benchmark covering 9,608 human-written questions focusing on logical and visual reasoning questions; generate another 23,111 questions automatically from human-written chart summaries using a T5 model (Raffel et al., 2020) and manually validated a subset of it for quality assurance; collect a large number of questions automatically while maintaining rich variations in language as they were generated from human-written summaries; benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: ChartQA differs from previous datasets in two main aspects: the questions’ types (human-authored vs. template-based) and the chart source (real-world vs. generated using a tool); a detailed comparison is shown in Table 1; earlier datasets such as FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), LEAF-QA (Chaudhry et al., 2020) and LEAF-QA++ (Singh and Shekhar, 2020) are mostly synthetic where the questions are generated using a small number of templates and the answers come from a fixed set of vocabulary (e.g. ‘yes’, ‘no’); PlotQA (Methani et al., 2020) is the only dataset with open-vocabulary questions that require applying aggregation operations on the underlying chart data; however, they do not have visual reasoning questions while their questions are still template-based; build a new Chart QA dataset involving visual and logical reasoning questions written by humans on real-worlds charts; "Chart Data Extraction" of Section 2 in pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. 
(2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.2 in Pages 3-4: focus on two types of questions for each chart image: compositional and visual questions; compositional questions contain at least two mathematical/logical operations like sum, difference and average, while visual questions refer to the visual attributes such as color, height, and length of graphical marks (e.g., bars) in the chart; Section 3.3 with Table 4 in Pages 4-5: analyzed the basic linguistic statistics about our benchmark (see A.2) which has more unique tokens on both types of QA pair; observe that questions cover a variety of syntactic structure and sometimes exhibit informal languages and typos; the topic distribution in our data is quite diverse as it is constructed from four different sources; to analyze the nature of questions, randomly selected 300 QA pairs from the benchmark and categorized them into four types (Table 4); the vast majority of questions (76.33% in total) are either compositional or both visual and compositional, which reflects the real-world scenarios where people ask complex reasoning questions; people make visual references to a variety of visual attributes of marks (see A.2), most commonly to color (e.g., ‘orange line’) and length (e.g., ‘tallest bar’) followed by size (e.g., ‘largest slice’) and position (e.g., ‘leftmost bar’); Section A.2 with Tables 7-8 in Page 12 and FIG. 6 in Page 14: Table 7 shows some linguistic statistics about the benchmark; Figure 6 shows the distribution of topics in the dataset for each of the four sources; analyzed how people make visual references to charts in their questions; Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs; Section A.7 with Table 12 and FIG.10 in Pages 16-17: sample machine-generated questions with the human-written summaries are shown in Table 12; sample predictions from the model VisionTaPas on ChartQA test set are shown in Figure 10) (SUN, 1st para. of Section I with FIG. 1 in Pages 64292-64293: the critical problem of flowchart recognition is to recognize and refine the structural semantics of flowcharts; understanding the structural semantics of machine-generated flowchart images is crucial for many structural-semantic-based tasks, such as patent retrieval, automatic code generation, and task-oriented dialogue systems; Section III.A with FIG. 2 in Page 64294: based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck). Claim 6 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses wherein the vision-language architecture comprises a Bidirectional Encoder Representations from Transformers (BERT) model and a Vision Transformer (ViT) (Masry, Section 4.2 with FIG. 
3 in Pages 5-6: the approach to ChartQA builds on two of the state-of-the-art TableQA models: T5 (Raffel et al., 2020; Nan et al., 2021) and TaPas (Herzig et al., 2020); the input to these models consists of the question qi and the data table ti; different from TableQA, ChartQA often involves extracting visual information from chart images; for this, also experiment with the visual counterparts of the TableQA models that also take the chart image features into account; while T5 has a visual variant, VL-T5 (Cho et al., 2021), TaPas does not; in this work, extend TaPas to consider the image features and call it VisionTaPas; more details on models are provided in A.5; T5 (Raffel et al., 2020) is an encoder-decoder model which unifies the NLP tasks as text-to-text generation using the same architecture and loss function; it has been pre-trained on massive amount of unlabelled data with a self-supervised denoising objective; to fine-tune T5 on the ChartQA task, flatten the data table and feed it along with the question as: "Question: Question tokens Table: Flattened table tokens", and the model is trained to generate the answer directly; VL-T5 (Cho et al., 2021) is an extension of T5 that unifies the Vision-Language (VL) tasks as text generation conditioned on multimodal inputs; the input consists of both textual tokens and visual features of the objects extracted from the image using Faster R-CNN (Ren et al., 2015); the model is pre-trained on multiple multimodal tasks such as language modeling, visual QA, and visual grounding; utilize VL-T5 for the ChartQA task in the following manner: (a) for the textual input, do the same as T5 where flatten the data table of the chart image and concatenate it with the question text; (b) for the visual input, extract the visual features of different marks in the chart image (e.g., bars, lines) using Mask R-CNN (He et al., 2017) with Resnet-101 as its backbone (see A.4 for details); unlike the original VL-T5 where a fixed number of objects is provided (36), the number of elements varies from one chart to another; to account for this, pad the extracted visual features with zeros to have a fixed length of 36; TaPas (Herzig et al., 2020) extends a BERT (Devlin et al., 2019) architecture with additional positional embeddings for rows and columns to encode a table; as shown in Fig. 3a, the input to the model has the following format: [CLS] Question tokens [SEP] Flattened table tokens; the tokens are encoded with the table-specific positional embeddings in addition to BERT’s segment and positional embeddings; the model has two output heads: aggregation operation head and cell selection head; the aggregation operation head predicts an operation (e.g., COUNT, SUM, AVERAGE, NONE) which is then applied to the cell values selected by the cell selection head; depending on the operation type, the selected cells can constitute the final answer or the input used to infer the final answer; TaPas is first pre-trained on masked language modeling objective using table-text pairs crawled from Wikipedia where table cells are randomly masked and the model is trained to predict them; it is then fine-tuned in a weakly-supervised manner (using answers as the only supervision) with end-to-end differentiable objectives; VisionTaPas is our extension of TaPas for QA over charts; it consists of three main components: a vision transformer encoder for encoding the chart image, a TaPas encoder for encoding the question and data table and a cross-modal encoder (Fig. 
3b); Vision Transformer or ViT (Dosovitskiy et al., 2021) utilizes the transformer encoder architecture (Vaswani et al., 2017) in vision tasks; given a 2D chart image, the image is divided into a sequence of 2D patches {p1, …, pn}; each patch is then flattened and linearly projected into a d-dimensional embedding vector; to incorporate the positional information of the patches, 1D learnable positional embeddings are added to the image features; an L-layer ViT encoder produces a sequence of embeddings H = {h_cls^L, h_1^L, …, h_n^L} representing the special [CLS] token and the image patches; initialize the ViT module with the pre-trained weights from (Dosovitskiy et al., 2021); the Cross-modality Encoder takes the output of ViT and TaPas encoders (H and Z) and compute multimodal encodings; it has four blocks, each containing a visual branch and a textual-tabular branch; the input first passes through the multiheaded cross attention layers in parallel, where in the visual branch the query vectors are the visual features, and the key and context vectors are the textual-tabular features and vice versa in the textual-tabular branch; the cross-attended features are then passed through a self-attention layer followed by a fully connected layer; similar to the transformer model, each layer applies layer normalization (Ba et al., 2016) and is wrapped with a residual connection; finally, append the aggregation operation and the cell selection heads of TaPas to the final layer at the textual-tabular branch; many questions in the ChartQA dataset require performing a subtraction or ratio operation, which the original TaPas model does not support; thus extend the operation head to add those two operations (Fig. 3b); however, instead of training them in a weakly-supervised manner based on the final answer (as done in TaPas), find it more effective when provided with more direct but potentially noisy supervision on the cells to consider; rely on some heuristics to generate such supervision in the training data; to handle the fixed vocabulary answers (e.g. ‘Yes’, ‘No’), further extend the operation head to include those classes). Claim 7 Masry in view of SUN discloses all the elements as stated in Claim 6 and further discloses wherein the training of the vision-language architecture on the synthetic dataset to answer questions about the graph-like chart images comprises generating a representation of the graph-like chart images using the ViT (Masry, Section 4.2 with FIG. 
3 in Pages 5-6: the approach to ChartQA builds on two of the state-of-the-art TableQA models: T5 (Raffel et al., 2020; Nan et al., 2021) and TaPas (Herzig et al., 2020); the input to these models consists of the question qi and the data table ti; different from TableQA, ChartQA often involves extracting visual information from chart images; for this, also experiment with the visual counterparts of the TableQA models that also take the chart image features into account; while T5 has a visual variant, VL-T5 (Cho et al., 2021), TaPas does not; in this work, extend TaPas to consider the image features and call it VisionTaPas; more details on models are provided in A.5; T5 (Raffel et al., 2020) is an encoder-decoder model which unifies the NLP tasks as text-to-text generation using the same architecture and loss function; it has been pre-trained on massive amount of unlabelled data with a self-supervised denoising objective; to fine-tune T5 on the ChartQA task, flatten the data table and feed it along with the question as: "Question: Question tokens Table: Flattened table tokens", and the model is trained to generate the answer directly; VL-T5 (Cho et al., 2021) is an extension of T5 that unifies the Vision-Language (VL) tasks as text generation conditioned on multimodal inputs; the input consists of both textual tokens and visual features of the objects extracted from the image using Faster R-CNN (Ren et al., 2015); the model is pre-trained on multiple multimodal tasks such as language modeling, visual QA, and visual grounding; utilize VL-T5 for the ChartQA task in the following manner: (a) for the textual input, do the same as T5 where flatten the data table of the chart image and concatenate it with the question text; (b) for the visual input, extract the visual features of different marks in the chart image (e.g., bars, lines) using Mask R-CNN (He et al., 2017) with Resnet-101 as its backbone (see A.4 for details); unlike the original VL-T5 where a fixed number of objects is provided (36), the number of elements varies from one chart to another; to account for this, pad the extracted visual features with zeros to have a fixed length of 36; TaPas (Herzig et al., 2020) extends a BERT (Devlin et al., 2019) architecture with additional positional embeddings for rows and columns to encode a table; as shown in Fig. 3a, the input to the model has the following format: [CLS] Question tokens [SEP] Flattened table tokens; the tokens are encoded with the table-specific positional embeddings in addition to BERT’s segment and positional embeddings; the model has two output heads: aggregation operation head and cell selection head; the aggregation operation head predicts an operation (e.g., COUNT, SUM, AVERAGE, NONE) which is then applied to the cell values selected by the cell selection head; depending on the operation type, the selected cells can constitute the final answer or the input used to infer the final answer; TaPas is first pre-trained on masked language modeling objective using table-text pairs crawled from Wikipedia where table cells are randomly masked and the model is trained to predict them; it is then fine-tuned in a weakly-supervised manner (using answers as the only supervision) with end-to-end differentiable objectives; VisionTaPas is our extension of TaPas for QA over charts; it consists of three main components: a vision transformer encoder for encoding the chart image, a TaPas encoder for encoding the question and data table and a cross-modal encoder (Fig. 
3b); Vision Transformer or ViT (Dosovitskiy et al., 2021) utilizes the transformer encoder architecture (Vaswani et al., 2017) in vision tasks; given a 2D chart image, the image is divided into a sequence of 2D patches {p1, …, pn}; each patch is then flattened and linearly projected into a d-dimensional embedding vector; to incorporate the positional information of the patches, 1D learnable positional embeddings are added to the image features; an L-layer ViT encoder produces a sequence of embeddings H = {h_cls^L, h_1^L, …, h_n^L} representing the special [CLS] token and the image patches; initialize the ViT module with the pre-trained weights from (Dosovitskiy et al., 2021); the Cross-modality Encoder takes the output of ViT and TaPas encoders (H and Z) and compute multimodal encodings; it has four blocks, each containing a visual branch and a textual-tabular branch; the input first passes through the multiheaded cross attention layers in parallel, where in the visual branch the query vectors are the visual features, and the key and context vectors are the textual-tabular features and vice versa in the textual-tabular branch; the cross-attended features are then passed through a self-attention layer followed by a fully connected layer; similar to the transformer model, each layer applies layer normalization (Ba et al., 2016) and is wrapped with a residual connection; finally, append the aggregation operation and the cell selection heads of TaPas to the final layer at the textual-tabular branch; many questions in the ChartQA dataset require performing a subtraction or ratio operation, which the original TaPas model does not support; thus extend the operation head to add those two operations (Fig. 3b); however, instead of training them in a weakly-supervised manner based on the final answer (as done in TaPas), find it more effective when provided with more direct but potentially noisy supervision on the cells to consider; rely on some heuristics to generate such supervision in the training data; to handle the fixed vocabulary answers (e.g. ‘Yes’, ‘No’), further extend the operation head to include those classes). Claim 8 Masry in view of SUN discloses all the elements as stated in Claim 7 and further discloses generating edge annotations using heat maps (SUN, ABSTRACT in Page 64292: propose an end-to-end multi-task network FR-DETR (Flowchart Recognition DEtection TRansformer) and a new dataset for precise and robust flowchart recognition; FR-DETR comprises a CNN backbone and a shared multi-scale Transformer structure to perform symbol detection and edge detection using shared feature maps and respective prediction heads in a coarse-to-fine refinement process; the coarse stage analyzes features with low resolution and suggests candidate regions that contain potential targets for the fine stage to produce accurate predictions using features with high resolution; meanwhile, a new dataset is constructed to provide more symbol types and complex backgrounds for network training and evaluation, which contains more than 1000 machine-generated flowchart images, 25K+ symbol instances with nine categories, and 20K+ line segments; Section I with FIG. 1 in Pages 64292-64293: deep-learning-based computer vision technologies are capable of focusing on desired targets, which can be summarized into symbols and edges in flowcharts; the flowchart recognition task can be divided into recognizing symbols using object detection and detecting edges using line segment detection; as shown in Fig. 
1, symbols and arrows are indicated by bounding boxes and edges are indicated by line segments; in flowchart recognition, information processed by different tasks is inherently correlated, such as symbols are connected by edges and edges are connecting symbols; believe a multi-task model is more suitable for flowchart recognition because (a) it enables information to be shared between two tasks of object detection and line segment detection, which can simplify the network structure; and (b) it reduces the training and analysis process by managing two tasks simultaneously; proposes an end-to-end multi-task network architecture named FR-DETR, the first deep learning system for machine-generated flowchart recognition to the best of our knowledge; by fusing DETR (DEtection TRansformer) and LETR (LinE segment TRansformers), FR-DETR has a CNN backbone and a multi-scale Transformer structure to perform fine-grained symbol detection and edge detection respectively; to satisfy the requirements of data volume and complexity for model training, a new flowchart dataset is also constructed; the machine-generated flowchart recognition can be accomplished using deep-learning-based object detection and line segment detection to handle the increasing symbol diversity and background complexity of flowchart images; propose an end-to-end multi-task learning network that combines object detection and line segment detection to reduce the costs caused by separate models; the model jointly detects symbols and edges in flowcharts and employs a multi-scale Transformer structure to improve the recognition accuracy of both tasks; the proposed method achieves a better recognition accuracy that outperforms the prior machine-generated flowchart recognition methods; a new machine-generated flowchart dataset is established to address data shortages for deep learning model training; Section II.A in Pages 64293-64294: Schäfer et al. [30] proposed Arrow R-CNN to detect symbols and connecting edges by adding a keypoint head to Faster R-CNN; the keypoint head is designed to detect the arrow and tail belonging to a connecting edge as two keypoints; arrow R-CNN was the first deep-learning-based flowchart recognition approach; Section II.B in Page 64294: following the basic query concept of DETR, Xu et al. [19] proposed a Transformer based line segment detector LETR; unlike the prior line segment detection approaches consisting of heuristics-guided steps, the LETR detector directly regressed the endpoints of a line segment and achieved state-of-the-art performance on relative line segment detection datasets; Section III.A-D with FIGS. 2-3 in Pages 64294-64296: as shown in Fig. 
2, although the representation of connecting edges is adjusted to an entire connection between two symbols, errors in detecting edges and keypoints still occur in Arrow R-CNN; based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck; recently, a Transformer-based line segment detector LETR [19] suggested a model that directly regresses the endpoints coordinates of each line segment; the attention mechanism introduced by Transformer perfectly meets the need to distinguish line segments between wanted and unwanted ones; inspired by the aforementioned methods, with the aim of improving the flowchart recognition accuracy and reducing the costs of using isolated models, modify LETR into a multi-task model to jointly accomplish the two detection tasks; the model selected to perform symbol detection is DETR; the reasons can be concluded as follows: (a) DETR has a similar Transformer-based structure to LETR, which maximizes structure sharing between the two models and the overall cost reduction; (b) Other CNN-based object detection models and LETR have few shareable parts, which unavoidably results in insufficient structure sharing and difficulties in jointly analyzing features; the overall architecture of FR-DETR is designed based on a multi-scale Transformer encoder-decoder structure as depicted in Fig. 3; the proposed flowchart recognition process can be divided into four sub-tasks: feature extraction, feature encoding, feature decoding and target prediction; FR-DETR architecture: (a) a convolution network is used as backbone to produce two feature maps with different resolutions from an input flowchart; (b) the coarse encoder-decoder structure first encodes the smaller feature map and then generates the mid-queries with the interactions between encoded features and the init-queries; (c) the fine encoder-decoder structure encodes the feature map with higher resolution and outputs final-queries based on the interaction between the corresponding encoded feature and mid-queries; and (d) in the end, the final-queries are fed into feed-forward networks to make the final predictions). Claim 9 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses wherein the rendering of the plurality of graph-like charts from a plurality of associated input files comprises: rendering a plurality of images of random graph-like charts; and generating one or more bounding box annotations for each of the random graph-like charts (Masry, Section 1 in Pages 1-2: the charts are created automatically using a programming tool like Matplotlib (Singh and Shekhar, 2020); benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Chart Data Extraction" of Section 2 in pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. 
(2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.1 in Page 3: extracted the underlying data tables, metadata (e.g., title, chart type), SVG file and associate text description; finally, extracted the bounding boxes information of the different chart elements (e.g., x-axis labels) from the SVG files to train our data extraction models; Section 3.2 in Pages 3-4: checked on 500 random samples; Section 3.3 with Tables 3-4 in Pages 4-5: dataset has three commonly used chart types: bar, line, and pie charts (Table 3); categorize the bar and line charts into simple vs complex where data tables of simple charts have only two columns where complex charts involve multiple columns (e.g., stacked or grouped bars and multi-line charts); randomly selected 300 QA pairs from our benchmark and categorized them into four types; Section 4.2 in Pages 5-7: TaPas is first pre-trained on masked language modeling objective using table-text pairs crawled from Wikipedia where table cells are randomly masked and the model is trained to predict them; Section A.2 with Table 8 in Page 12: Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs; "Model" of Section A.3 in Page 12 with FIGS. 7-8 in Page 14: extend ChartOCR (Luo et al., 2021) which relies on both deep-learning models and rule-based techniques to parse the chart image into the underlying data table; key-point detection networks, adapted from (Law and Deng, 2019), locates the chart visual marks (e.g. bars, plot area, line points); ideally, the network locates the top-left point and bottom-right points for the rectangular objects (e.g. bar, plot area); in line charts, the detection network locates the coordinates of the points connecting the line segments; in pie charts, the network locates the intersection points between the pie segments along the pie perimeter; extend their detection networks to also locate the chart textual elements (e.g. x-axis-label, legend-label ) as shown in Figure 7a and utilize the CRAFT model (Baek et al., 2019) to read their underlying texts; the chart scale is estimated using the y-axis-labels value for line and bar charts, Figure 7b; for pie charts, the value of each segment is estimated by calculating the angle between its borderlines; finally, the model aggregates the extracted data values (using color and proximity heuristics) to output the final raw data values.; extend their approach to extract the fully-structured data table with the textual labels (e.g. 
column headers); as shown in Figure 7, associate the estimated bars data values (e.g., ‘17.13’, ‘40.14’) with their closest x-axis-label (’Snapchat’); moreover, if the chart has more than one data series (dark bars or blue bars values), each data series is matched with its legend-label (e.g., ‘2016’, ‘2014’) based on the color of the legend mark and data-encoding marks (e.g., bars); if cannot match data values with legends by colors (e.g., when all legend marks have the same color or there are no legend marks), use other criteria that associate data-encoding marks with legend marks (e.g., proximity, alignment); e.g., in Figure 8b, ’More’ is matched with ’17’ and ’29’ since they are vertically aligned; similarly, for line charts if there is no explicit legend mark for a line series associate the legend labels with the points of their closest lines as shown in Figure 8a; Section A.4 in Pages 15-16: for the bounding boxes annotations, use the available bboxes; for the masks, generate them easily using the bounding boxes for all the rectangular objects; Section A.6 with Table 10 in Page 16: Table 10 presents the results of two top-performing models in the benchmark by chart types; to analyze question types, randomly sampled 200 QA pairs from our ChartQA-H and classified them into four main categories) (SUN, Section I with FIG. 1 in Pages 64292-64293: as shown in Fig. 1, symbols and arrows are indicated by bounding boxes and edges are indicated by line segments; bounding boxes localize each symbol and arrow, line segments denote connecting edges, regardless of texture backgrounds; FIG. 2 in Page 64294: bounding boxes indicate entire connecting edges, and each red point and blue point represent the start and tail of belonged edge) (SUN, Section IV.A with Table 1 in Page 64297: the original CLEF-IP dataset released in 2011 contains machine-generated flowchart images and other structural diagrams; after removing non-flowcharts, approximately 60 images remain, most of which are provided by the European Patent Office (EPO) for the patent retrieval study; these flowcharts have a simple white background, and their structures are drawn in black; the dataset has three main symbols: rectangle for processing action, diamond for decision, and oval for terminator; to enrich the symbol category and structural complexity, public flowchart images are collected through the Internet by using image search engines, such as Google Image, Bing Image and Baidu Image; after filtering low quality images and removing duplicates, a dataset containing more than 1,000 images is constructed; the new dataset includes 25K+ symbol instances and 20K+ line segments; statistical details are shown in Table 1; the FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images; data augmentation methods including random resize, random flip, and random crop, are taken through all experiments; Section III.D in Page 64296: if a final query belongs to symbol detection, the coordinates prediction is in the format of bounding box b = (cx, cy, w, h), which denotes the center point, width and height of the box; otherwise, a line segment prediction l is represented by two endpoints (p1; p2), where p1 = (x1; y1), p2 = (x2; y2)). 
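For illustration only, and not the pipeline of Masry, SUN, or the claimed invention, the following minimal Python sketch renders a random bar chart with Matplotlib and emits a per-mark bounding-box annotation in the center/width/height format b = (cx, cy, w, h) cited from SUN's Section III.D; the function names, file path, and label scheme are assumptions made for the example.

```python
"""Minimal sketch: render a random bar chart and derive (cx, cy, w, h)
bounding-box annotations for each bar.  Illustrative only."""
import random

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt


def render_random_bar_chart(path="chart.png", n_bars=5, seed=0):
    random.seed(seed)
    labels = [f"cat_{i}" for i in range(n_bars)]
    values = [random.uniform(1.0, 10.0) for _ in range(n_bars)]

    fig, ax = plt.subplots()
    bars = ax.bar(labels, values)
    fig.canvas.draw()  # finalize layout before reading extents

    annotations = []
    for label, patch in zip(labels, bars.patches):
        # Pixel-space extent of each bar, converted to (cx, cy, w, h).
        x0, y0, x1, y1 = patch.get_window_extent().extents
        annotations.append({
            "label": label,
            "bbox": ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0),
        })

    fig.savefig(path)
    plt.close(fig)
    return annotations


if __name__ == "__main__":
    for ann in render_random_bar_chart():
        print(ann)
```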
Claim 10 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses receiving, from an end user via a user interface, a question about the graph-like charts; generating, by the trained vision-language architecture, an answer to the question; and presenting the generated answer to the end user via the user interface (Masry, 1st–2nd paragraphs of Section 1 with FIG. 1 in Page 1: Figure 1 displays a line chart with complex reasoning questions about the line chart involving arithmetic and logical operations and their answers; e.g., the question Q1 in Fig. 1 requires the user to compute the differences between the two lines for each year and find the year with the highest difference; Q2 in Fig. 1 refers to the color of a mark (‘line’) and its attribute (‘peak’) in the chart; Section 3.2 with Table 2 in Pages 3-4: collect human-authored QA pairs using Amazon Mechanical Turk (AMT); to create human-authored QA pairs, designed an AMT task (see A.1 for details) in which the crowdworkers are asked to focus on two types of questions for each chart image: compositional and visual questions; compositional questions contain at least two mathematical/logical operations like sum, difference and average, while visual questions refer to the visual attributes such as color, height, and length of graphical marks (e.g., bars) in the chart; Section 4.1 with FIG. 2 in Page 5: given a dataset with N examples D = {(ci, ti, qi, ai)}, i = 1, …, N, where ci represents a chart image, ti represents the underlying data table, qi represents a question over ci, and ai represents the answer to the question; the ChartQA models learn to predict the answer ai given ci, ti and qi, wherein the underlying data table ti for chart image ci is extracted by adapting a state-of-the-art ChartOCR (Luo et al., 2021); Section A.1 in Page 12 with FIG. 5 in Page 13: in each HIT (Human Intelligent Task), the workers verify two previously asked questions by other workers and also provide two new QA pairs; the data collection interface is shown in Figure 5; while presenting the chart in the user interface for annotation task, ensure that the data labels of chart elements are visible to workers so that they can accurately perform the necessary arithmetic and logical operations to provide and answer the questions successfully; Section A.7 with FIG. 
10 in Pages 16-17: sample machine-generated questions with the human-written summaries are shown in Table 12; sample predictions from the model VisionTaPas on ChartQA test set are shown in Figure 10, wherein Answers in green are correct and answers in red are incorrect) Claim 11 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses wherein the graph data comprises nodes, edges, labels, and style settings for the graph-like chart (Masry, Section 1 in Pages 1-2: the charts are created automatically using a programming tool like Matplotlib (Singh and Shekhar, 2020); benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Chart Data Extraction" of Section 2 in pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. (2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.1 in Page 3: extracted the underlying data tables, metadata (e.g., title, chart type), SVG file and associate text description; finally, extracted the bounding boxes information of the different chart elements (e.g., x-axis labels) from the SVG files to train our data extraction models; Section 3.3 with Table 3 in Pages 4-5: dataset has three commonly used chart types: bar, line, and pie charts (Table 3); categorize the bar and line charts into simple vs complex where data tables of simple charts have only two columns where complex charts involve multiple columns (e.g., stacked or grouped bars and multi-line charts); "Model" of Section A.3 in Page 12 with FIGS. 7-8 in Page 14: extend ChartOCR (Luo et al., 2021) which relies on both deep-learning models and rule-based techniques to parse the chart image into the underlying data table; key-point detection networks, adapted from (Law and Deng, 2019), locates the chart visual marks (e.g. bars, plot area, line points); ideally, the network locates the top-left point and bottom-right points for the rectangular objects (e.g. bar, plot area); in line charts, the detection network locates the coordinates of the points connecting the line segments; in pie charts, the network locates the intersection points between the pie segments along the pie perimeter; extend their detection networks to also locate the chart textual elements (e.g. x-axis-label, legend-label ) as shown in Figure 7a and utilize the CRAFT model (Baek et al., 2019) to read their underlying texts; the chart scale is estimated using the y-axis-labels value for line and bar charts, Figure 7b; for pie charts, the value of each segment is estimated by calculating the angle between its borderlines; finally, the model aggregates the extracted data values (using color and proximity heuristics) to output the final raw data values.; extend their approach to extract the fully-structured data table with the textual labels (e.g. 
column headers); as shown in Figure 7, associate the estimated bars data values (e.g., ‘17.13’, ‘40.14’) with their closest x-axis-label (’Snapchat’); moreover, if the chart has more than one data series (dark bars or blue bars values), each data series is matched with its legend-label (e.g., ‘2016’, ‘2014’) based on the color of the legend mark and data-encoding marks (e.g., bars); if cannot match data values with legends by colors (e.g., when all legend marks have the same color or there are no legend marks), use other criteria that associate data-encoding marks with legend marks (e.g., proximity, alignment); e.g., in Figure 8b, ’More’ is matched with ’17’ and ’29’ since they are vertically aligned; similarly, for line charts if there is no explicit legend mark for a line series associate the legend labels with the points of their closest lines as shown in Figure 8a) (SUN, Section I with FIG. 1 in Pages 64292-64293: deep-learning-based computer vision technologies are capable of focusing on desired targets, which can be summarized into symbols and edges in flowcharts; although these approaches perform well in symbol detection, their bounding box representation makes it difficult to recognize line segments that are short or nearly parallel to the axes; some deep-learning-based line segment detectors that treat edges as endpoint pairs can be used to achieve good line segment detection performance; the flowchart recognition task can be divided into recognizing symbols using object detection and detecting edges using line segment detection; as shown in Fig. 1, symbols and arrows are indicated by bounding boxes and edges are indicated by line segments; in flowchart recognition, information processed by different tasks is inherently correlated, such as symbols are connected by edges and edges are connecting symbols; by fusing DETR (DEtection TRansformer) and LETR (LinE segment TRansformers), FR-DETR has a CNN backbone and a multi-scale Transformer structure to perform fine-grained symbol detection and edge detection respectively; the machine-generated flowchart recognition can be accomplished using deep-learning-based object detection and line segment detection to handle the increasing symbol diversity and background complexity of flowchart images; propose an end-to-end multi-task learning network that combines object detection and line segment detection to reduce the costs caused by separate models; the model jointly detects symbols and edges in flowcharts and employs a multi-scale; transformer structure to improve the recognition accuracy of both tasks; Section II.A in Pages 64293-64294: Schäfer et al. proposed Arrow R-CNN to detect symbols and connecting edges by adding a keypoint head to Faster R-CNN; the keypoint head is designed to detect the arrow and tail belonging to a connecting edge as two keypoints; arrow R-CNN was the first deep-learning-based flowchart recognition approach; Section II.A in Pages 64293-64294: Rusiñol et al. summarized flowcharts as structure layer and text layer, and then performed recognition after layer separation; Zhang proposed a corner-based structural model (CBSM) based on the analysis of different corners and symbol shapes; the CBSM recognizes symbols by defining corner classification and corner-based spatial constraints for each kind of graphic shapes; Carton et al. fused structural and statistical information to compute grammatical descriptions for each type of symbol; Lemaitre et al. 
analyzed flowchart structure based on the description and modification of segment (DMOS) and structural information; Bresler et al. solved a max-sum model optimization task to obtain the best symbol description; Schäfer et al. proposed Arrow R-CNN to detect symbols and connecting edges by adding a keypoint head to Faster R-CNN; the keypoint head is designed to detect the arrow and tail belonging to a connecting edge as two keypoints; arrow R-CNN was the first deep-learning-based flowchart recognition approach; Section III.A with FIG. 2 in Pages 64294-64295: every connecting edge detected by Arrow R-CNN must contain an arrow as a keypoint; however, several connecting edges within a machine-generated flowchart may share one arrow, which causes failures in detecting edges and keypoints when applying Arrow R-CNN; as shown in Fig. 2, although the representation of connecting edges is adjusted to an entire connection between two symbols, errors in detecting edges and keypoints still occur (e.g., due to the complex structure that machine-generated flowcharts have, errors occur when applying arrow R-CNN; bounding boxes indicate entire connecting edges, and each red point and blue point represent the start and tail of belonged edge; (a) shows the detection errors of box and keypoint when edges have no arrow; (b) shows the detection errors when edges have a nested layout); based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck; line segments are not appropriately described with bounding boxes because of their highly variable aspect ratio and limited choices of anchors; many works follow the procedure of first producing junction proposals and then converting them into line segments, which causes the performance of these methods to rely heavily on junction detection; however, in a flowchart, junctions appear on both connecting edges and symbols, which brings numerous redundant; Section III.D in Page 64296: if a final query belongs to symbol detection, the coordinates prediction is in the format of bounding box b = (cx, cy, w, h), which denotes the center point, width and height of the box; otherwise, a line segment prediction l is represented by two endpoints (p1; p2), where p1 = (x1; y1), p2 = (x2; y2); Section IV.A with Table 1 in Page 64297: the original CLEF-IP dataset released in 2011 contains machine-generated flowchart images and other structural diagrams; after removing non-flowcharts, approximately 60 images remain, most of which are provided by the European Patent Office (EPO) for the patent retrieval study; these flowcharts have a simple white background, and their structures are drawn in black; the dataset has three main symbols: rectangle for processing action, diamond for decision, and oval for terminator; to enrich the symbol category and structural complexity, public flowchart images are collected through the Internet by using image search engines, such as Google Image, Bing Image and Baidu Image; after filtering low quality images and removing duplicates, a dataset containing more than 1,000 images is constructed; the new dataset includes 25K+ symbol instances and 20K+ line segments; statistical details are shown in Table 1; the FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images; data augmentation methods including random resize, random flip, and random crop, are taken through all experiments). 
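As a rough illustration of the graph data recited in Claim 11 (nodes, edges, labels, and style settings), the sketch below shows one plausible in-memory representation using Python dataclasses; the field names, default styles, and example flowchart are assumptions for the example, not disclosures of Masry, SUN, or the application.

```python
"""Minimal sketch of graph data for a graph-like chart: nodes, edges,
labels, and style settings.  Field names are illustrative assumptions."""
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: str
    label: str
    shape: str = "rectangle"  # e.g. rectangle, diamond, oval (cf. SUN's symbol categories)


@dataclass
class Edge:
    source: str   # node_id at the tail of the connecting edge
    target: str   # node_id at the head of the connecting edge
    label: str = ""  # optional edge annotation, e.g. "yes"/"no"


@dataclass
class GraphSpec:
    nodes: List[Node]
    edges: List[Edge]
    style: Dict[str, str] = field(default_factory=lambda: {
        "font": "DejaVu Sans",
        "node_color": "#ffffff",
        "edge_color": "#000000",
    })


if __name__ == "__main__":
    spec = GraphSpec(
        nodes=[Node("n0", "Start", "oval"),
               Node("n1", "Is x > 0?", "diamond"),
               Node("n2", "End", "oval")],
        edges=[Edge("n0", "n1"), Edge("n1", "n2", label="yes")],
    )
    print(len(spec.nodes), "nodes,", len(spec.edges), "edges")
```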
Claim 12 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses wherein each set of questions and answers comprises a set of possible answers and one correct answer (Masry, Section 3.2 in Pages 3-4: to create human-authored QA pairs, designed an AMT task (see A.1 for details) in which the crowdworkers are asked to focus on two types of questions for each chart image: compositional and visual questions; for each chart, the workers provide two questions with the answers; the same questions are then answered by another annotator; if both workers’ answers exactly match, consider the answer to be correct; otherwise, manually check the answers to select the final correct answer; fine-tune a pre-trained T5 model on the SQuAD QA dataset (Rajpurkar et al., 2016) and apply to the human-written chart summaries that come with the charts from Statista to automatically generate questions that are human-like with sufficient lexical and syntactic variations; the process involves training and applying two T5 models: one for answer extraction and the other for answer-aware question generation; for answer extraction, the T5 model is trained to generate possible answers separated by [SEP] token given the textual summary as input (i.e., trained on SQuAD’s passage → answer pairs); for question generation, the proposed answer is first concatenated with the summary in the format: Answer: Answer Context: Chart Summary; then, the T5 model is trained to generate a question from the given question using the chart summary; this model is trained on SQuAD’s (passage, answer) → question pairs; since the summaries are human-written, the generated questions are similar to the human-authored questions (see example questions in A.7); Section A.1 in Page 12: in each HIT (Human Intelligent Task), the workers verify two previously asked questions by other workers and also provide two new QA pairs; to ensure quality, selected workers with an acceptance rate of 95% and total accomplished HITs of 5000; filtered the workers by giving them a pretest to select the best qualified workers for this task). 
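To make the Claim 12 mapping concrete, the short Python sketch below pairs a question with a set of possible answers and one correct answer, and applies the exact-match agreement rule Masry describes for two independent annotators; the class and function names are hypothetical, and the fallback string is an assumption for illustration.

```python
"""Minimal sketch: QA pair with candidate answers plus one correct answer,
and an exact-match agreement check between two annotations."""
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QAPair:
    question: str
    possible_answers: List[str]  # candidate answers (e.g., from an answer-extraction step)
    correct_answer: str          # the single gold answer retained for the pair


def resolve_answer(first: str, second: str) -> Optional[str]:
    # Accept the answer only when both annotations match exactly;
    # otherwise return None so the pair can be routed to manual review.
    return first if first == second else None


if __name__ == "__main__":
    gold = resolve_answer("2016", "2016")
    qa = QAPair(
        question="In which year was the gap between the two lines the largest?",
        possible_answers=["2014", "2016", "2018"],
        correct_answer=gold if gold is not None else "needs manual review",
    )
    print(qa)
```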
Claim 13 Masry in view of SUN discloses all the elements as stated in Claim 12 and further discloses wherein generating a synthetic dataset of graph-like chart images further comprises balancing the set of questions to remove trivial question and answer pairs (Masry, Section 3.3 with Table 4 in Pages 4-5: analyzed the basic linguistic statistics about our benchmark (see A.2) which has more unique tokens on both types of QA pair; observe that questions cover a variety of syntactic structure and sometimes exhibit informal languages and typos; the topic distribution in our data is quite diverse as it is constructed from four different sources; to analyze the nature of questions, randomly selected 300 QA pairs from the benchmark and categorized them into four types (Table 4); the vast majority of questions (76.33% in total) are either compositional or both visual and compositional, which reflects the real-world scenarios where people ask complex reasoning questions; people make visual references to a variety of visual attributes of marks (see A.2), most commonly to color (e.g., ‘orange line’) and length (e.g., ‘tallest bar’) followed by size (e.g., ‘largest slice’) and position (e.g., ‘leftmost bar’); Section 3.2 with Table 2 in Pages 3-4: the T5 question generation model may still generate invalid questions because of the mismatch in training and test domains; notice that some questions are either incomplete or not answerable from the chart (e.g., ‘What province includes Cape Town?’ is not answerable because it requires knowledge outside of the chart); to filter out invalid questions, developed a simple heuristic where filter out the question if the answer cannot be found in the chart data table; this heuristic was inspired by the fact that most answers to the generated questions were values/labels of chart elements; Section A.2 with Tables 7-8 in Page 12 and FIG. 6 in Page 14: Table 7 shows some linguistic statistics about the benchmark; Figure 6 shows the distribution of topics in the dataset for each of the four sources; analyzed how people make visual references to charts in their questions; Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs). Claim 14 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses wherein generating the plurality of question-answer pairs for each of the plurality of graph-like charts comprises: generating one or more topological questions pertaining to a graph structure of the graph-like chart by value assignment in a predefined structure template; producing one or more geometrical questions pertaining to a graphical rendering of the graph-like chart by value assignment in a predefined graphical template; and producing answers for the one or more questions using ground truth data for the graph-like chart by analyzing underlying graph and spatial locations using a graphing algorithm (Masry, Section 1 with FIG. 1 in Pages 1-2: Figure 1 illustrates sample questions in the benchmark; e.g., the question Q1 in Fig. 1 requires the user to compute the differences between the two lines for each year and find the year with the highest difference; Q2 in Fig. 
1 refers to the color of a mark (‘line’) and its attribute (‘peak’) in the chart; the questions are generated automatically using pre-defined templates (Kahou et al., 2017; Kafle et al., 2018; Chaudhry et al., 2020; Singh and Shekhar, 2020); present a largescale benchmark covering 9,608 human-written questions focusing on logical and visual reasoning questions; generate another 23,111 questions automatically from human-written chart summaries using a T5 model (Raffel et al., 2020) and manually validated a subset of it for quality assurance; collect a large number of questions automatically while maintaining rich variations in language as they were generated from human-written summaries; benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: ChartQA differs from previous datasets in two main aspects: the questions’ types (human-authored vs. template-based) and the chart source (real-world vs. generated using a tool); a detailed comparison is shown in Table 1; earlier datasets such as FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), LEAF-QA (Chaudhry et al., 2020) and LEAF-QA++ (Singh and Shekhar, 2020) are mostly synthetic where the questions are generated using a small number of templates and the answers come from a fixed set of vocabulary (e.g. ‘yes’, ‘no’); PlotQA (Methani et al., 2020) is the only dataset with open-vocabulary questions that require applying aggregation operations on the underlying chart data; however, they do not have visual reasoning questions while their questions are still template-based; build a new Chart QA dataset involving visual and logical reasoning questions written by humans on real-worlds charts; "Chart Data Extraction" of Section 2 in pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. 
(2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.2 in Pages 3-4: focus on two types of questions for each chart image: compositional and visual questions; compositional questions contain at least two mathematical/logical operations like sum, difference and average, while visual questions refer to the visual attributes such as color, height, and length of graphical marks (e.g., bars) in the chart; Section 3.3 with Table 4 in Pages 4-5: analyzed the basic linguistic statistics about our benchmark (see A.2) which has more unique tokens on both types of QA pair; observe that questions cover a variety of syntactic structure and sometimes exhibit informal languages and typos; the topic distribution in our data is quite diverse as it is constructed from four different sources; to analyze the nature of questions, randomly selected 300 QA pairs from the benchmark and categorized them into four types (Table 4); the vast majority of questions (76.33% in total) are either compositional or both visual and compositional, which reflects the real-world scenarios where people ask complex reasoning questions; people make visual references to a variety of visual attributes of marks (see A.2), most commonly to color (e.g., ‘orange line’) and length (e.g., ‘tallest bar’) followed by size (e.g., ‘largest slice’) and position (e.g., ‘leftmost bar’); Section A.2 with Tables 7-8 in Page 12 and FIG. 6 in Page 14: Table 7 shows some linguistic statistics about the benchmark; Figure 6 shows the distribution of topics in the dataset for each of the four sources; analyzed how people make visual references to charts in their questions; Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs; Section A.7 with Table 12 and FIG.10 in Pages 16-17: sample machine-generated questions with the human-written summaries are shown in Table 12; sample predictions from the model VisionTaPas on ChartQA test set are shown in Figure 10) (SUN, 1st para. of Section I with FIG. 1 in Pages 64292-64293: the critical problem of flowchart recognition is to recognize and refine the structural semantics of flowcharts; understanding the structural semantics of machine-generated flowchart images is crucial for many structural-semantic-based tasks, such as patent retrieval, automatic code generation, and task-oriented dialogue systems; Section III.A with FIG. 2 in Page 64294: based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck). 
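As a rough illustration of the Claim 14 limitations, the sketch below fills a predefined structure template (topological question) and a predefined graphical template (geometrical question) by value assignment, and derives each answer from the same ground-truth graph and layout data; the templates, adjacency list, and node positions are invented for the example and are not taken from Masry, SUN, or the application.

```python
"""Minimal sketch: template-based topological and geometrical questions
answered from ground-truth graph structure and node positions."""
from typing import Dict, List, Tuple

# Toy ground truth: adjacency list plus rendered node positions.
EDGES: Dict[str, List[str]] = {"Start": ["Check"], "Check": ["End", "Retry"], "Retry": ["Check"]}
POSITIONS: Dict[str, Tuple[float, float]] = {
    "Start": (0.0, 2.0), "Check": (0.0, 1.0), "End": (-1.0, 0.0), "Retry": (1.0, 0.0),
}

TOPOLOGICAL_TEMPLATE = "How many nodes does '{node}' connect to?"
GEOMETRICAL_TEMPLATE = "Which node is drawn furthest to the left?"


def topological_qa(node: str) -> Tuple[str, str]:
    # Value assignment into the structure template; answer from the adjacency list.
    return TOPOLOGICAL_TEMPLATE.format(node=node), str(len(EDGES.get(node, [])))


def geometrical_qa() -> Tuple[str, str]:
    # Answer the graphical template by analyzing the spatial layout.
    leftmost = min(POSITIONS, key=lambda n: POSITIONS[n][0])
    return GEOMETRICAL_TEMPLATE, leftmost


if __name__ == "__main__":
    print(topological_qa("Check"))  # ('How many nodes does ...', '2')
    print(geometrical_qa())         # ('Which node is drawn ...', 'End')
```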
Claim 15 Masry in view of SUN discloses all the elements as stated in Claim 2 and further discloses wherein the generating of the synthetic dataset of graph-like chart images comprises: receiving a real world graph-like chart dataset, wherein the real world graph-like chart dataset comprises textual labels having a semantic distribution; computing statistics of the real-world graph-like chart dataset, including a distribution of nodes and edges characteristics and a distribution of graphical styles; generating, using a pretrained language model, a plurality of labels matching the semantic distribution of provided labels; generating graph data matching the computed distribution of nodes and edge characteristics and the computed distribution of graphical styles; rendering the plurality of graph-like chart images and the question-answer pairs using the graph data; and filtering of the graph-like chart images based on a similarity to the real-world graph-like chart dataset (Masry, Section 1 in Pages 1-2: the charts are created automatically using a programming tool like Matplotlib (Singh and Shekhar, 2020); benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: FigureQA and DVQA use synthetically-generated data to plot the charts; LEAF-QA and LEAFQA++ use real-world data to plot the charts; the charts are plotted using a software in PlotQA (Methani et al., 2020); "Chart Data Extraction" of Section 2 in pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. 
(2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.1 in Page 3: crawled charts from four different sources: (i) Statista (statista.com) is an online platform that presents charts covering a variety of topics including economy, politics, and industry; (ii) the Pew research (pewresearch.org) publishes report about social and economic issues, demographic trends and public opinion with a wide variety of charts; (iii) Our World In Data or OWID (ourworldindata.org) is another platform that contains thousands of charts about different global issues such as economy, finance, and society; (iv) Organisation for Economic Co-operation and Development or OECD (oecd.org) is a global organization which shares reports and data analysis for policymaking; for the Pew dataset, only crawled chart images since the underlying data tables are not available; for the other three, extracted the underlying data tables, metadata (e.g., title, chart type), SVG file and associate text description; finally, extracted the bounding boxes information of the different chart elements (e.g., x-axis labels) from the SVG files to train our data extraction models; Section 3.3 with Table 3 in Pages 4-5: dataset has three commonly used chart types: bar, line, and pie charts (Table 3); categorize the bar and line charts into simple vs complex where data tables of simple charts have only two columns where complex charts involve multiple columns (e.g., stacked or grouped bars and multi-line charts); analyzed the basic linguistic statistics about our benchmark (see A.2); the richness of language variations which may introduce more challenges to the task; the topic distribution in our data is quite diverse as it is constructed from four different sources; Section A.1 in Page 12 with FIG. 5 in Page 13: the data collection interface is shown in Figure 5; while presenting the chart in the user interface for annotation task, ensure that the data labels of chart elements are visible to workers so that they can accurately perform the necessary arithmetic and logical operations to provide and answer the questions successfully; Section A.2 with Tables 7-8 in Page 12 and FIG. 6 in Page 14: Table 7 shows some linguistic statistics about the benchmark; Figure 6 shows the distribution of topics in the dataset for each of the four sources; analyzed how people make visual references to charts in their questions; Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs; "Model" of Section A.3 in Page 12 with FIGS. 7-8 in Page 14: extend ChartOCR (Luo et al., 2021) which relies on both deep-learning models and rule-based techniques to parse the chart image into the underlying data table; key-point detection networks, adapted from (Law and Deng, 2019), locates the chart visual marks (e.g. bars, plot area, line points); ideally, the network locates the top-left point and bottom-right points for the rectangular objects (e.g. bar, plot area); in line charts, the detection network locates the coordinates of the points connecting the line segments; in pie charts, the network locates the intersection points between the pie segments along the pie perimeter; extend their detection networks to also locate the chart textual elements (e.g. 
x-axis-label, legend-label ) as shown in Figure 7a and utilize the CRAFT model (Baek et al., 2019) to read their underlying texts; the chart scale is estimated using the y-axis-labels value for line and bar charts, Figure 7b; for pie charts, the value of each segment is estimated by calculating the angle between its borderlines; finally, the model aggregates the extracted data values (using color and proximity heuristics) to output the final raw data values.; extend their approach to extract the fully-structured data table with the textual labels (e.g. column headers); as shown in Figure 7, associate the estimated bars data values (e.g., ‘17.13’, ‘40.14’) with their closest x-axis-label (’Snapchat’); moreover, if the chart has more than one data series (dark bars or blue bars values), each data series is matched with its legend-label (e.g., ‘2016’, ‘2014’) based on the color of the legend mark and data-encoding marks (e.g., bars); if cannot match data values with legends by colors (e.g., when all legend marks have the same color or there are no legend marks), use other criteria that associate data-encoding marks with legend marks (e.g., proximity, alignment); e.g., in Figure 8b, ’More’ is matched with ’17’ and ’29’ since they are vertically aligned; similarly, for line charts if there is no explicit legend mark for a line series associate the legend labels with the points of their closest lines as shown in Figure 8a) (SUN, 1st para. of Section I with FIG. 1 in Pages 64292-64293: the critical problem of flowchart recognition is to recognize and refine the structural semantics of flowcharts; understanding the structural semantics of machine-generated flowchart images is crucial for many structural-semantic-based tasks, such as patent retrieval, automatic code generation, and task-oriented dialogue systems; Section III.A with FIG. 
2 in Page 64294: based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck; Section III.D in Page 64296: if a final query belongs to symbol detection, the coordinates prediction is in the format of bounding box b = (cx, cy, w, h), which denotes the center point, width and height of the box; otherwise, a line segment prediction l is represented by two endpoints (p1; p2), where p1 = (x1; y1), p2 = (x2; y2); Section IV.A with Table 1 in Page 64297: the original CLEF-IP dataset released in 2011 contains machine-generated flowchart images and other structural diagrams; after removing non-flowcharts, approximately 60 images remain, most of which are provided by the European Patent Office (EPO) for the patent retrieval study; these flowcharts have a simple white background, and their structures are drawn in black; the dataset has three main symbols: rectangle for processing action, diamond for decision, and oval for terminator; to enrich the symbol category and structural complexity, public flowchart images are collected through the Internet by using image search engines, such as Google Image, Bing Image and Baidu Image; after filtering low quality images and removing duplicates, a dataset containing more than 1,000 images is constructed; the new dataset includes 25K+ symbol instances and 20K+ line segments; statistical details are shown in Table 1; the FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images; data augmentation methods including random resize, random flip, and random crop, are taken through all experiments). Claim 16 Masry in view of SUN discloses all the elements as stated in Claim 15 and further discloses wherein the training of the vision-language architecture on the synthetic dataset to answer questions about the graph-like charts comprises: iteratively adapting the vision-language architecture using the synthetic dataset and adapting the synthetic dataset using the current vision-language architecture and the real world graph-like chart data (Masry, "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: ChartQA differs from previous datasets in two main aspects: the questions’ types (human-authored vs. template-based) and the chart source (real-world vs. generated using a tool); a detailed comparison is shown in Table 1; earlier datasets such as FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), LEAF-QA (Chaudhry et al., 2020) and LEAF-QA++ (Singh and Shekhar, 2020) are mostly synthetic where the questions are generated using a small number of templates and the answers come from a fixed set of vocabulary (e.g. 
‘yes’, ‘no’); PlotQA (Methani et al., 2020) is the only dataset with open-vocabulary questions that require applying aggregation operations on the underlying chart data; FigureQA and DVQA use synthetically-generated data to plot the charts; LEAF-QA and LEAFQA++ use real-world data to plot the charts; the charts are plotted using a software in PlotQA (Methani et al., 2020); "Dataset Augmentation" of Section 3.2 in Pages 3-4: prior work on QA has performed data augmentation by either creating template-based or machine generated questions; fine-tune a pre-trained T5 model on the SQuAD QA dataset (Rajpurkar et al., 2016) and apply to the human-written chart summaries that come with the charts from Statista to automatically generate questions that are human-like with sufficient lexical and syntactic variations; "ChartQA Dataset" of Section 5.2 with Table 6 in Pages 7-8: evaluate the transferability of the models and the datasets, where first pretrain the two top performing models (VisionTaPas and VL-T5) on the PlotQA dataset and then fine-tune them on ChartQA; large datasets like PlotQA can be useful for pretraining the model even if the questions are generated from a small number of templates; Section A.4 in Page 15: train the model to detect the following 15 objects: ’Legend’, ’yAxisTitle’, ’ChartTitle’, ’xAxisTitle’, ’LegendPreview’, ’PlotArea’, ’yAxisLabel’, ’xAxisLabel’, ’LegendLabel’, ’PieLabel’, ’bar’, ’pie’, ’pieSlice’, ’line’, and ’dotLine’; for the bounding boxes annotations, use the available bboxes; for the masks, generate them easily using the bounding boxes for all the rectangular objects; for ’pieSlice’ and ’pie’, follow a similar approach to (Singh and Shekhar, 2020) where generate the masks by projecting the radius along the pie perimeter from the starting to the ending points of each slice; use the detectron2 library (Wu et al., 2019) and initialize the model with pre-trained wights on the COCO dataset (Lin et al., 2014); fine-tune the model with a batch size of 8 and an initial learning rate of 0.00025 for 50K iterations) (SUN, ABSTRACT in Page 64292: propose an end-to-end multi-task network FR-DETR (Flowchart Recognition DEtection TRansformer) and a new dataset for precise and robust flowchart recognition; FR-DETR comprises a CNN backbone and a shared multi-scale Transformer structure to perform symbol detection and edge detection using shared feature maps and respective prediction heads in a coarse-to-fine refinement process; the coarse stage analyzes features with low resolution and suggests candidate regions that contain potential targets for the fine stage to produce accurate predictions using features with high resolution; meanwhile, a new dataset is constructed to provide more symbol types and complex backgrounds for network training and evaluation, which contains more than 1000 machine-generated flowchart images, 25K+ symbol instances with nine categories, and 20K+ line segments; Section IV.A with Table 1 in Page 64297: the original CLEF-IP dataset released in 2011 contains machine-generated flowchart images and other structural diagrams; after removing non-flowcharts, approximately 60 images remain, most of which are provided by the European Patent Office (EPO) for the patent retrieval study; these flowcharts have a simple white background, and their structures are drawn in black; the dataset has three main symbols: rectangle for processing action, diamond for decision, and oval for terminator; to enrich the symbol category and structural complexity, public flowchart 
images are collected through the Internet by using image search engines, such as Google Image, Bing Image and Baidu Image; after filtering low quality images and removing duplicates, a dataset containing more than 1,000 images is constructed; the new dataset includes 25K+ symbol instances and 20K+ line segments; statistical details are shown in Table 1; the FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images; data augmentation methods including random resize, random flip, and random crop, are taken through all experiments). Claim 17 Masry in view of SUN discloses all the elements as stated in Claim 15 and further discloses augmenting the real world graph-like chart dataset with synthetic data (Masry, "Dataset Augmentation" of Section 3.2 in Pages 3-4: prior work on QA has performed data augmentation by either creating template-based or machine generated questions; fine-tune a pre-trained T5 model on the SQuAD QA dataset (Rajpurkar et al., 2016) and apply to the human-written chart summaries that come with the charts from Statista to automatically generate questions that are human-like with sufficient lexical and syntactic variations) (SUN, Section IV.A with Table 1 in Page 64297: the original CLEF-IP dataset released in 2011 contains machine-generated flowchart images and other structural diagrams; after removing non-flowcharts, approximately 60 images remain, most of which are provided by the European Patent Office (EPO) for the patent retrieval study; these flowcharts have a simple white background, and their structures are drawn in black; the dataset has three main symbols: rectangle for processing action, diamond for decision, and oval for terminator; to enrich the symbol category and structural complexity, public flowchart images are collected through the Internet by using image search engines, such as Google Image, Bing Image and Baidu Image; after filtering low quality images and removing duplicates, a dataset containing more than 1,000 images is constructed; the new dataset includes 25K+ symbol instances and 20K+ line segments; statistical details are shown in Table 1; the FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images; data augmentation methods including random resize, random flip, and random crop, are taken through all experiments). Independent Claim 18 Masry discloses a computer program product for implementing a Question Answering (QA) system (Masry, 2nd paragraph of Section 1 in Page 1: the goal of a Chart Question Answering (ChartQA) system is to help users by taking a chart and a natural language question as input and predicting the answer; Section 4.1 with FIG. 2 in Page 5: the overall process of the ChartQA system is shown in Fig. 2), the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor (see ¶ [0035] of the specification to exclude signal per se.) 
(Masry, Section A.5 in Page 15: experiments were carried out on one 4-V100 GPU machine and one 4-A100 GPU machine; a computer readable storage medium is inherent in such GPU machines) to: generate a synthetic dataset of (Masry, Section 1 in Pages 1-2: the charts are created automatically using a programming tool like Matplotlib (Singh and Shekhar, 2020); benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Chart Data Extraction" of Section 2 in Pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. (2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.1 in Page 3: crawled charts from four different sources: (i) Statista (statista.com) is an online platform that presents charts covering a variety of topics including economy, politics, and industry; (ii) the Pew research (pewresearch.org) publishes reports about social and economic issues, demographic trends and public opinion with a wide variety of charts; (iii) Our World In Data or OWID (ourworldindata.org) is another platform that contains thousands of charts about different global issues such as economy, finance, and society; (iv) Organisation for Economic Co-operation and Development or OECD (oecd.org) is a global organization which shares reports and data analysis for policymaking; for the Pew dataset, only crawled chart images since the underlying data tables are not available; for the other three, extracted the underlying data tables, metadata (e.g., title, chart type), SVG file and associated text description; finally, extracted the bounding boxes information of the different chart elements (e.g., x-axis labels) from the SVG files to train our data extraction models; Section 3.3 with Table 3 in Pages 4-5: dataset has three commonly used chart types: bar, line, and pie charts (Table 3); categorize the bar and line charts into simple vs complex where data tables of simple charts have only two columns where complex charts involve multiple columns (e.g., stacked or grouped bars and multi-line charts); "Model" of Section A.3 in Page 12 with FIGS. 7-8 in Page 14: extend ChartOCR (Luo et al., 2021) which relies on both deep-learning models and rule-based techniques to parse the chart image into the underlying data table; key-point detection networks, adapted from (Law and Deng, 2019), locate the chart visual marks (e.g. bars, plot area, line points); ideally, the network locates the top-left point and bottom-right points for the rectangular objects (e.g. bar, plot area); in line charts, the detection network locates the coordinates of the points connecting the line segments; in pie charts, the network locates the intersection points between the pie segments along the pie perimeter; extend their detection networks to also locate the chart textual elements (e.g.
x-axis-label, legend-label ) as shown in Figure 7a and utilize the CRAFT model (Baek et al., 2019) to read their underlying texts; the chart scale is estimated using the y-axis-labels value for line and bar charts, Figure 7b; for pie charts, the value of each segment is estimated by calculating the angle between its borderlines; finally, the model aggregates the extracted data values (using color and proximity heuristics) to output the final raw data values.; extend their approach to extract the fully-structured data table with the textual labels (e.g. column headers); as shown in Figure 7, associate the estimated bars data values (e.g., ‘17.13’, ‘40.14’) with their closest x-axis-label (’Snapchat’); moreover, if the chart has more than one data series (dark bars or blue bars values), each data series is matched with its legend-label (e.g., ‘2016’, ‘2014’) based on the color of the legend mark and data-encoding marks (e.g., bars); if cannot match data values with legends by colors (e.g., when all legend marks have the same color or there are no legend marks), use other criteria that associate data-encoding marks with legend marks (e.g., proximity, alignment); e.g., in Figure 8b, ’More’ is matched with ’17’ and ’29’ since they are vertically aligned; similarly, for line charts if there is no explicit legend mark for a line series associate the legend labels with the points of their closest lines as shown in Figure 8a), the generating comprising: rendering a plurality of (Masry. 1st paragraph of Section 1 with FIG. 1 in Page 1: Figure 1 displays a line chart with complex reasoning questions about the line chart involving arithmetic and logical operations; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: FigureQA and DVQA use synthetically-generated data to plot the charts; LEAF-QA and LEAFQA++ use real-world data to plot the charts; the charts are plotted using a software in PlotQA (Methani et al., 2020); Section A.1 in Page 12 with FIG. 5 in Page 13: the data collection interface is shown in Figure 5; while presenting the chart in the user interface for annotation task, ensure that the data labels of chart elements are visible to workers so that they can accurately perform the necessary arithmetic and logical operations to provide and answer the questions successfully); generating a plurality of question-answer pairs for each of the (Masry, Abstract in Page 1: present a large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries; Section 1 with FIG. 1 in Pages 1-2: Figure 1 illustrates sample questions in the benchmark; e.g., the question Q1 in Fig. 1 requires the user to compute the differences between the two lines for each year and find the year with the highest difference; Q2 in Fig. 
1 refers to the color of a mark (‘line’) and its attribute (‘peak’) in the chart; the questions are generated automatically using pre-defined templates (Kahou et al., 2017; Kafle et al., 2018; Chaudhry et al., 2020; Singh and Shekhar, 2020); present a largescale benchmark covering 9,608 human-written questions focusing on logical and visual reasoning questions; generate another 23,111 questions automatically from human-written chart summaries using a T5 model (Raffel et al., 2020) and manually validated a subset of it for quality assurance; collect a large number of questions automatically while maintaining rich variations in language as they were generated from human-written summaries; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: ChartQA differs from previous datasets in two main aspects: the questions’ types (human-authored vs. template-based) and the chart source (real-world vs. generated using a tool); a detailed comparison is shown in Table 1; earlier datasets such as FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), LEAF-QA (Chaudhry et al., 2020) and LEAF-QA++ (Singh and Shekhar, 2020) are mostly synthetic where the questions are generated using a small number of templates and the answers come from a fixed set of vocabulary (e.g. ‘yes’, ‘no’); PlotQA (Methani et al., 2020) is the only dataset with open-vocabulary questions that require applying aggregation operations on the underlying chart data; Section 3.2 with Table 2 in Pages 3-4: two main annotations procedures: (i) collect human-authored QA pairs using Amazon Mechanical Turk (AMT) and (ii) generate QA pairs from the Statista human-written summaries; to create human-authored QA pairs, designed an AMT task (see A.1 for details) in which the crowdworkers are asked to focus on two types of questions for each chart image: compositional and visual questions; compositional questions contain at least two mathematical/logical operations like sum, difference and average, while visual questions refer to the visual attributes such as color, height, and length of graphical marks (e.g., bars) in the chart; prior work on QA has performed data augmentation by either creating template-based or machine generated questions; fine-tune a pre-trained T5 model on the SQuAD QA dataset (Rajpurkar et al., 2016) and apply to the human-written chart summaries that come with the charts from Statista to automatically generate questions that are human-like with sufficient lexical and syntactic variations; applying two T5 models: one for answer extraction and the other for answer-aware question generation; for question generation, the proposed answer is first concatenated with the summary in the format: Answer: Answer Context: Chart Summary; then, generate a question from the given question using the chart summary; since the summaries are human-written, the generated questions are similar to the human-authored questions (see example questions in A.7); to filter out invalid questions, developed a simple heuristic where filter out the question if the answer cannot be found in the chart data table; randomly split both of the human-written (ChartQA-H) and machine generated (ChartQA-M) QA pairs into train, validation, and test sets as shown in Table 2; Section 3.3 with Table 4 in Pages 4-5: analyzed the basic linguistic statistics about our benchmark (see A.2) which has more unique tokens on both types of QA pair; observe that questions cover a variety of syntactic structure and sometimes exhibit informal languages and typos; the 
topic distribution in our data is quite diverse as it is constructed from four different sources; to analyze the nature of questions, randomly selected 300 QA pairs from the benchmark and categorized them into four types (Table 4); the vast majority of questions (76.33% in total) are either compositional or both visual and compositional, which reflects the real-world scenarios where people ask complex reasoning questions; people make visual references to a variety of visual attributes of marks (see A.2), most commonly to color (e.g., ‘orange line’) and length (e.g., ‘tallest bar’) followed by size (e.g., ‘largest slice’) and position (e.g., ‘leftmost bar’); Section A.2 with Tables 7-8 in Page 12 and FIG. 6 in Page 14: Table 7 shows some linguistic statistics about the benchmark; Figure 6 shows the distribution of topics in the dataset for each of the four sources; analyzed how people make visual references to charts in their questions; Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs; Section A.7 with Table 12 in Pages 16-17: sample machine-generated questions with the human-written summaries are shown in Table 12); and calculating a plurality of ground truth annotations for each of the plurality of question-answer pairs and associated (Masry, Section 1 with FIG. 1 in Pages 1-2: data visualizations such as bar charts and line charts have become popular in analyzing data and making informed decisions; to analyze data, often people ask complex reasoning questions about charts involving arithmetic and logical operations (Kim et al., 2020); answering such questions requires a significant amount of perceptual and cognitive efforts as people need to combine multiple operations such as retrieving values, comparing values, finding maximum, calculating sums and differences of values; the answer for a complex reasoning question is derived through various mathematical operations such as aggregation and comparison for complex reasoning questions; Figure 1 illustrates a line chart with the correct answer to each question; Section 3.2 in Pages 3-4: for each chart, the workers provide two questions with the answers; the same questions are then answered by another annotator; if both workers’ answers exactly match, consider the answer to be correct; otherwise, manually check the answers to select the final correct answer; Section 5.1 in Page 7: following Methani et al. 
(2020), use a relaxed accuracy measure for the numeric answers to allow a minor inaccuracy that may result from the automatic data extraction process; consider an answer to be correct if it is within 5% of the gold answer; for non-numeric answers, still need an exact match to consider an answer to be correct; Section A.1 in Page 12: in each HIT (Human Intelligence Task), the workers verify two previously asked questions by other workers and also provide two new QA pairs; to ensure quality, selected workers with an acceptance rate of 95% and total accomplished HITs of 5000; filtered the workers by giving them a pretest to select the best qualified workers for this task; "Evaluation Metric" of Section A.3 with Table 9 in Pages 12 and 14-15: evaluation metric is adapted from ChartOCR (Luo et al., 2021); the distance between any two data values is estimated as D(gt, pr) = min(1, |gt - pr| / |gt|), where gt is the ground truth value and pr is the predicted value; for each chart, the cost matrix C, where C_{n,m} = D(gt_n, pr_m), is computed and the total minimum cost is calculated by solving the linear sum assignment problem Cost = Σ_{i=1}^{K} Σ_{j=1}^{K} C_{i,j} · X_{i,j}, where K = max(N, M) and X is a binary assignment matrix; the final overall score is then estimated as Overall Score = (1/L) · Σ_{i=1}^{L} (1 - cost_i / K_i), where L is the total number of charts; evaluation results are shown in Table 9; Section A.7 with FIG. 10 in Pages 16-17: sample predictions from the model VisionTaPas on ChartQA test set are shown in Figure 10); and train a vision-language architecture on the synthetic dataset to answer questions about the graph-like chart images, wherein the vision-language architecture comprises a Bidirectional Encoder Representations from Transformers (BERT) model and a Vision Transformer (ViT), and wherein the training of the vision-language architecture on the synthetic dataset to answer questions about the graph-like chart images comprises generating a representation of the graph-like chart images using the ViT (Masry, Abstract in Page 1: to address the unique challenges in the benchmark involving visual and logical reasoning over charts, present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions; while the models achieve state-of-the-art results on the previous datasets as well as on the benchmark, the evaluation also reveals several challenges in answering complex reasoning questions; Section 1 in Pages 1-2: generate a large-scale ChartQA dataset with real-world charts and human-authored question-answer pairs for training; a pipeline approach that combines visual features and automatically extracted data from charts to utilize in transformer-based QA models that provide state-of-the-art results; implement an extensive analysis and evaluation of the performance of our models; Section 3.1 in Page 3: extracted the bounding boxes information of the different chart elements (e.g., x-axis labels) from the SVG files to train data extraction models; Section 3.2 in Pages 3-4: large-scale language models like T5 (Raffel et al., 2020) which are trained on very large data from various web sources can learn general linguistic properties and variations (Brown et al., 2020); the process involves training and applying two T5 models: one for answer extraction and the
other for answer-aware question generation; for answer extraction, the T5 model is trained to generate possible answers separated by the [SEP] token given the textual summary as input (i.e., trained on SQuAD’s passage → answer pairs); for question generation, the proposed answer is first concatenated with the summary in the format: Answer: Answer Context: Chart Summary; then, the T5 model is trained to generate a question from the given answer using the chart summary; this model is trained on SQuAD’s (passage, answer) → question pairs; Section 4.1 with FIG. 2 in Page 5: consider two problem settings for ChartQA; the first setting assumes that the underlying data table of the chart image is available; formally, given a dataset with N examples D = {(c_i, t_i, q_i, a_i)}_{i=1}^{N}, where ci represents a chart image, ti represents the underlying data table, qi represents a question over ci, and ai represents the answer to the question; the ChartQA models learn to predict the answer ai given ci, ti and qi; the gold data tables are not generally accessible in most real-world scenarios; thus consider the second setup where the underlying data table ti for chart image ci is extracted by adapting a state-of-the-art ChartOCR (Luo et al., 2021); ChartOCR first locates the main elements of the chart image (e.g., plot area, title) as well as data-encoding marks (e.g., bars) using key-point detection networks; it then uses the detected keypoints of each mark along with axis-labels to estimate the data value of that mark; however, it does not associate the predicted data values with corresponding text labels (e.g., x-axis-label); hence, extend their approach to output the fully-structured data tables; utilize the CRAFT (Baek et al., 2019) model to recognize the texts in the chart elements; then, associate the data values with their text labels using positional and color information (see A.3 for details); Section 4.2 with FIG.
3 in Pages 5-6: the approach to ChartQA builds on two of the state-of-the-art TableQA models: T5 (Raffel et al., 2020; Nan et al., 2021) and TaPas (Herzig et al., 2020); the input to these models consists of the question qi and the data table ti; different from TableQA, ChartQA often involves extracting visual information from chart images; for this, also experiment with the visual counterparts of the TableQA models that also take the chart image features into account; while T5 has a visual variant, VL-T5 (Cho et al., 2021), TaPas does not; in this work, extend TaPas to consider the image features and call it VisionTaPas; more details on models are provided in A.5; T5 (Raffel et al., 2020) is an encoder-decoder model which unifies the NLP tasks as text-to-text generation using the same architecture and loss function; it has been pre-trained on massive amount of unlabelled data with a self-supervised denoising objective; to fine-tune T5 on the ChartQA task, flatten the data table and feed it along with the question as: "Question: Question tokens Table: Flattened table tokens", and the model is trained to generate the answer directly; VL-T5 (Cho et al., 2021) is an extension of T5 that unifies the Vision-Language (VL) tasks as text generation conditioned on multimodal inputs; the input consists of both textual tokens and visual features of the objects extracted from the image using Faster R-CNN (Ren et al., 2015); the model is pre-trained on multiple multimodal tasks such as language modeling, visual QA, and visual grounding; utilize VL-T5 for the ChartQA task in the following manner: (a) for the textual input, do the same as T5 where flatten the data table of the chart image and concatenate it with the question text; (b) for the visual input, extract the visual features of different marks in the chart image (e.g., bars, lines) using Mask R-CNN (He et al., 2017) with Resnet-101 as its backbone (see A.4 for details); unlike the original VL-T5 where a fixed number of objects is provided (36), the number of elements varies from one chart to another; to account for this, pad the extracted visual features with zeros to have a fixed length of 36; TaPas (Herzig et al., 2020) extends a BERT (Devlin et al., 2019) architecture with additional positional embeddings for rows and columns to encode a table; as shown in Fig. 3a, the input to the model has the following format: [CLS] Question tokens [SEP] Flattened table tokens; the tokens are encoded with the table-specific positional embeddings in addition to BERT’s segment and positional embeddings; the model has two output heads: aggregation operation head and cell selection head; the aggregation operation head predicts an operation (e.g., COUNT, SUM, AVERAGE, NONE) which is then applied to the cell values selected by the cell selection head; depending on the operation type, the selected cells can constitute the final answer or the input used to infer the final answer; TaPas is first pre-trained on masked language modeling objective using table-text pairs crawled from Wikipedia where table cells are randomly masked and the model is trained to predict them; it is then fine-tuned in a weakly-supervised manner (using answers as the only supervision) with end-to-end differentiable objectives; VisionTaPas is our extension of TaPas for QA over charts; it consists of three main components: a vision transformer encoder for encoding the chart image, a TaPas encoder for encoding the question and data table and a cross-modal encoder (Fig. 
3b); Vision Transformer or ViT (Dosovitskiy et al., 2021) utilizes the transformer encoder architecture (Vaswani et al., 2017) in vision tasks; given a 2D chart image, the image is divided into a sequence of 2D patches {p1, …, pn}; each patch is then flattened and linearly projected into a d-dimensional embedding vector; to incorporate the positional information of the patches, 1D learnable positional embeddings are added to the image features; an L-layer ViT encoder produces a sequence of embeddings H = {h_cls^L, h_1^L, …, h_n^L} representing the special [CLS] token and the image patches; initialize the ViT module with the pre-trained weights from (Dosovitskiy et al., 2021); the Cross-modality Encoder takes the output of the ViT and TaPas encoders (H and Z) and computes multimodal encodings; it has four blocks, each containing a visual branch and a textual-tabular branch; the input first passes through the multiheaded cross attention layers in parallel, where in the visual branch the query vectors are the visual features, and the key and context vectors are the textual-tabular features, and vice versa in the textual-tabular branch; the cross-attended features are then passed through a self-attention layer followed by a fully connected layer; similar to the transformer model, each layer applies layer normalization (Ba et al., 2016) and is wrapped with a residual connection; finally, append the aggregation operation and the cell selection heads of TaPas to the final layer at the textual-tabular branch; many questions in the ChartQA dataset require performing a subtraction or ratio operation, which the original TaPas model does not support; thus extend the operation head to add those two operations (Fig. 3b); however, instead of training them in a weakly-supervised manner based on the final answer (as done in TaPas), find it more effective when provided with more direct but potentially noisy supervision on the cells to consider; rely on some heuristics to generate such supervision in the training data; to handle the fixed vocabulary answers (e.g. ‘Yes’, ‘No’), further extend the operation head to include those classes; "ChartQA Dataset" of Section 5.2 with Table 6 in Pages 7-8: evaluate the transferability of the models and the datasets, where first pretrain the two top performing models (VisionTaPas and VL-T5) on the PlotQA dataset and then fine-tune them on ChartQA; large datasets like PlotQA can be useful for pretraining the model even if the questions are generated from a small number of templates; Section A.4 in Page 15: train the model to detect the following 15 objects: ’Legend’, ’yAxisTitle’, ’ChartTitle’, ’xAxisTitle’, ’LegendPreview’, ’PlotArea’, ’yAxisLabel’, ’xAxisLabel’, ’LegendLabel’, ’PieLabel’, ’bar’, ’pie’, ’pieSlice’, ’line’, and ’dotLine’; for the bounding boxes annotations, use the available bboxes; for the masks, generate them easily using the bounding boxes for all the rectangular objects; for ’pieSlice’ and ’pie’, follow a similar approach to (Singh and Shekhar, 2020) where generate the masks by projecting the radius along the pie perimeter from the starting to the ending points of each slice; use the detectron2 library (Wu et al., 2019) and initialize the model with pre-trained weights on the COCO dataset (Lin et al., 2014); fine-tune the model with a batch size of 8 and an initial learning rate of 0.00025 for 50K iterations; Section A.5 with FIG. 9 in Pages 15-16: the T5 and VL-T5 fine-tuning process setup is shown in Figure 9).
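For reference, the evaluation measures quoted above from Masry (the relaxed accuracy of Section 5.1 and the ChartOCR-derived data-extraction score of Section A.3) can be sketched in Python. This is a minimal, non-authoritative sketch: the function names, the zero-denominator guard, and the treatment of unmatched values as incurring the maximum distance of 1 are illustrative assumptions, not details taken from the cited reference.

from typing import List, Sequence

import numpy as np
from scipy.optimize import linear_sum_assignment


def relaxed_match(gold: float, predicted: float, tolerance: float = 0.05) -> bool:
    # Section 5.1: a numeric answer is correct if it is within 5% of the gold answer.
    return abs(predicted - gold) <= tolerance * abs(gold)


def value_distance(gt: float, pr: float) -> float:
    # Section A.3: D(gt, pr) = min(1, |gt - pr| / |gt|); the gt == 0 guard is an assumption.
    if gt == 0:
        return float(pr != 0)
    return min(1.0, abs(gt - pr) / abs(gt))


def chart_cost(gt_values: Sequence[float], pr_values: Sequence[float]) -> tuple:
    # Build the K x K cost matrix with K = max(N, M); unmatched slots are assumed to cost 1.
    k = max(len(gt_values), len(pr_values))
    cost = np.ones((k, k))
    for i, gt in enumerate(gt_values):
        for j, pr in enumerate(pr_values):
            cost[i, j] = value_distance(gt, pr)
    rows, cols = linear_sum_assignment(cost)  # minimum-cost bipartite assignment
    return float(cost[rows, cols].sum()), k


def overall_score(gt_charts: List[Sequence[float]], pr_charts: List[Sequence[float]]) -> float:
    # Overall Score = (1/L) * sum_i (1 - cost_i / K_i) over the L charts.
    per_chart = []
    for gt_values, pr_values in zip(gt_charts, pr_charts):
        cost, k = chart_cost(gt_values, pr_values)
        per_chart.append(1.0 - cost / k)
    return float(np.mean(per_chart))


if __name__ == "__main__":
    print(relaxed_match(40.14, 40.0))                       # True: within 5% of the gold value
    print(overall_score([[17.13, 40.14]], [[17.0, 40.0]]))  # close to 1.0 for accurate extraction

The linear sum assignment step stands in for the binary assignment matrix X in the quoted Cost formula; scipy's Hungarian-algorithm solver is used here purely for illustration.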
Masry fails to explicitly disclose wherein charts are flowcharts. SUN teaches a system and a method relating to recognize charts using transformer (SUN, ABSTRACT), wherein charts are flowcharts (SUN, ABSTRACT in Page 64292: propose an end-to-end multi-task network FR-DETR (Flowchart Recognition DEtection TRansformer) and a new dataset for precise and robust flowchart recognition; FR-DETR comprises a CNN backbone and a shared multi-scale Transformer structure to perform symbol detection and edge detection using shared feature maps and respective prediction heads in a coarse-to-fine refinement process; the coarse stage analyzes features with low resolution and suggests candidate regions that contain potential targets for the fine stage to produce accurate predictions using features with high resolution; meanwhile, a new dataset is constructed to provide more symbol types and complex backgrounds for network training and evaluation, which contains more than 1000 machine-generated flowchart images, 25K+ symbol instances with nine categories, and 20K+ line segments; Section I with FIG. 1 in Pages 64292-64293: flowchart recognition is an essential sub-task in research on document analysis and recognition; the critical problem of flowchart recognition is to recognize and refine the structural semantics of flowcharts; there are two study areas for flowchart recognition: handwritten flowchart recognition and machine-generated flowchart recognition; understanding the structural semantics of machine-generated flowchart images is crucial for many structural-semantic-based tasks, such as patent retrieval, automatic code generation, and task-oriented dialogue systems; existing methods for recognizing machine-generated flowchart mainly extract the entire structure by analyzing the connected components in images and then identify specific structures using manually chosen features; flowchart structures now have more varied and colorful symbols and more complex backgrounds, leading to problems such as decrements in recognition accuracy as well as incompatibility and inflexibility of manually chosen features; deep-learning-based computer vision technologies are capable of focusing on desired targets, which can be summarized into symbols and edges in flowcharts; the flowchart recognition task can be divided into recognizing symbols using object detection and detecting edges using line segment detection; as shown in Fig. 
1, symbols and arrows are indicated by bounding boxes and edges are indicated by line segments; in flowchart recognition, information processed by different tasks is inherently correlated, such as symbols are connected by edges and edges are connecting symbols; believe a multi-task model is more suitable for flowchart recognition because (a) it enables information to be shared between two tasks of object detection and line segment detection, which can simplify the network structure; and (b) it reduces the training and analysis process by managing two tasks simultaneously; proposes an end-to-end multi-task network architecture named FR-DETR, the first deep learning system for machine-generated flowchart recognition to the best of our knowledge; by fusing DETR (DEtection TRansformer) and LETR (LinE segment TRansformers), FR-DETR has a CNN backbone and a multi-scale Transformer structure to perform fine-grained symbol detection and edge detection respectively; to satisfy the requirements of data volume and complexity for model training, a new flowchart dataset is also constructed; the machine-generated flowchart recognition can be accomplished using deep-learning-based object detection and line segment detection to handle the increasing symbol diversity and background complexity of flowchart images; propose an end-to-end multi-task learning network that combines object detection and line segment detection to reduce the costs caused by separate models; the model jointly detects symbols and edges in flowcharts and employs a multi-scale Transformer structure to improve the recognition accuracy of both tasks; the proposed method achieves a better recognition accuracy that outperforms the prior machine-generated flowchart recognition methods; a new machine-generated flowchart dataset is established to address data shortages for deep learning model training; Section II.A in Pages 64293-64294: Rusiñol et al. [7], [8] summarized flowcharts as structure layer and text layer, and then performed recognition after layer separation; Zhang [23] proposed a corner-based structural model (CBSM) based on the analysis of different corners and symbol shapes; the CBSM recognizes symbols by defining corner classification and corner-based spatial constraints for each kind of graphic shapes; Carton et al. [27] fused structural and statistical information to compute grammatical descriptions for each type of symbol; Lemaitre et al. [28] analyzed flowchart structure based on the description and modification of segment (DMOS) and structural information; Bresler et al. [29] solved a max-sum model optimization task to obtain the best symbol description; Schäfer et al. [30] proposed Arrow R-CNN to detect symbols and connecting edges by adding a keypoint head to Faster R-CNN; the keypoint head is designed to detect the arrow and tail belonging to a connecting edge as two keypoints; arrow R-CNN was the first deep-learning-based flowchart recognition approach; Section II.B in Page 64294: Carion et al. [14] introduced a Transformer based end-to-end object detection network DETR that removed pre-designed anchor and non-maximum suppression (NMS); it detects objects using interactions between a fixed number of queries and encoded image features; following the basic query concept of DETR, Xu et al. 
[19] proposed a Transformer based line segment detector LETR; unlike the prior line segment detection approaches consisting of heuristics-guided steps, the LETR detector directly regressed the endpoints of a line segment and achieved state-of-the-art performance on relative line segment detection datasets; Section II.C in Page 64294: deep-learning-based MTL (multitask learning) models aim to improve network generalization and the capability of jointly learning shared information; compared with single-task models, multi-task networks have advantages such as a reasonable reduction in model size and increment in inference speeds by sharing inherent parts of network structure; Section III.A-D with FIGS. 2-3 in Pages 64294-64296: as shown in Fig. 2, although the representation of connecting edges is adjusted to an entire connection between two symbols, errors in detecting edges and keypoints still occur in Arrow R-CNN; based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck; recently, a Transformer-based line segment detector LETR [19] suggested a model that directly regresses the endpoints coordinates of each line segment; the attention mechanism introduced by Transformer perfectly meets the need to distinguish line segments between wanted and unwanted ones; inspired by the aforementioned methods, with the aim of improving the flowchart recognition accuracy and reducing the costs of using isolated models, modify LETR into a multi-task model to jointly accomplish the two detection tasks; the model selected to perform symbol detection is DETR; the reasons can be concluded as follows: (a) DETR has a similar Transformer-based structure to LETR, which maximizes structure sharing between the two models and the overall cost reduction; (b) Other CNN-based object detection models and LETR have few shareable parts, which unavoidably results in insufficient structure sharing and difficulties in jointly analyzing features; the overall architecture of FR-DETR is designed based on a multi-scale Transformer encoder-decoder structure as depicted in Fig. 
3; the proposed flowchart recognition process can be divided into four sub-tasks: feature extraction, feature encoding, feature decoding and target prediction; FR-DETR uses a CNN backbone to extract a feature map from an input image; the channel dimension of the feature map f0 is then reduced from C to a smaller dimension d to obtain a new feature map f by 1 × 1 convolution; to meet the encoder's requirement, which expects input in the sequence format, the feature map f is flattened to create another feature map z; the Transformer encoder is stacked with multiple encoding layers; each layer consists of a multi-head self-attention module and a feed-forward network (FFN); the encoding layers receive processed features from their predecessor layer and deliver the output features to the corresponding FFN after learning the pairwise relations between the input and output; in general, the flattened feature map z is encoded into a new feature map z'; the positional encoding of f is added to guarantee the flattened feature map z not to lose the spatial relations; following the standard architecture of Transformer, the decoder transforms each N embeddings of symbol and line segment using the multi-head self-attention and cross-attention module; like the positional encoding of the encoder, the input embeddings are learnable positional information that is added to the input of each layer and named as target queries; each decoding layer receives z' from the last encoding layer and two types of target queries b and l, namely symbol queries and line queries, from its predecessor decoding layer; both types of queries are first processed by the self-attention model, and then, each entity in the queries is assigned to different regions of positional encoding by the cross-attention module; the output of the decoder is then used to predict the final results using an FFN; the final results of symbols and line segments are predicted by an FFN; specifically, the coordinates of symbols and line segments are computed by a multi-layer perceptron (MLP) with three layers, and the confidence of the predicted targets is produced by a linear projection layer; in contrast to object detection, which mainly focuses on local and neighborhood regions, line segment detection needs to consider the fine-grained local features and the global information; an efficient way to tackle the problem is a sequential two-stage structure, whose former component produces suggested regions for the other component to perform exact detection; following this idea, FR-DETR performs both the desired tasks in a refinement process using a coarse-to-fine strategy; this strategy enables FR-DETR to learn from multi-scale image features and produce precise predictions; in general, the model first analyzes the global information to locate possible targets coarsely and then uses the location information to examine local features and perform fine-grained recognition; in the coarse stage, FR-DETR studies a low resolution feature map to identify potential regions containing symbols and line segments; the low resolution feature map sent into the coarse encoder is the output of the ResNet [33] C5 layer, and its size is 1/32 of the original image resolution; after the encoding process, the encoded features and init-target queries are then passed into each decoding layer's cross-attention module and self-attention module; the predictions produced by the coarse stage are considered as potential target regions and received by fine decoding layers as mid-target 
queries; the coarse stage is important for improving the accuracy of the fine stage and can reduce the computation cost compared with directly processing high resolution features; in the fine stage, based on the suggested potential regions, FR-DETR makes detailed predictions using a feature map with 1/16 of the original image resolution, which is the output of the ResNet C4 layer and is twice the size of the feature map used in the coarse stage; in general, fine decoding is similar to that in the coarse stage; the main difference is that it processes information with more details and focuses on the suggested regions to conclude predictions, making the fine stage crucial for accomplishing precise fine-grained detection; in the prediction process, each type of final query is fed into its corresponding FFN that consists of a classifier and a regressor to predict the category confidence p and the coordinates of every target; if a final query belongs to symbol detection, the coordinates prediction is in the format of a bounding box b = (cx, cy, w, h), which denotes the center point, width and height of the box; otherwise, a line segment prediction l is represented by two endpoints (p1, p2), where p1 = (x1, y1) and p2 = (x2, y2); the set of predictions t̂ has N targets, and the set of ground truth t has M elements, normally N > M; to assign a bipartite matching between the predictions and ground truth, t is assumed to be a set padded with "no object" (∅) entries to meet the size of t̂; in this case, an optimization for the bipartite matching is used to find the permutation with the lowest cost using eqns. (1)-(4); for the two detection tasks, each task loss must also evaluate the results of classification; by adding a cross-entropy loss term, each task loss can be represented as eqns. (5)-(6); the total loss of FR-DETR is formulated as eqn. (7); Section IV.A with Table 1 in Page 64297: the original CLEF-IP dataset released in 2011 contains machine-generated flowchart images and other structural diagrams; after removing non-flowcharts, approximately 60 images remain, most of which are provided by the European Patent Office (EPO) for the patent retrieval study; these flowcharts have a simple white background, and their structures are drawn in black; the dataset has three main symbols: rectangle for processing action, diamond for decision, and oval for terminator; to enrich the symbol category and structural complexity, public flowchart images are collected through the Internet by using image search engines, such as Google Image, Bing Image and Baidu Image; after filtering low quality images and removing duplicates, a dataset containing more than 1,000 images is constructed; the new dataset includes 25K+ symbol instances and 20K+ line segments; statistical details are shown in Table 1; the FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images; data augmentation methods, including random resize, random flip, and random crop, are applied throughout all experiments). Masry and SUN are analogous art because they are from the same field of endeavor, namely systems and methods for recognizing charts using transformers. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of SUN to Masry. The motivation for doing so would have been to extend the capability of document analysis and recognition.
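The bipartite matching between predicted and ground-truth targets quoted above from SUN (Section III.D, eqns. (1)-(4)) can likewise be illustrated with a minimal Python sketch of the general DETR-style matching idea. It is an assumption-laden illustration, not a reproduction of FR-DETR's actual loss: the equal weighting of the classification and box terms, the L1 box distance, and the zero cost assigned to padded "no object" slots are illustrative choices, and eqns. (1)-(7) of the reference are not reproduced here.

import numpy as np
from scipy.optimize import linear_sum_assignment


def match_targets(pred_probs: np.ndarray, pred_boxes: np.ndarray,
                  gt_labels: np.ndarray, gt_boxes: np.ndarray) -> list:
    """Assign each of the N predictions to a ground-truth target or to 'no object'.

    pred_probs: (N, num_classes) class probabilities; pred_boxes: (N, 4) boxes (cx, cy, w, h);
    gt_labels: (M,) class indices; gt_boxes: (M, 4); normally N > M, so the ground truth
    is padded with 'no object' columns up to size N before the assignment is solved.
    """
    n, m = pred_probs.shape[0], gt_labels.shape[0]
    cost = np.zeros((n, n))  # padded 'no object' columns are assumed to cost 0
    for j in range(m):
        class_cost = -pred_probs[:, gt_labels[j]]                # higher confidence, lower cost
        box_cost = np.abs(pred_boxes - gt_boxes[j]).sum(axis=1)  # L1 distance between boxes
        cost[:, j] = class_cost + box_cost                       # equal weights (assumption)
    pred_idx, col_idx = linear_sum_assignment(cost)              # lowest-cost permutation
    # Keep only pairs matched to real targets; remaining predictions count as 'no object'.
    return [(int(p), int(c)) for p, c in zip(pred_idx, col_idx) if c < m]


if __name__ == "__main__":
    # Three predictions against two ground-truth flowchart symbols (0 = rectangle, 1 = diamond).
    probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
    boxes = np.array([[0.10, 0.10, 0.20, 0.20], [0.60, 0.60, 0.10, 0.10], [0.40, 0.40, 0.30, 0.30]])
    gt_labels = np.array([0, 1])
    gt_boxes = np.array([[0.10, 0.10, 0.20, 0.20], [0.60, 0.60, 0.10, 0.10]])
    print(match_targets(probs, boxes, gt_labels, gt_boxes))  # expected: [(0, 0), (1, 1)]

In FR-DETR the matched pairs then feed the classification and regression terms of the task losses in eqns. (5)-(7); those terms are omitted from this sketch.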
Independent Claim 19 Masry discloses a system for providing answers to questions posed about (Masry, Abstract in Page 1: charts are very popular for analyzing data; when exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations; they also commonly refer to visual features of a chart in their questions; to address the unique challenges in the benchmark involving visual and logical reasoning over charts, present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions; while the present models achieve the state-of-the-art results on the previous datasets as well as on the benchmark, the evaluation also reveals several challenges in answering complex reasoning questions; 2nd paragraph of Section 1 in Page 1: the goal of a Chart Question Answering (ChartQA) system is to help users by taking a chart and a natural language question as input and predicting the answer; Section 4.1 with FIG. 2 in Page 5: the overall process of the ChartQA system is shown in Fig. 2), comprising: a synthetic dataset generation module adapted to generate a plurality of synthetic (Masry, Section 1 in Pages 1-2: the charts are created automatically using a programming tool like Matplotlib (Singh and Shekhar, 2020); benchmark consists of 20,882 charts which are curated from four different online sources to ensure variety in visual styles and topics; propose an approach that combines visual features and extracted data from the chart image; pipeline first extracts the underlying data table from the chart image by adapting the ChartOCR model (Luo et al., 2021) as well as the visual features from the chart image using neural models; then, adapt two transformer-based QA models where utilize both the extracted data table and visual features of the chart in a unified way; "Chart Data Extraction" of Section 2 in pages 2-3: (Siegel et al., 2016) proposed fully automatic chart data extraction pipelines; Luo et al. 
(2021) also automatically extract data from real-world charts with high accuracy; extend their pipeline to extract the fully-structured data table to pass it to models; Section 3.1 in Page 3: crawled charts from four different sources: (i) Statista (statista.com) is an online platform that presents charts covering a variety of topics including economy, politics, and industry; (ii) the Pew research (pewresearch.org) publishes report about social and economic issues, demographic trends and public opinion with a wide variety of charts; (iii) Our World In Data or OWID (ourworldindata.org) is another platform that contains thousands of charts about different global issues such as economy, finance, and society; (iv) Organisation for Economic Co-operation and Development or OECD (oecd.org) is a global organization which shares reports and data analysis for policymaking; for the Pew dataset, only crawled chart images since the underlying data tables are not available; for the other three, extracted the underlying data tables, metadata (e.g., title, chart type), SVG file and associate text description; finally, extracted the bounding boxes information of the different chart elements (e.g., x-axis labels) from the SVG files to train our data extraction models; Section 3.3 with Table 3 in Pages 4-5: dataset has three commonly used chart types: bar, line, and pie charts (Table 3); categorize the bar and line charts into simple vs complex where data tables of simple charts have only two columns where complex charts involve multiple columns (e.g., stacked or grouped bars and multi-line charts); "Model" of Section A.3 in Page 12 with FIGS. 7-8 in Page 14: extend ChartOCR (Luo et al., 2021) which relies on both deep-learning models and rule-based techniques to parse the chart image into the underlying data table; key-point detection networks, adapted from (Law and Deng, 2019), locates the chart visual marks (e.g. bars, plot area, line points); ideally, the network locates the top-left point and bottom-right points for the rectangular objects (e.g. bar, plot area); in line charts, the detection network locates the coordinates of the points connecting the line segments; in pie charts, the network locates the intersection points between the pie segments along the pie perimeter; extend their detection networks to also locate the chart textual elements (e.g. x-axis-label, legend-label ) as shown in Figure 7a and utilize the CRAFT model (Baek et al., 2019) to read their underlying texts; the chart scale is estimated using the y-axis-labels value for line and bar charts, Figure 7b; for pie charts, the value of each segment is estimated by calculating the angle between its borderlines; finally, the model aggregates the extracted data values (using color and proximity heuristics) to output the final raw data values.; extend their approach to extract the fully-structured data table with the textual labels (e.g. 
column headers); as shown in Figure 7, associate the estimated bars data values (e.g., ‘17.13’, ‘40.14’) with their closest x-axis-label (’Snapchat’); moreover, if the chart has more than one data series (dark bars or blue bars values), each data series is matched with its legend-label (e.g., ‘2016’, ‘2014’) based on the color of the legend mark and data-encoding marks (e.g., bars); if cannot match data values with legends by colors (e.g., when all legend marks have the same color or there are no legend marks), use other criteria that associate data-encoding marks with legend marks (e.g., proximity, alignment); e.g., in Figure 8b, ’More’ is matched with ’17’ and ’29’ since they are vertically aligned; similarly, for line charts if there is no explicit legend mark for a line series associate the legend labels with the points of their closest lines as shown in Figure 8a; 1st paragraph of Section 1 with FIG. 1 in Page 1: Figure 1 displays a line chart with complex reasoning questions about the line chart involving arithmetic and logical operations; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: FigureQA and DVQA use synthetically-generated data to plot the charts; LEAF-QA and LEAFQA++ use real-world data to plot the charts; the charts are plotted using a software in PlotQA (Methani et al., 2020); Section A.1 in Page 12 with FIG. 5 in Page 13: the data collection interface is shown in Figure 5; while presenting the chart in the user interface for annotation task, ensure that the data labels of chart elements are visible to workers so that they can accurately perform the necessary arithmetic and logical operations to provide and answer the questions successfully) and a plurality of questions, possible answers, and correct answer tuples from associated graph data (Abstract in Page 1: present a large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries; Section 1 with FIG. 1 in Pages 1-2: Figure 1 illustrates sample questions in the benchmark; e.g., the question Q1 in Fig. 1 requires the user to compute the differences between the two lines for each year and find the year with the highest difference; Q2 in Fig. 1 refers to the color of a mark (‘line’) and its attribute (‘peak’) in the chart; the questions are generated automatically using pre-defined templates (Kahou et al., 2017; Kafle et al., 2018; Chaudhry et al., 2020; Singh and Shekhar, 2020); present a largescale benchmark covering 9,608 human-written questions focusing on logical and visual reasoning questions; generate another 23,111 questions automatically from human-written chart summaries using a T5 model (Raffel et al., 2020) and manually validated a subset of it for quality assurance; collect a large number of questions automatically while maintaining rich variations in language as they were generated from human-written summaries; "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: ChartQA differs from previous datasets in two main aspects: the questions’ types (human-authored vs. template-based) and the chart source (real-world vs. generated using a tool); a detailed comparison is shown in Table 1; earlier datasets such as FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), LEAF-QA (Chaudhry et al., 2020) and LEAF-QA++ (Singh and Shekhar, 2020) are mostly synthetic where the questions are generated using a small number of templates and the answers come from a fixed set of vocabulary (e.g. 
‘yes’, ‘no’); PlotQA (Methani et al., 2020) is the only dataset with open-vocabulary questions that require applying aggregation operations on the underlying chart data; Section 3.2 with Table 2 in Pages 3-4: two main annotations procedures: (i) collect human-authored QA pairs using Amazon Mechanical Turk (AMT) and (ii) generate QA pairs from the Statista human-written summaries; to create human-authored QA pairs, designed an AMT task (see A.1 for details) in which the crowdworkers are asked to focus on two types of questions for each chart image: compositional and visual questions; compositional questions contain at least two mathematical/logical operations like sum, difference and average, while visual questions refer to the visual attributes such as color, height, and length of graphical marks (e.g., bars) in the chart; prior work on QA has performed data augmentation by either creating template-based or machine generated questions; fine-tune a pre-trained T5 model on the SQuAD QA dataset (Rajpurkar et al., 2016) and apply to the human-written chart summaries that come with the charts from Statista to automatically generate questions that are human-like with sufficient lexical and syntactic variations; applying two T5 models: one for answer extraction and the other for answer-aware question generation; for answer extraction, the T5 model is trained to generate possible answers separated by [SEP] token given the textual summary as input (i.e., trained on SQuAD’s passage → answer pairs); for question generation, the proposed answer is first concatenated with the summary in the format: Answer: Answer Context: Chart Summary; then, generate a question from the given question using the chart summary; since the summaries are human-written, the generated questions are similar to the human-authored questions (see example questions in A.7); to filter out invalid questions, developed a simple heuristic where filter out the question if the answer cannot be found in the chart data table; randomly split both of the human-written (ChartQA-H) and machine generated (ChartQA-M) QA pairs into train, validation, and test sets as shown in Table 2; Section 3.3 with Table 4 in Pages 4-5: analyzed the basic linguistic statistics about our benchmark (see A.2) which has more unique tokens on both types of QA pair; observe that questions cover a variety of syntactic structure and sometimes exhibit informal languages and typos; the topic distribution in our data is quite diverse as it is constructed from four different sources; to analyze the nature of questions, randomly selected 300 QA pairs from the benchmark and categorized them into four types (Table 4); the vast majority of questions (76.33% in total) are either compositional or both visual and compositional, which reflects the real-world scenarios where people ask complex reasoning questions; people make visual references to a variety of visual attributes of marks (see A.2), most commonly to color (e.g., ‘orange line’) and length (e.g., ‘tallest bar’) followed by size (e.g., ‘largest slice’) and position (e.g., ‘leftmost bar’); Section A.2 with Tables 7-8 in Page 12 and FIG. 
6 in Page 14: Table 7 shows some linguistic statistics about the benchmark; Figure 6 shows the distribution of topics in the dataset for each of the four sources; analyzed how people make visual references to charts in their questions; Table 8 shows the usage of visual references made in the randomly selected 300 QA pairs; Section A.7 with Table 12 in Pages 16-17: sample machine-generated questions with the human-written summaries are shown in Table 12; Section 1 with FIG. 1 in Pages 1-2: data visualizations such as bar charts and line charts have become popular in analyzing data and making informed decisions; to analyze data, often people ask complex reasoning questions about charts involving arithmetic and logical operations (Kim et al., 2020); answering such questions requires a significant amount of perceptual and cognitive efforts as people need to combine multiple operations such as retrieving values, comparing values, finding maximum, calculating sums and differences of values; the answer for a complex reasoning question is derived through various mathematical operations such as aggregation and comparison; Figure 1 illustrates a line chart with the correct answer to each question; Section 3.2 in Pages 3-4: for each chart, the workers provide two questions with the answers; the same questions are then answered by another annotator; if both workers’ answers exactly match, consider the answer to be correct; otherwise, manually check the answers to select the final correct answer; Section 5.1 in Page 7: following Methani et al. (2020), use a relaxed accuracy measure for the numeric answers to allow a minor inaccuracy that may result from the automatic data extraction process; consider an answer to be correct if it is within 5% of the gold answer; for non-numeric answers, still need an exact match to consider an answer to be correct; Section A.1 in Page 12: in each HIT (Human Intelligence Task), the workers verify two previously asked questions by other workers and also provide two new QA pairs; to ensure quality, selected workers with an acceptance rate of 95% and total accomplished HITs of 5000; filtered the workers by giving them a pretest to select the best qualified workers for this task; "Evaluation Metric" of Section A.3 with Table 9 in Pages 12 and 14-15: evaluation metric is adapted from ChartOCR (Luo et al., 2021); the distance between any two data values is estimated as D(gt, pr) = min(1, |gt - pr| / |gt|), where gt is the ground truth value and pr is the predicted value; for each chart, the cost matrix C, where C_{n,m} = D(gt_n, pr_m), is computed and the total minimum cost is calculated by solving the linear sum assignment problem Cost = Σ_{i=1}^{K} Σ_{j=1}^{K} C_{i,j} · X_{i,j}, where K = max(N, M) and X is a binary assignment matrix; the final overall score is then estimated as Overall Score = (1/L) · Σ_{i=1}^{L} (1 - cost_i / K_i), where L is the total number of charts; evaluation results are shown in Table 9; Section A.7 with FIG.
Section A.7 with FIG. 10 in Pages 16-17: sample predictions from the model VisionTaPas on ChartQA test set are shown in Figure 10); and a vision-language machine learning model trained on the synthetic dataset to answer questions about input (Masry, Abstract in Page 1: to address the unique challenges in the benchmark involving visual and logical reasoning over charts, present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions; while the models achieve the state-of-the-art results on the previous datasets as well as on the benchmark, the evaluation also reveals several challenges in answering complex reasoning questions; Section 1 in Pages 1-2: generate a large-scale ChartQA dataset with real-world charts and human-authored question-answer pairs for training; a pipeline approach that combines visual features and automatically extracted data from charts to utilize in transformer-based QA models that provide state-of-the-art results; implement an extensive analysis and evaluation of the performance of our models; Section 3.1 in Page 3: extracted the bounding box information of the different chart elements (e.g., x-axis labels) from the SVG files to train data extraction models; Section 3.2 in Pages 3-4: large-scale language models like T5 (Raffel et al., 2020) which are trained on very large data from various web sources can learn general linguistic properties and variations (Brown et al., 2020); the process involves training and applying two T5 models: one for answer extraction and the other for answer-aware question generation; for answer extraction, the T5 model is trained to generate possible answers separated by a [SEP] token given the textual summary as input (i.e., trained on SQuAD’s passage → answer pairs); for question generation, the proposed answer is first concatenated with the summary in the format: Answer: Answer Context: Chart Summary; then, the T5 model is trained to generate a question from the given answer and the chart summary; this model is trained on SQuAD’s (passage, answer) → question pairs;
Section 4.1 with FIG. 2 in Page 5: consider two problem settings for ChartQA; the first setting assumes that the underlying data table of the chart image is available; formally, given a dataset with N examples D = {(c_i, t_i, q_i, a_i)}_{i=1}^{N}, where c_i represents a chart image, t_i represents the underlying data table, q_i represents a question over c_i, and a_i represents the answer to the question; the ChartQA models learn to predict the answer a_i given c_i, t_i and q_i; the gold data tables are not generally accessible in most real-world scenarios; thus consider the second setup where the underlying data table t_i for chart image c_i is extracted by adapting the state-of-the-art ChartOCR (Luo et al., 2021); ChartOCR first locates the main elements of the chart image (e.g., plot area, title) as well as data-encoding marks (e.g., bars) using key-point detection networks; it then uses the detected keypoints of each mark along with axis-labels to estimate the data value of that mark; however, it does not associate the predicted data values with corresponding text labels (e.g., x-axis-label); hence, extend their approach to output the fully-structured data tables; utilize the CRAFT (Baek et al., 2019) model to recognize the texts in the chart elements; then, associate the data values with their text labels using positional and color information (see A.3 for details); Section 4.2 with FIG. 3 in Pages 5-6: the approach to ChartQA builds on two of the state-of-the-art TableQA models: T5 (Raffel et al., 2020; Nan et al., 2021) and TaPas (Herzig et al., 2020); the input to these models consists of the question q_i and the data table t_i; different from TableQA, ChartQA often involves extracting visual information from chart images; for this, also experiment with the visual counterparts of the TableQA models that also take the chart image features into account; while T5 has a visual variant, VL-T5 (Cho et al., 2021), TaPas does not; in this work, extend TaPas to consider the image features and call it VisionTaPas; more details on models are provided in A.5; T5 (Raffel et al., 2020) is an encoder-decoder model which unifies the NLP tasks as text-to-text generation using the same architecture and loss function; it has been pre-trained on a massive amount of unlabelled data with a self-supervised denoising objective; to fine-tune T5 on the ChartQA task, flatten the data table and feed it along with the question as: "Question: Question tokens Table: Flattened table tokens", and the model is trained to generate the answer directly; VL-T5 (Cho et al., 2021) is an extension of T5 that unifies the Vision-Language (VL) tasks as text generation conditioned on multimodal inputs; the input consists of both textual tokens and visual features of the objects extracted from the image using Faster R-CNN (Ren et al., 2015); the model is pre-trained on multiple multimodal tasks such as language modeling, visual QA, and visual grounding; utilize VL-T5 for the ChartQA task in the following manner: (a) for the textual input, do the same as T5: flatten the data table of the chart image and concatenate it with the question text; (b) for the visual input, extract the visual features of different marks in the chart image (e.g., bars, lines) using Mask R-CNN (He et al., 2017) with Resnet-101 as its backbone (see A.4 for details); unlike the original VL-T5 where a fixed number of objects is provided (36), the number of elements varies from one chart to another; to account for this, pad the extracted visual features with zeros to have a fixed length of 36;
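For illustration only, a minimal sketch of the input construction described above (Masry, Section 4.2): the flattened-table prompt fed to T5 and the zero-padding of per-chart visual features to a fixed length of 36 for VL-T5. The table serialization (the " | " and " & " separators) is an assumption; the record does not quote an exact format.

```python
# Illustrative sketch (not from the record) of the flattened-table prompt
# and the fixed-length visual-feature padding described for T5 / VL-T5.
import numpy as np

def flatten_table(header: list[str], rows: list[list[str]]) -> str:
    # Assumed serialization: columns joined with " | ", rows joined with " & ".
    flat_rows = " & ".join(" | ".join(str(c) for c in row) for row in rows)
    return " | ".join(header) + " & " + flat_rows

def chartqa_prompt(question: str, header: list[str], rows: list[list[str]]) -> str:
    # "Question: <question tokens> Table: <flattened table tokens>"
    return f"Question: {question} Table: {flatten_table(header, rows)}"

def pad_visual_features(features: np.ndarray, max_objects: int = 36) -> np.ndarray:
    # `features` holds one d-dimensional vector per detected chart mark
    # (bar, line, ...); charts with fewer marks are padded with zeros,
    # charts with more marks are truncated to 36.
    n, d = features.shape
    padded = np.zeros((max_objects, d), dtype=features.dtype)
    padded[: min(n, max_objects)] = features[:max_objects]
    return padded

# Usage example with hypothetical data:
# prompt = chartqa_prompt("What is the difference between 2019 and 2020?",
#                         ["Year", "Value"], [["2019", "10"], ["2020", "14"]])
```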
TaPas (Herzig et al., 2020) extends a BERT (Devlin et al., 2019) architecture with additional positional embeddings for rows and columns to encode a table; as shown in Fig. 3a, the input to the model has the following format: [CLS] Question tokens [SEP] Flattened table tokens; the tokens are encoded with the table-specific positional embeddings in addition to BERT’s segment and positional embeddings; the model has two output heads: an aggregation operation head and a cell selection head; the aggregation operation head predicts an operation (e.g., COUNT, SUM, AVERAGE, NONE) which is then applied to the cell values selected by the cell selection head; depending on the operation type, the selected cells can constitute the final answer or the input used to infer the final answer; TaPas is first pre-trained on a masked language modeling objective using table-text pairs crawled from Wikipedia where table cells are randomly masked and the model is trained to predict them; it is then fine-tuned in a weakly-supervised manner (using answers as the only supervision) with end-to-end differentiable objectives; VisionTaPas is our extension of TaPas for QA over charts; it consists of three main components: a vision transformer encoder for encoding the chart image, a TaPas encoder for encoding the question and data table, and a cross-modal encoder (Fig. 3b); Vision Transformer or ViT (Dosovitskiy et al., 2021) utilizes the transformer encoder architecture (Vaswani et al., 2017) in vision tasks; given a 2D chart image, the image is divided into a sequence of 2D patches {p_1, …, p_n}; each patch is then flattened and linearly projected into a d-dimensional embedding vector; to incorporate the positional information of the patches, 1D learnable positional embeddings are added to the image features; an L-layer ViT encoder produces a sequence of embeddings H = {h_cls^L, h_1^L, …, h_n^L} representing the special [CLS] token and the image patches; initialize the ViT module with the pre-trained weights from (Dosovitskiy et al., 2021); the Cross-modality Encoder takes the outputs of the ViT and TaPas encoders (H and Z) and computes multimodal encodings; it has four blocks, each containing a visual branch and a textual-tabular branch; the input first passes through the multiheaded cross attention layers in parallel, where in the visual branch the query vectors are the visual features, and the key and context vectors are the textual-tabular features and vice versa in the textual-tabular branch; the cross-attended features are then passed through a self-attention layer followed by a fully connected layer; similar to the transformer model, each layer applies layer normalization (Ba et al., 2016) and is wrapped with a residual connection; finally, append the aggregation operation and the cell selection heads of TaPas to the final layer of the textual-tabular branch; many questions in the ChartQA dataset require performing a subtraction or ratio operation, which the original TaPas model does not support; thus extend the operation head to add those two operations (Fig. 3b); however, instead of training them in a weakly-supervised manner based on the final answer (as done in TaPas), find it more effective when provided with more direct but potentially noisy supervision on the cells to consider; rely on some heuristics to generate such supervision in the training data;
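For illustration only, a minimal PyTorch sketch of one cross-modality block of the kind described for VisionTaPas above: parallel cross-attention between the visual and textual-tabular branches, followed by self-attention and a feed-forward layer, each wrapped with a residual connection and layer normalization. Dimensions, head counts, activation, and normalization placement are assumptions, not taken from the record.

```python
# Illustrative sketch (not from the record) of a cross-modality block with
# a visual branch and a textual-tabular branch, as described above.
import torch
from torch import nn

class CrossModalBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.cross_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff_vis = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))
        self.ff_txt = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(6)])

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # Cross-attention: each branch queries the other modality.
        v = self.norms[0](vis + self.cross_vis(vis, txt, txt, need_weights=False)[0])
        t = self.norms[1](txt + self.cross_txt(txt, vis, vis, need_weights=False)[0])
        # Self-attention within each branch.
        v = self.norms[2](v + self.self_vis(v, v, v, need_weights=False)[0])
        t = self.norms[3](t + self.self_txt(t, t, t, need_weights=False)[0])
        # Position-wise feed-forward layers.
        v = self.norms[4](v + self.ff_vis(v))
        t = self.norms[5](t + self.ff_txt(t))
        return v, t

# Usage: vis = ViT patch embeddings (B, n+1, d); txt = TaPas token encodings (B, m, d)
# block = CrossModalBlock(); vis_out, txt_out = block(vis, txt)
```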
to handle the fixed vocabulary answers (e.g., ‘Yes’, ‘No’), further extend the operation head to include those classes; "ChartQA Dataset" of Section 5.2 with Table 6 in Pages 7-8: evaluate the transferability of the models and the datasets, where first pretrain the two top performing models (VisionTaPas and VL-T5) on the PlotQA dataset and then fine-tune them on ChartQA; large datasets like PlotQA can be useful for pretraining the model even if the questions are generated from a small number of templates; Section A.4 in Page 15: train the model to detect the following 15 objects: ’Legend’, ’yAxisTitle’, ’ChartTitle’, ’xAxisTitle’, ’LegendPreview’, ’PlotArea’, ’yAxisLabel’, ’xAxisLabel’, ’LegendLabel’, ’PieLabel’, ’bar’, ’pie’, ’pieSlice’, ’line’, and ’dotLine’; for the bounding box annotations, use the available bboxes; for the masks, generate them easily using the bounding boxes for all the rectangular objects; for ’pieSlice’ and ’pie’, follow a similar approach to (Singh and Shekhar, 2020) where the masks are generated by projecting the radius along the pie perimeter from the starting to the ending points of each slice; use the detectron2 library (Wu et al., 2019) and initialize the model with pre-trained weights on the COCO dataset (Lin et al., 2014); fine-tune the model with a batch size of 8 and an initial learning rate of 0.00025 for 50K iterations; Section A.5 with FIG. 9 in Pages 15-16: T5 and VL-T5 fine-tuning process setup is shown in Figure 9). Masry fails to explicitly disclose wherein charts are flowcharts. SUN teaches a system and a method relating to recognizing charts using a transformer (SUN, ABSTRACT), wherein charts are flowcharts (SUN, ABSTRACT in Page 64292: propose an end-to-end multi-task network FR-DETR (Flowchart Recognition DEtection TRansformer) and a new dataset for precise and robust flowchart recognition; FR-DETR comprises a CNN backbone and a shared multi-scale Transformer structure to perform symbol detection and edge detection using shared feature maps and respective prediction heads in a coarse-to-fine refinement process; the coarse stage analyzes features with low resolution and suggests candidate regions that contain potential targets for the fine stage to produce accurate predictions using features with high resolution; meanwhile, a new dataset is constructed to provide more symbol types and complex backgrounds for network training and evaluation, which contains more than 1000 machine-generated flowchart images, 25K+ symbol instances with nine categories, and 20K+ line segments; Section I with FIG.
1 in Pages 64292-64293: flowchart recognition is an essential sub-task in research on document analysis and recognition; the critical problem of flowchart recognition is to recognize and refine the structural semantics of flowcharts; there are two study areas for flowchart recognition: handwritten flowchart recognition and machine-generated flowchart recognition; understanding the structural semantics of machine-generated flowchart images is crucial for many structural-semantic-based tasks, such as patent retrieval, automatic code generation, and task-oriented dialogue systems; existing methods for recognizing machine-generated flowchart mainly extract the entire structure by analyzing the connected components in images and then identify specific structures using manually chosen features; flowchart structures now have more varied and colorful symbols and more complex backgrounds, leading to problems such as decrements in recognition accuracy as well as incompatibility and inflexibility of manually chosen features; deep-learning-based computer vision technologies are capable of focusing on desired targets, which can be summarized into symbols and edges in flowcharts; the flowchart recognition task can be divided into recognizing symbols using object detection and detecting edges using line segment detection; as shown in Fig. 1, symbols and arrows are indicated by bounding boxes and edges are indicated by line segments; in flowchart recognition, information processed by different tasks is inherently correlated, such as symbols are connected by edges and edges are connecting symbols; believe a multi-task model is more suitable for flowchart recognition because (a) it enables information to be shared between two tasks of object detection and line segment detection, which can simplify the network structure; and (b) it reduces the training and analysis process by managing two tasks simultaneously; proposes an end-to-end multi-task network architecture named FR-DETR, the first deep learning system for machine-generated flowchart recognition to the best of our knowledge; by fusing DETR (DEtection TRansformer) and LETR (LinE segment TRansformers), FR-DETR has a CNN backbone and a multi-scale Transformer structure to perform fine-grained symbol detection and edge detection respectively; to satisfy the requirements of data volume and complexity for model training, a new flowchart dataset is also constructed; the machine-generated flowchart recognition can be accomplished using deep-learning-based object detection and line segment detection to handle the increasing symbol diversity and background complexity of flowchart images; propose an end-to-end multi-task learning network that combines object detection and line segment detection to reduce the costs caused by separate models; the model jointly detects symbols and edges in flowcharts and employs a multi-scale Transformer structure to improve the recognition accuracy of both tasks; the proposed method achieves a better recognition accuracy that outperforms the prior machine-generated flowchart recognition methods; a new machine-generated flowchart dataset is established to address data shortages for deep learning model training; Section II.A in Pages 64293-64294: Rusiñol et al. 
[7], [8] summarized flowcharts as structure layer and text layer, and then performed recognition after layer separation; Zhang [23] proposed a corner-based structural model (CBSM) based on the analysis of different corners and symbol shapes; the CBSM recognizes symbols by defining corner classification and corner-based spatial constraints for each kind of graphic shapes; Carton et al. [27] fused structural and statistical information to compute grammatical descriptions for each type of symbol; Lemaitre et al. [28] analyzed flowchart structure based on the description and modification of segment (DMOS) and structural information; Bresler et al. [29] solved a max-sum model optimization task to obtain the best symbol description; Schäfer et al. [30] proposed Arrow R-CNN to detect symbols and connecting edges by adding a keypoint head to Faster R-CNN; the keypoint head is designed to detect the arrow and tail belonging to a connecting edge as two keypoints; arrow R-CNN was the first deep-learning-based flowchart recognition approach; Section II.B in Page 64294: Carion et al. [14] introduced a Transformer based end-to-end object detection network DETR that removed pre-designed anchor and non-maximum suppression (NMS); it detects objects using interactions between a fixed number of queries and encoded image features; following the basic query concept of DETR, Xu et al. [19] proposed a Transformer based line segment detector LETR; unlike the prior line segment detection approaches consisting of heuristics-guided steps, the LETR detector directly regressed the endpoints of a line segment and achieved state-of-the-art performance on relative line segment detection datasets; Section II.C in Page 64294: deep-learning-based MTL (multitask learning) models aim to improve network generalization and the capability of jointly learning shared information; compared with single-task models, multi-task networks have advantages such as a reasonable reduction in model size and increment in inference speeds by sharing inherent parts of network structure; Section III.A-D with FIGS. 2-3 in Pages 64294-64296: as shown in Fig. 2, although the representation of connecting edges is adjusted to an entire connection between two symbols, errors in detecting edges and keypoints still occur in Arrow R-CNN; based on the structural analysis of connecting edges, using line segment representation can better handle the dilemma in which Arrow R-CNN is stuck; recently, a Transformer-based line segment detector LETR [19] suggested a model that directly regresses the endpoints coordinates of each line segment; the attention mechanism introduced by Transformer perfectly meets the need to distinguish line segments between wanted and unwanted ones; inspired by the aforementioned methods, with the aim of improving the flowchart recognition accuracy and reducing the costs of using isolated models, modify LETR into a multi-task model to jointly accomplish the two detection tasks; the model selected to perform symbol detection is DETR; the reasons can be concluded as follows: (a) DETR has a similar Transformer-based structure to LETR, which maximizes structure sharing between the two models and the overall cost reduction; (b) Other CNN-based object detection models and LETR have few shareable parts, which unavoidably results in insufficient structure sharing and difficulties in jointly analyzing features; the overall architecture of FR-DETR is designed based on a multi-scale Transformer encoder-decoder structure as depicted in Fig. 
3; the proposed flowchart recognition process can be divided into four sub-tasks: feature extraction, feature encoding, feature decoding and target prediction; FR-DETR uses a CNN backbone to extract a feature map from an input image; the channel dimension of the feature map f0 is then reduced from C to a smaller dimension d to obtain a new feature map f by 1 × 1 convolution; to meet the encoder's requirement, which expects input in the sequence format, the feature map f is flattened to create another feature map z; the Transformer encoder is stacked with multiple encoding layers; each layer consists of a multi-head self-attention module and a feed-forward network (FFN); the encoding layers receive processed features from their predecessor layer and deliver the output features to the corresponding FFN after learning the pairwise relations between the input and output; in general, the flattened feature map z is encoded into a new feature map z'; the positional encoding of f is added to guarantee the flattened feature map z not to lose the spatial relations; following the standard architecture of Transformer, the decoder transforms each N embeddings of symbol and line segment using the multi-head self-attention and cross-attention module; like the positional encoding of the encoder, the input embeddings are learnable positional information that is added to the input of each layer and named as target queries; each decoding layer receives z' from the last encoding layer and two types of target queries b and l, namely symbol queries and line queries, from its predecessor decoding layer; both types of queries are first processed by the self-attention model, and then, each entity in the queries is assigned to different regions of positional encoding by the cross-attention module; the output of the decoder is then used to predict the final results using an FFN; the final results of symbols and line segments are predicted by an FFN; specifically, the coordinates of symbols and line segments are computed by a multi-layer perceptron (MLP) with three layers, and the confidence of the predicted targets is produced by a linear projection layer; in contrast to object detection, which mainly focuses on local and neighborhood regions, line segment detection needs to consider the fine-grained local features and the global information; an efficient way to tackle the problem is a sequential two-stage structure, whose former component produces suggested regions for the other component to perform exact detection; following this idea, FR-DETR performs both the desired tasks in a refinement process using a coarse-to-fine strategy; this strategy enables FR-DETR to learn from multi-scale image features and produce precise predictions; in general, the model first analyzes the global information to locate possible targets coarsely and then uses the location information to examine local features and perform fine-grained recognition; in the coarse stage, FR-DETR studies a low resolution feature map to identify potential regions containing symbols and line segments; the low resolution feature map sent into the coarse encoder is the output of the ResNet [33] C5 layer, and its size is 1/32 of the original image resolution; after the encoding process, the encoded features and init-target queries are then passed into each decoding layer's cross-attention module and self-attention module; the predictions produced by the coarse stage are considered as potential target regions and received by fine decoding layers as mid-target 
queries; the coarse stage is important for improving the accuracy of the fine stage and can reduce the computation cost compared with directly processing high resolution features; in the fine stage, based on the suggested potential regions, FR-DETR makes detailed predictions using a feature map with 1/16 of the original image resolution, which is the output of the ResNet C4 layer and is twice the size of the feature map used in the coarse stage; in general, fine decoding is similar to that in the coarse stage; the main difference is that it processes information with more details and focuses on the suggested regions to conclude predictions, making the fine stage crucial for accomplishing precise fine-grained detection; in the prediction process, each type of final query is fed into its corresponding FFN that consists of a classifier and a regressor to predict the category confidence p and the coordinates of every target; if a final query belongs to symbol detection, the coordinates prediction is in the format of a bounding box b = (cx, cy, w, h), which denotes the center point, width and height of the box; otherwise, a line segment prediction l is represented by two endpoints (p1, p2), where p1 = (x1, y1) and p2 = (x2, y2); the set of predictions t̂ has N targets, and the set of ground truth t has M elements, normally N > M; to assign a bipartite matching between the predictions and ground truth, t is assumed to be a set padded with no-object (∅) entries to match the size of t̂; in this case, an optimization for the bipartite matching is used to find the permutation with the lowest cost using eqns. (1)-(4); for the two detection tasks, each task loss must also evaluate the results of classification; by adding a cross-entropy loss term, each task loss can be represented as eqns. (5)-(6); the total loss of FR-DETR is formulated as eqn. (7); Section IV.A with Table 1 in Page 64297: the original CLEF-IP dataset released in 2011 contains machine-generated flowchart images and other structural diagrams; after removing non-flowcharts, approximately 60 images remain, most of which are provided by the European Patent Office (EPO) for the patent retrieval study; these flowcharts have a simple white background, and their structures are drawn in black; the dataset has three main symbols: rectangle for processing action, diamond for decision, and oval for terminator; to enrich the symbol category and structural complexity, public flowchart images are collected through the Internet by using image search engines, such as Google Image, Bing Image and Baidu Image; after filtering low quality images and removing duplicates, a dataset containing more than 1,000 images is constructed; the new dataset includes 25K+ symbol instances and 20K+ line segments; statistical details are shown in Table 1; the FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images; data augmentation methods including random resize, random flip, and random crop are applied throughout all experiments). Masry and SUN are analogous art because they are from the same field of endeavor, a system and a method relating to recognizing charts using a transformer. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the teaching of SUN to Masry. Motivation for doing so would be to extend the capability of document analysis and recognition.
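For illustration only, a minimal sketch of the DETR-style bipartite matching step recited above (SUN, Section III): the M ground-truth targets are padded with no-object entries to the number of predictions N, and the lowest-cost permutation is found by a linear sum assignment (Hungarian matching). The cost terms below (one minus the class probability plus an L1 box distance) and the no-object cost are simplified assumptions and are not SUN's eqns. (1)-(4).

```python
# Illustrative sketch (not from the record) of bipartite matching between
# N predictions and M ground-truth targets with no-object padding.
# Assumes N >= M, as stated in the passage above.
import numpy as np
from scipy.optimize import linear_sum_assignment

NO_OBJECT_COST = 0.0  # assumed cost of matching a prediction to "no object"

def match(pred_probs: np.ndarray,   # (N, num_classes) class probabilities
          pred_boxes: np.ndarray,   # (N, 4) predicted boxes (cx, cy, w, h)
          gt_labels: np.ndarray,    # (M,) ground-truth class indices
          gt_boxes: np.ndarray):    # (M, 4) ground-truth boxes
    n, m = pred_boxes.shape[0], gt_boxes.shape[0]
    # Pad the ground-truth side up to N columns with no-object entries.
    cost = np.full((n, n), NO_OBJECT_COST)
    for j in range(m):
        class_cost = 1.0 - pred_probs[:, gt_labels[j]]
        box_cost = np.abs(pred_boxes - gt_boxes[j]).sum(axis=1)
        cost[:, j] = class_cost + box_cost
    rows, cols = linear_sum_assignment(cost)
    # Keep only assignments to real targets (columns < M).
    return [(int(r), int(c)) for r, c in zip(rows, cols) if c < m]
```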
Claim 20 Masry in view of SUN discloses all the elements as stated in Claim 19 and further discloses an adaptation module adapted to receive an annotated real-world dataset of flowchart images and to adjust vision-language machine learning model to answer questions about similar flowcharts (Masry, "Existing Datasets" of Section 2 in Page 2 with Table 1 in Page 3: ChartQA differs from previous datasets in two main aspects: the questions’ types (human-authored vs. template-based) and the chart source (real-world vs. generated using a tool); a detailed comparison is shown in Table 1; earlier datasets such as FigureQA (Kahou et al., 2017), DVQA (Kafle et al., 2018), LEAF-QA (Chaudhry et al., 2020) and LEAF-QA++ (Singh and Shekhar, 2020) are mostly synthetic where the questions are generated using a small number of templates and the answers come from a fixed set of vocabulary (e.g. ‘yes’, ‘no’); PlotQA (Methani et al., 2020) is the only dataset with open-vocabulary questions that require applying aggregation operations on the underlying chart data; FigureQA and DVQA use synthetically-generated data to plot the charts; LEAF-QA and LEAF-QA++ use real-world data to plot the charts; the charts are plotted using a software in PlotQA (Methani et al., 2020); "Dataset Augmentation" of Section 3.2 in Pages 3-4: prior work on QA has performed data augmentation by either creating template-based or machine-generated questions; fine-tune a pre-trained T5 model on the SQuAD QA dataset (Rajpurkar et al., 2016) and apply it to the human-written chart summaries that come with the charts from Statista to automatically generate questions that are human-like with sufficient lexical and syntactic variations; "ChartQA Dataset" of Section 5.2 with Table 6 in Pages 7-8: evaluate the transferability of the models and the datasets, where first pretrain the two top performing models (VisionTaPas and VL-T5) on the PlotQA dataset and then fine-tune them on ChartQA; large datasets like PlotQA can be useful for pretraining the model even if the questions are generated from a small number of templates; Section A.4 in Page 15: train the model to detect the following 15 objects: ’Legend’, ’yAxisTitle’, ’ChartTitle’, ’xAxisTitle’, ’LegendPreview’, ’PlotArea’, ’yAxisLabel’, ’xAxisLabel’, ’LegendLabel’, ’PieLabel’, ’bar’, ’pie’, ’pieSlice’, ’line’, and ’dotLine’; for the bounding box annotations, use the available bboxes; for the masks, generate them easily using the bounding boxes for all the rectangular objects; for ’pieSlice’ and ’pie’, follow a similar approach to (Singh and Shekhar, 2020) where the masks are generated by projecting the radius along the pie perimeter from the starting to the ending points of each slice; use the detectron2 library (Wu et al., 2019) and initialize the model with pre-trained weights on the COCO dataset (Lin et al., 2014); fine-tune the model with a batch size of 8 and an initial learning rate of 0.00025 for 50K iterations) (SUN, ABSTRACT in Page 64292: propose an end-to-end multi-task network FR-DETR (Flowchart Recognition DEtection TRansformer) and a new dataset for precise and robust flowchart recognition; FR-DETR comprises a CNN backbone and a shared multi-scale Transformer structure to perform symbol detection and edge detection using shared feature maps and respective prediction heads in a coarse-to-fine refinement process; the coarse stage analyzes features with low resolution and suggests candidate regions that contain potential targets for the fine stage to produce accurate predictions using features with high resolution; meanwhile, a new dataset is constructed to provide more symbol types and complex backgrounds for network training and evaluation, which contains more than 1000 machine-generated flowchart images, 25K+ symbol instances with nine categories, and 20K+ line segments).
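For illustration only, a detectron2 fine-tuning sketch matching the hyperparameters recited in the Masry citation above (Section A.4): 15 chart-element classes, a COCO-pretrained Mask R-CNN with a ResNet-101 backbone, batch size 8, base learning rate 0.00025, and 50K iterations. The dataset registration name and annotation paths are placeholders and not part of the record.

```python
# Illustrative sketch (not from the record) of a detectron2 fine-tuning
# setup with the hyperparameters recited above.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Placeholder dataset name and paths for COCO-format chart-element annotations.
register_coco_instances("chart_marks_train", {},
                        "annotations/train.json", "images/train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")  # COCO-pretrained weights
cfg.DATASETS.TRAIN = ("chart_marks_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 15   # the 15 chart-element categories listed above
cfg.SOLVER.IMS_PER_BATCH = 8           # batch size 8
cfg.SOLVER.BASE_LR = 0.00025           # initial learning rate
cfg.SOLVER.MAX_ITER = 50000            # 50K iterations

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```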
Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. SINGH et al. (US 2022/0121679 A1, pub. date: 04/21/2022) discloses in ABSTRACT that (1) determining an answer to a query associated with a graphical representation of data; (2) obtaining a visual embedding for a graphical representation of data, the visual embedding representing a plurality of graphical elements; (3) obtaining a query embedding for a query associated with the graphical representation of data, the query embedding representing a plurality of textual elements of the query with at least one textual element substituted with an identifier for at least one graphical element of the set of graphical elements; and (4) generating a chart sequence from the visual embedding and a query sequence from the query embedding, generating an output sequence based on the graph and the query sequences, and determining an answer to the query from the output sequence. SINGH further discloses in ¶¶ [0061]-[0066] with FIG. 7 that (1) the chart reasoning module 702 includes a sequencing network 704 and a transformer block 710; (2) the sequencing network 704 is configured to receive training data (e.g., training data 700), wherein the training data 700 can include query embeddings and chart embeddings for pre-training the chart reasoning module 702; (3) the sequencing network 704 retrieves or obtains the training data 700 from a storage; (4) the transformer block 710 is comprised of a series of transformers, including query transformer 712, chart transformer 714, and transformer 716; (5) each of the transformers 712-716 in transformer block 710 is configured to receive an input sequence and to generate an output sequence; e.g., the query transformer 712 performs question understanding, the chart transformer 714 performs chart structure understanding, and the transformer 716 performs reasoning over the chart to find an answer to a query; (6) the sequencing network 704 can generate a query sequence by assigning a position number to each element of a received query embedding (e.g., word or chart element identifier); (7) a normalization layer is applied before providing the query sequence 706 as an input to the transformer block 710; (8) when the sequencing network 704 receives a visual embedding from the training data 700, the sequencing network 704 generates a chart sequence (e.g., chart sequence 708); (9) the query sequence 706 and the chart sequence 708 are then submitted as inputs to the query transformer 712 and the chart transformer 714, respectively; (10) the outputs of the query transformer 712 and the chart transformer 714 are then submitted as inputs to the transformer 716; and (11) the output of the transformer 716 is training output 718. Any inquiry concerning this communication or earlier communications from the examiner should be directed to HWEI-MIN LU whose telephone number is (313)446-4913. The examiner can normally be reached Mon - Fri: 9:00 AM - 6:00 PM EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mariela D. Reyes can be reached at (571) 270-1006. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /HWEI-MIN LU/Primary Examiner, Art Unit 2142

Prosecution Timeline

Apr 17, 2023
Application Filed
Jan 10, 2026
Non-Final Rejection — §101, §102, §103
Mar 06, 2026
Interview Requested
Mar 31, 2026
Examiner Interview Summary
Mar 31, 2026
Applicant Interview (Telephonic)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602578
LIGHT SOURCE COLOR COORDINATE ESTIMATION SYSTEM AND DEEP LEARNING METHOD THEREOF
2y 5m to grant · Granted Apr 14, 2026
Patent 12596954
MACHINE LEARNING FOR MANAGEMENT OF POSITIONING TECHNIQUES AND RADIO FREQUENCY USAGE
2y 5m to grant · Granted Apr 07, 2026
Patent 12591770
PREDICTING A STATE OF A COMPUTER-CONTROLLED ENTITY
2y 5m to grant · Granted Mar 31, 2026
Patent 12579466
DYNAMIC USER-INTERFACE COMPARISON BETWEEN MACHINE LEARNING OUTPUT AND TRAINING DATA
2y 5m to grant · Granted Mar 17, 2026
Patent 12561222
REDUCING BIAS IN MACHINE LEARNING MODELS UTILIZING A FAIRNESS DEVIATION CONSTRAINT AND DECISION MATRIX
2y 5m to grant · Granted Feb 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
62%
Grant Probability
99%
With Interview (+39.5%)
3y 1m
Median Time to Grant
Low
PTA Risk
Based on 217 resolved cases by this examiner. Grant probability derived from career allow rate.
