DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 02/20/2024 was filed in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference characters not mentioned in the description: reference characters 409 and 411 in Fig. 4A, which do not appear in the description filed on 10/25/2023.
Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 1-2, 4-8, 11, and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over Rimchala (US PGPUB: 20230386236 A1, Filed Date: Nov. 30, 2022) in view of ACHIWA (US PGPUB: 20250078549 A1, Filed Date: Aug. 31, 2023).
Regarding independent claim 1, Rimchala teaches: A method comprising:
receiving a digital image, (Rimchala − [0021, 0023] the image obtained by the system is a document image; the document (100) is a financial document.) wherein the digital image comprises text arranged in a layout within the digital image; (Rimchala − [0021, 0023] The financial document may be any document that includes financial information. Some examples of the financial document include a receipt, a tax document (e.g., W2), a bank statement, a balance sheet, a cash flow statement, a profit and loss statement, etc.) Examiner Note: a financial document comprises text arranged in a layout within the digital image.
generating, by an optical character recognition model, (Rimchala − [0028] the document image (306) may be transmitted to an Optical Character Recognition (OCR) engine (308). The OCR engine is a visual based model that is trained to detect text from the image formatted document (e.g., electronic/scanned copy or a hard copy format or mobile captured images) and generate output including machine encoded text.)
a layout text vector that encodes at least one word in the text of the digital image and also encodes a position of the at least one word in the layout of the digital image; (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding.)
generating, by a visual encoder model, a visual representation vector embedding a content of the digital image; (Rimchala − [0031] the document image gets converted to image embedding vectors using visual feature extractors specific to the choice of encoder models. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0047] A structural representation is a data structure in which each entity is separated in the data structure with a separated entity value and entity label. For example, rather than a single string of entity values, entity labels and corresponding delimiters, the structural representation has a data structure for the document and is composed of individual separate data structures for entities. An example of the structural representation is a JavaScript Object Notation (JSON) document or eXtensible Markup Language (XML) document.) Examiner Note: a separate encoder model is a visual encoder model, used for generating a structural representation of the document.
converting both the layout text vector and the visual representation vector into a projected text vector, (Rimchala − [0037, 0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0061] The OCR output may be converted to a vector representation (708), which is the OCR text divided into tokens.)
wherein the projected text vector comprises a digital format suitable for input to a large language model; (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0037] a BERT (Bidirectional Encoder Representations from Transformers) encoder model may be combined with a BERT for Language Modeling (BERTLM) decoder model. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0061] The OCR output may be converted to a vector representation (708), which is the OCR text divided into tokens.) Examiner Note: BERT (Bidirectional Encoder Representations from Transformers) is considered a large language model (LLM).
and generating an output comprising a key-value pair, wherein: a key of the key-value pair represents a type of the text and a value of the key-value pair represents a value of the type, (Rimchala − Fig. 7 [0060-0061] The extracted content (714) is structure formatted text containing key-value pairs that relates each entity label with an entity identifier. As shown in the example, the output of the decoder model is entity label “Tax Year” related to entity value “2016”, entity label “Wages Amount” related to entity value “9168.26”, entity label “Employee SSN” related to entity value “777-66-9999”, and entity label “Employer EIN” related to entity value “32-1726411”.)
Rimchala does not explicitly teach: combining, into a prompt, the projected text vector, a system message, and a task instruction;
However, ACHIWA teaches: combining, into a prompt, (ACHIWA − [0046] The large language model 116 is a model called LLM (Large Language Model) capable of generating sentences in an interactive manner, and generates replies to input instruction messages (prompts).)
the projected text vector, a system message, (ACHIWA − [0109] In S802, the CPU 261 obtains an instruction message template from the storage 265. The instruction message template, which has been prepared in advance, may be a template prepared as a preset template by the engineer or the user or such a preset template to which a correction or an addition has been made by the system or the user. Examiner NOTE: instruction message template can be system message prepared in advance)
and a task instruction; (ACHIWA − [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image.)
and the output is generated by the large language model which takes, as input, the prompt. (ACHIWA − [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image. A reply 1411 to the instruction message 1410 from the large language model 116 indicates that the large language model 116 has returned “June 2, 2023” as the character string of the item “date” and “¥13,000” as the character string of the item “total amount”.)
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala and ACHIWA, as each invention relates to character recognition of documents. Adding the teaching of ACHIWA provides Rimchala with instruction messages (prompts) for generating replies to inputs. One of ordinary skill in the art would have been motivated to correct and update extraction rules through prompts for misrecognized characters, thereby improving character recognition of document images.
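For illustration of the claimed prompt-assembly step only (not asserted as code of either reference; the function, bracket markers, and message strings are hypothetical, and the projected text vector is shown as a placeholder string rather than soft token embeddings):

```python
def build_prompt(system_message, task_instruction, projected_tokens):
    """Hypothetical sketch: combine, into a single prompt, the projected
    document content, a system message, and a task instruction."""
    return "\n".join([
        f"[SYSTEM] {system_message}",
        f"[TASK] {task_instruction}",
        f"[DOCUMENT] {projected_tokens}",
    ])

prompt = build_prompt(
    "You are an assistant that extracts fields from document images.",
    "Output only the date and total amount in the JSON format.",
    "<projected-text-vector-tokens>",
)
print(prompt)
```

In an actual multimodal system the document portion would be injected as embedding vectors rather than text; the string form above is only to show the three recited prompt components side by side.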
Regarding dependent claim 2, which depends on claim 1, Rimchala teaches: wherein the visual representation vector comprises a hidden representation vector output by a plurality of inner layers of the visual encoder model. (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”.)
Regarding dependent claim 4, which depends on claim 1, Rimchala teaches: wherein: converting comprises inputting a combination of the layout text vector and the visual representation vector to a projection network model, the projection network model outputs the projected text vector, and the projection network model projects the layout text vector and the visual representation vector into a textual token embedding space. (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”.)
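The projection recited in claim 4 can be sketched, for illustration only, as a linear map from the concatenated layout-text and visual vectors into a token-embedding space. All dimensions, the `project` function, and the random weights below are hypothetical toy stand-ins, not drawn from Rimchala:

```python
import random

def project(layout_vec, visual_vec, weights):
    """Hypothetical sketch: concatenate the layout text vector and the
    visual representation vector, then apply a learned linear map (the
    "projection network") to produce a projected text vector in the
    token-embedding space."""
    combined = layout_vec + visual_vec  # list concatenation = vector concat
    # One output coordinate per weight row: dot(row, combined).
    return [sum(w * x for w, x in zip(row, combined)) for row in weights]

random.seed(0)
LAYOUT_DIM, VISUAL_DIM, EMBED_DIM = 4, 6, 8  # toy sizes for illustration
layout_vec = [random.random() for _ in range(LAYOUT_DIM)]  # OCR text + 2D layout
visual_vec = [random.random() for _ in range(VISUAL_DIM)]  # e.g., ViT patch features
weights = [[random.random() for _ in range(LAYOUT_DIM + VISUAL_DIM)]
           for _ in range(EMBED_DIM)]

projected = project(layout_vec, visual_vec, weights)
print(len(projected))  # one coordinate per embedding dimension
```

A real projection network may be a deeper MLP with learned weights; the single linear layer above only shows the recited input/output relationship.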
Regarding dependent claim 5, which depends on claim 1, Rimchala teaches: extracting a plurality of key-value pairs from the digital image in the document, (Rimchala − Fig. 7 [0060-0061] The extracted content (714) is structure formatted text containing key-value pairs that relates each entity label with an entity identifier. As shown in the example, the output of the decoder model is entity label “Tax Year” related to entity value “2016”, entity label “Wages Amount” related to entity value “9168.26”, entity label “Employee SSN” related to entity value “777-66-9999”, and entity label “Employer EIN” related to entity value “32-1726411”.) but does not explicitly teach: the task instruction.
However, ACHIWA teaches: wherein: the task instruction is to extract the key-value pair, or the task instruction is to extract a plurality of key-value pairs from the digital image, and the key-value pair is one of the plurality of key-value pairs representing key information entities in a document. (ACHIWA − [0046] The external information processing server 105 is an apparatus that utilizes a large language model 116. The large language model 116 is a model called LLM (Large Language Model) capable of generating sentences in an interactive manner, and generates replies to input instruction messages (prompts). For example, ChatGPT (registered trademark). [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, shown in Fig. 14B element 1410, is a task instruction. [0168] A reply 1411 to the instruction message 1410 from the large language model 116 indicates that the large language model 116 has returned “June 2, 2023” as the character string of the item “date” and “¥13,000” as the character string of the item “total amount”.) Examiner Note: the returned reply contains the key information entities in the document.
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala and ACHIWA, as each invention relates to character recognition of documents. Adding the teaching of ACHIWA provides Rimchala with instruction messages (prompts) for generating replies to inputs. One of ordinary skill in the art would have been motivated to correct and update extraction rules through prompts for misrecognized characters, thereby improving character recognition of document images.
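The claimed key-value output can be illustrated with the entity labels and values quoted from Rimchala at [0060-0061]; the JSON serialization below is only one possible structured format, shown for clarity:

```python
import json

# Entity labels/values are those quoted from Rimchala [0060-0061];
# the dict/JSON representation itself is illustrative only.
extracted = {
    "Tax Year": "2016",
    "Wages Amount": "9168.26",
    "Employee SSN": "777-66-9999",
    "Employer EIN": "32-1726411",
}

# Each key is the type of the text (entity label); each value is the
# value of that type (entity value).
structured = json.dumps(extracted, indent=2)
print(structured)
```

This matches the claim's mapping: the key of each key-value pair represents a type of the text, and the value represents a value of that type.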
Regarding dependent claim 6, which depends on claim 1, Rimchala does not explicitly teach: the prompt is generated for a zero shot inference without demonstration, and the prompt is additionally generated to include both the task instruction and a demonstration comprising a multimodal input followed by an expected output represented as a known key-value pair in a structured format.
However, ACHIWA teaches: wherein: the prompt is generated for a zero shot inference without demonstration, and the prompt is additionally generated to include both the task instruction and a demonstration comprising a multimodal input followed by an expected output represented as a known key-value pair in a structured format. (ACHIWA − [0046] The external information processing server 105 is an apparatus that utilizes a large language model 116. The large language model 116 is a model called LLM (Large Language Model) capable of generating sentences in an interactive manner, and generates replies to input instruction messages (prompts). For example, ChatGPT (registered trademark). [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, shown in Fig. 14B element 1410, is a task instruction. [0168] A reply 1411 to the instruction message 1410 from the large language model 116 indicates that the large language model 116 has returned “June 2, 2023” as the character string of the item “date” and “¥13,000” as the character string of the item “total amount”.) Examiner Note: JSON is a structured format. The processing server 105 operates like ChatGPT (registered trademark); at zero-shot inference, a core capability, the model performs tasks (such as summarizing, translating, or classifying) from a prompt alone, without demonstrations.
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala and ACHIWA, as each invention relates to character recognition of documents. Adding the teaching of ACHIWA provides Rimchala with instruction messages (prompts) for generating replies to inputs. One of ordinary skill in the art would have been motivated to correct and update extraction rules through prompts for misrecognized characters, thereby improving character recognition of document images.
Regarding dependent claim 7, which depends on claim 1, Rimchala does not explicitly teach: wherein: the prompt is generated for a zero shot inference without demonstration, and the prompt is generated to include meta-information including a document type of the digital image.
However, ACHIWA teaches: wherein: the prompt is generated for a zero shot inference without demonstration, and the prompt is generated to include meta-information including a document type of the digital image. (ACHIWA − [0046] The external information processing server 105 is an apparatus that utilizes a large language model 116. The large language model 116 is a model called LLM (Large Language Model) capable of generating sentences in an interactive manner, and generates replies to input instruction messages (prompts). For example, ChatGPT (registered trademark). [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, shown in Fig. 14B element 1410, is a task instruction. [0168] A reply 1411 to the instruction message 1410 from the large language model 116 indicates that the large language model 116 has returned “June 2, 2023” as the character string of the item “date” and “¥13,000” as the character string of the item “total amount”.) Examiner Note: the date is a type of data (meta-information) of the document image. The processing server 105 operates like ChatGPT (registered trademark); at zero-shot inference, a core capability, the model performs tasks (such as summarizing, translating, or classifying) from a prompt alone.

Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala and ACHIWA, as each invention relates to character recognition of documents. Adding the teaching of ACHIWA provides Rimchala with instruction messages (prompts) for generating replies to inputs. One of ordinary skill in the art would have been motivated to correct and update extraction rules through prompts for misrecognized characters, thereby improving character recognition of document images.
Regarding independent claim 8, Rimchala teaches: A method comprising:
receiving training data comprising a reference output for a digital image, (Rimchala − [0021, 0023] the image obtained by the system is a document image; the document (100) is a financial document. [0027] Training documents (302) are documents that are each associated with corresponding ground truth information (304).) wherein the digital image comprises text arranged in a layout within the digital image; (Rimchala − [0021, 0023] The financial document may be any document that includes financial information. Some examples of the financial document include a receipt, a tax document (e.g., W2), a bank statement, a balance sheet, a cash flow statement, a profit and loss statement, etc.) Examiner Note: a financial document comprises text arranged in a layout within the digital image.
performing a first sub-method comprising: generating, by a visual encoder model, a visual representation vector embedding a content of the digital image; (Rimchala − [0031] the document image gets converted to image embedding vectors using visual feature extractors specific to the choice of encoder models. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0047] A structural representation is a data structure in which each entity is separated in the data structure with a separated entity value and entity label. For example, rather than a single string of entity values, entity labels and corresponding delimiters, the structural representation has a data structure for the document and is composed of individual separate data structures for entities. An example of the structural representation is a JavaScript Object Notation (JSON) document or eXtensible Markup Language (XML) document.) Examiner Note: a separate encoder model is a visual encoder model, used for generating a structural representation of the document.
converting, using a projection network model, the visual representation vector into a projected text vector, wherein the projected text vector comprises a digital format suitable for input to a large language model; (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0037] a BERT (Bidirectional Encoder Representations from Transformers) encoder model may be combined with a BERT for Language Modeling (BERTLM) decoder model. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0061] The OCR output may be converted to a vector representation (708), which is the OCR text divided into tokens.) Examiner Note: BERT (Bidirectional Encoder Representations from Transformers) is considered a large language model (LLM).
generating a loss function by comparing the output to the reference output; (Rimchala − [0033] The raw text (321) may be compared with a similar raw text version of the ground truth information by a loss function (326) to obtain a comparison result.)
adjusting, based on the loss function, one or more parameters in the projection network model; (Rimchala − [0033] The comparison result indicates whether a match or no match exists between particular entities and entity values in the raw and ground truth information. Using the comparison result, the loss function (326) is configured to generate a parameter updates. The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model.)
and training a first trained projection network model by iterating, until convergence, (Rimchala − [0033] [0044] When the models are deemed trained, such as by satisfying a convergence criterion, the system may be deployed to a production system. The production system may be the same or different computer system than the training system (300).)
receiving the training data, performing the first sub-method, generating the loss function, and adjusting the one or more parameters, (Rimchala − [0033] The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model. [0034] In one or more embodiments, the loss function (326) further includes functionality to calculate the loss (e.g., the respective model parameter updates) using a difference (353) calculated by a value-in-OCR comparator (351).)
wherein upon convergence the projection network model is transformed into the first trained projection network model. (Rimchala − [0033] [0044] When the models are deemed trained, such as by satisfying a convergence criterion, the system may be deployed to a production system. The production system may be the same or different computer system than the training system (300).)
Rimchala does not explicitly teach: combining, into a prompt, the projected text vector, a system message, and a task instruction;
However, ACHIWA teaches: combining, into a prompt, (ACHIWA − [0046] The large language model 116 is a model called LLM (Large Language Model) capable of generating sentences in an interactive manner, and generates replies to input instruction messages (prompts).)
the projected text vector, a system message, (ACHIWA − [0109] In S802, the CPU 261 obtains an instruction message template from the storage 265. The instruction message template, which has been prepared in advance, may be a template prepared as a preset template by the engineer or the user or such a preset template to which a correction or an addition has been made by the system or the user. Examiner NOTE: instruction message template can be system message prepared in advance)
and a task instruction; (ACHIWA − [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image.)
and generating, using a large language model that takes the prompt as input, an output comprising a sequence of next tokens in an optical character recognition text determined for the text in the image; (ACHIWA − [0045] The display control unit 158 performs control for displaying the item values extracted by the document image analysis unit 154 and the reply to the instruction message obtained from the large language model 116 to the user. [0061] The information processing server 103 may execute an information processing program stored in the storage 265 to execute information processing such as character recognition (OCR) and information extraction. [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image. A reply 1411 to the instruction message 1410 from the large language model 116 indicates that the large language model 116 has returned “June 2, 2023” as the character string of the item “date” and “¥13,000” as the character string of the item “total amount”.)
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala and ACHIWA, as each invention relates to character recognition of documents. Adding the teaching of ACHIWA provides Rimchala with instruction messages (prompts) for generating replies to inputs. One of ordinary skill in the art would have been motivated to correct and update extraction rules through prompts for misrecognized characters, thereby improving character recognition of document images.
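The first training operation mapped above (generate an output, compare it to the reference output via a loss function, adjust the projection-network parameters, and iterate until convergence) can be sketched as a toy loop. The single scalar weight, the mean-squared-error loss, and all numbers are hypothetical stand-ins for illustration only:

```python
def train_projection(inputs, references, lr=0.1, tol=1e-6, max_iters=1000):
    """Toy sketch of the claimed loop: one scalar weight stands in for
    the projection network's parameters; MSE stands in for the loss
    function comparing outputs to reference outputs."""
    w = 0.0  # untrained projection "network"
    for _ in range(max_iters):
        # Gradient of mean((w*x - y)^2) over the training data.
        grad = sum(2 * (w * x - y) * x
                   for x, y in zip(inputs, references)) / len(inputs)
        new_w = w - lr * grad          # adjust parameters based on the loss
        if abs(new_w - w) < tol:       # convergence criterion
            w = new_w
            break
        w = new_w
    return w

# Toy references generated by y = 3x, so training should recover w ≈ 3.
w = train_projection([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
print(round(w, 3))  # 3.0
```

Upon convergence, `w` is the "trained" parameter, mirroring how the projection network model is transformed into the first trained projection network model.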
Regarding dependent claim 11, which depends on claim 8, Rimchala teaches: wherein receiving, performing, generating, adjusting, and training comprise a first training operation, and wherein the method further comprises a second training operation comprising: (Rimchala − [0045] In FIG. 5, like reference numbers are used for the same elements as shown in FIG. 3. Continual training may optionally be performed, resulting in additional components which are not shown in FIG. 5.) Examiner Note: continual training is a second training operation.
re-receiving the training data; (Rimchala − [0021,0023] obtained by the system is a document image, the document (100) is a financial document. [0027] Training documents (302) are documents that are each associated with corresponding ground truth information (304).)
performing a second sub-method comprising: generating, by an optical character recognition model, (Rimchala − [0028] the document image (306) may be transmitted to an Optical Character Recognition (OCR) engine (308). The OCR engine is a visual based model that is trained to detect text from the image formatted document (e.g., electronic/scanned copy or a hard copy format or mobile captured images) and generate output including machine encoded text.)
a layout text vector that encodes at least one word in the text of the digital image and also encodes a position of the at least one word in the layout of the digital image; (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding.)
generating, by a visual encoder model, a visual representation vector embedding a content of the digital image; (Rimchala − [0031] the document image gets converted to image embedding vectors using visual feature extractors specific to the choice of encoder models. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0047] A structural representation is a data structure in which each entity is separated in the data structure with a separated entity value and entity label. For example, rather than a single string of entity values, entity labels and corresponding delimiters, the structural representation has a data structure for the document and is composed of individual separate data structures for entities. An example of the structural representation is a JavaScript Object Notation (JSON) document or eXtensible Markup Language (XML) document.) Examiner Note: a separate encoder model is a visual encoder model.
converting, using a projection network model, both the layout text vector and the visual representation vector into a projected text vector, wherein the projected text vector comprises a digital format suitable for input to a large language model; (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0037] a BERT (Bidirectional Encoder Representations from Transformers) encoder model may be combined with a BERT for Language Modeling (BERTLM) decoder model. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0061] The OCR output may be converted to a vector representation (708), which is the OCR text divided into tokens.) Examiner Note: BERT (Bidirectional Encoder Representations from Transformers) is considered a large language model (LLM).
and generating an output comprising a key-value pair, wherein: a key of the key-value pair represents a type of the text and a value of the key-value pair represents a value of the type, (Rimchala − Fig. 7 [0060-0061] The extracted content (714) is structure formatted text containing key-value pairs that relates each entity label with an entity identifier. As shown in the example, the output of the decoder model is entity label “Tax Year” related to entity value “2016”, entity label “Wages Amount” related to entity value “9168.26”, entity label “Employee SSN” related to entity value “777-66-9999”, and entity label “Employer EIN” related to entity value “32-1726411”.)
generating a second loss function by comparing the second output to the reference output; (Rimchala − [0033] The raw text (321) may be compared with a similar raw text version of the ground truth information by a loss function (326) to obtain a comparison result.) Examiner Note: a second iteration is repeated to generate the second loss function during training.
adjusting, based on the loss function, the one or more parameters of both the first trained projection network model and the large language model; (Rimchala − [0033] The comparison result indicates whether a match or no match exists between particular entities and entity values in the raw and ground truth information. Using the comparison result, the loss function (326) is configured to generate a parameter updates. The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model.)
and training a second trained projection network model and a trained large language model by iterating, until convergence, (Rimchala − [0033] [0044] When the models are deemed trained, such as by satisfying a convergence criterion, the system may be deployed to a production system. The production system may be the same or different computer system than the training system (300).)
re-receiving the training data, re-performing the second sub-method, generating the second loss function, and adjusting the one or more parameters, (Rimchala − [0033] The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model. [0034] In one or more embodiments, the loss function (326) further includes functionality to calculate the loss (e.g., the respective model parameter updates) using a difference (353) calculated by a value-in-OCR comparator (351).)
wherein: upon convergence the first trained projection network model is transformed into the second trained projection network model and the large language model is transformed into the trained large language model; (Rimchala − [0033] The comparison result indicates whether a match or no match exists between particular entities and entity values in the raw and ground truth information. Using the comparison result, the loss function (326) is configured to generate a parameter updates. The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model.)
and the trained large language model is adapted for extraction of the key-value pair. (Rimchala − Fig. 7 [0060-0061] The extracted content (714) is structure formatted text containing key-value pairs that relates each entity label with an entity identifier. As shown in the example, the output of the decoder model is entity label “Tax Year” related to entity value “2016”, entity label “Wages Amount” related to entity value “9168.26”, entity label “Employee SSN” related to entity value “777-66-9999”, and entity label “Employer EIN” related to entity value “32-1726411”.)
Rimchala does not explicitly teach: combining, into a prompt, the projected text vector, a system message, and a task instruction;
However, ACHIWA teaches: combining, into a prompt, the projected text vector, a system message, (ACHIWA − [0109] In S802, the CPU 261 obtains an instruction message template from the storage 265. The instruction message template, which has been prepared in advance, may be a template prepared as a preset template by the engineer or the user or such a preset template to which a correction or an addition has been made by the system or the user.) Examiner NOTE: the instruction message template can be a system message prepared in advance.
and a task instruction; (ACHIWA − [0165-0168] The message “output only the data and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image.)
and the output is generated by the large language model which takes, as input, the prompt; (ACHIWA − [0165-0168] The message “output only the data and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image. A reply 1411 to the instruction message 1410 from the large language model 116 indicates that the large language model 116 has returned “June 2, 2023” as the character string of the item “date” and “¥13,000” as the character string of the item “total amount”.)
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala and ACHIWA, as each invention relates to character recognition of documents. Adding the teaching of ACHIWA provides Rimchala with instruction messages (prompts) for generating replies to inputs. One of ordinary skill in the art would have been motivated to correct and update extraction rules through prompts for misrecognized characters, thereby improving character recognition of document images.
Regarding dependent claim 13, which depends on claim 12, Rimchala teaches: receiving, after the second training operation, a new digital image; (Rimchala − [0021, 0023] The financial document may be any document that includes financial information. Some examples of the financial document include a receipt, a tax document (e.g., W2), a bank statement, a balance sheet, a cash flow statement, a profit and loss statement, etc. [0045] FIG. 5, like reference numbers are used for the same elements as shown in FIG. 3. Continual training may optionally be performed resulting in additional components which are not shown in FIG. 5.) Examiner Note: continual training is a second training operation.
performing the second sub-method a third time on the new digital image; and presenting the key-value pair extracted from the new digital image. (Rimchala − [0021, 0023] The financial document may be any document that includes financial information. Some examples of the financial document include a receipt, a tax document (e.g., W2), a bank statement, a balance sheet, a cash flow statement, a profit and loss statement, etc. Fig. 7 [0060-0061] The extracted content (714) is structure formatted text containing key-value pairs that relates each entity label with an entity identifier. As shown in the example, the output of the decoder model is entity label “Tax Year” related to entity value “2016”, entity label “Wages Amount” related to entity value “9168.26”, entity label “Employee SSN” related to entity value “777-66-9999”, and entity label “Employer EIN” related to entity value “32-1726411”.)
Regarding independent claim 14, Rimchala teaches: A system comprising: (Rimchala − [0003] In general, in one aspect, one or more embodiments relate to a system that includes an encoder machine learning model, executing on at least one computer processor,)
a computer processor; (Rimchala − [0003] In general, in one aspect, one or more embodiments relate to a system that includes an encoder machine learning model, executing on at least one computer processor,)
a data repository in communication with the computer processor and storing: (Rimchala − [0027] As shown in FIG. 3, the training document repository (301) is a data repository for storing training data.)
a digital image comprising text arranged in a layout within the digital image, (Rimchala − [0021, 0023] The financial document may be any document that includes financial information. Some examples of the financial document include a receipt, a tax document (e.g., W2), a bank statement, a balance sheet, a cash flow statement, a profit and loss statement, etc.) Examiner Note: financial documents comprise text arranged in a layout within a digital image.
a layout text vector that encodes at least one word in the text of the digital image and also encodes a position of the at least one word in the layout of the digital image, (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding.) a visual representation vector embedding a content of the digital image,
a projected text vector, wherein the projected text vector comprises a digital format suitable for input to a large language model, (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0037] a BERT (Bidirectional Encoder Representations from Transformers) encoder model may be combined with a BERT for Language Modeling (BERTLM) decoder model. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0061] The OCR output may be converted to a vector representation (708), which is the OCR text divided into tokens.) Examiner Note: BERT (Bidirectional Encoder Representations from Transformers) is considered a large language model (LLM).
and an output comprising a key-value pair, wherein a key of the key-value pair represents a type of the text and a value of the key-value pair represents a value of the type; (Rimchala − Fig. 7 [0060-0061] The extracted content (714) is structure formatted text containing key-value pairs that relates each entity label with an entity identifier. As shown in the example, the output of the decoder model is entity label “Tax Year” related to entity value “2016”, entity label “Wages Amount” related to entity value “9168.26”, entity label “Employee SSN” related to entity value “777-66-9999”, and entity label “Employer EIN” related to entity value “32-1726411”.)
an optical character recognition model which, when executed by the computer processor, is programmed to generate the layout text vector; (Rimchala − [0028] the document image (306) may be transmitted to an Optical Character Recognition (OCR) engine (308). The OCR engine is a visual based model that is trained to detect text from the image formatted document (e.g., electronic/scanned copy or a hard copy format or mobile captured images) and generate output including machine encoded text.)
a visual encoder model which, when executed by the computer processor, is programmed to generate the visual representation vector; (Rimchala − [0031] the document image gets converted to an image embedding vectors using visual feature extractors specific to the choice of encoder models. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0047] A structural representation is a data structure in which each entity is separated in the data structure with a separated entity value and entity label. For example, rather than a single string of entity values, entity labels and corresponding delimiters, the structural representation has a data structure for the document and is composed of individual separate data structures for entities. An example of the structural representation is a JavaScript Object Notation (JSON) document or eXtensible Markup Language (XML) document.) Examiner NOTE: a separate encoder model is a visual encoder model, used for generating a structural representation of the document.
a projection network model which, when executed by the computer processor, is programmed to generate the projected text vector; (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0037] a BERT (Bidirectional Encoder Representations from Transformers) encoder model may be combined with a BERT for Language Modeling (BERTLM) decoder model. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0061] The OCR output may be converted to a vector representation (708), which is the OCR text divided into tokens.) Examiner Note: BERT (Bidirectional Encoder Representations from Transformers) is considered a large language model (LLM).
Rimchala does not explicitly teach: a prompt, a system message, and a task instruction;
However, ACHIWA teaches: a prompt, a system message, (ACHIWA − [0046] The large language model 116 is a model called LLM (Large Language Model) capable of generating sentences in an interactive manner, and generates replies to input instruction messages (prompts). [0109] In S802, the CPU 261 obtains an instruction message template from the storage 265. The instruction message template, which has been prepared in advance, may be a template prepared as a preset template by the engineer or the user or such a preset template to which a correction or an addition has been made by the system or the user.) Examiner NOTE: the instruction message template can be a system message prepared in advance.
a task instruction, (ACHIWA − [0165-0168] The message “output only the data and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image.)
a prompt generator which, when executed by the computer processor, is programmed to generate the prompt by combining the projected text vector, the system message, and the task instruction; (ACHIWA − [0046] The large language model 116 is a model called LLM (Large Language Model) capable of generating sentences in an interactive manner, and generates replies to input instruction messages (prompts). [0109] In S802, the CPU 261 obtains an instruction message template from the storage 265. The instruction message template, which has been prepared in advance, may be a template prepared as a preset template by the engineer or the user or such a preset template to which a correction or an addition has been made by the system or the user. [0165-0168] The message “output only the data and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image.)
and the large language model which, when executed by the computer processor, is programmed to generate the output comprising the key-value pair. (ACHIWA − [0046] The external information processing server 105 is an apparatus that utilizes a large language model 116. The large language model 116 is a model called LLM (Large Language Model) capable of generating sentences in an interactive manner, and generates replies to input instruction messages (prompts). For example, ChatGPT (registered trademark). [0165-0168] The message “output only the data and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction. [0168] A reply 1411 to the instruction message 1410 from the large language model 116 indicates that the large language model 116 has returned “June 2, 2023” as the character string of the item “date” and “¥13,000” as the character string of the item “total amount”.) Examiner Note: the reply to the instruction message contains key information entities in the document.
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala and ACHIWA, as each invention relates to character recognition of documents. Adding the teaching of ACHIWA provides Rimchala with instruction messages (prompts) for generating replies to inputs. One of ordinary skill in the art would have been motivated to correct and update extraction rules through prompts for misrecognized characters, thereby improving character recognition of document images.
Regarding dependent claim 15, which depends on claim 14, Rimchala teaches: further comprising: a training controller which, when executed by the computer processor, is programmed to train only the projection network model in a first training stage to generate a trained projection network model. (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”.)
Regarding dependent claim 16, which depends on claim 15, Rimchala teaches: wherein the training controller is further programmed to train both the trained projection network model and the large language model in a second training stage. (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0037] a BERT (Bidirectional Encoder Representations from Transformers) encoder model may be combined with a BERT for Language Modeling (BERTLM) decoder model. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0061] The OCR output may be converted to a vector representation (708), which is the OCR text divided into tokens.) Examiner Note: BERT (Bidirectional Encoder Representations from Transformers) is considered a large language model (LLM).
Regarding dependent claim 17, which depends on claim 14, Rimchala teaches: further comprising: a training controller which, when executed by the computer processor, is programmed to: receive training data comprising a reference output comprising a reference digital image; (Rimchala − [0021, 0023] obtained by the system is a document image, the document (100) is a financial document. [0027] Training documents (302) are documents that are each associated with corresponding ground truth information (304).)
perform a first sub-method comprising: generating, by the visual encoder model, the visual representation vector; (Rimchala − [0031] the document image gets converted to an image embedding vectors using visual feature extractors specific to the choice of encoder models. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0047] A structural representation is a data structure in which each entity is separated in the data structure with a separated entity value and entity label. For example, rather than a single string of entity values, entity labels and corresponding delimiters, the structural representation has a data structure for the document and is composed of individual separate data structures for entities. An example of the structural representation is a JavaScript Object Notation (JSON) document or eXtensible Markup Language (XML) document.) Examiner NOTE: a separate encoder model is a visual encoder model, used for generating a structural representation of the document.
converting the visual representation vector into the projected text vector; (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”.)
generate a loss function by comparing the output to the reference output; (Rimchala − [0033] The raw text (321) may be compared with a similar raw text version of the ground truth information by a loss function (326) to obtain a comparison result.)
adjust, based on the loss function, at least one parameter of the projection network model; and (Rimchala − [0033] The comparison result indicates whether a match or no match exists between particular entities and entity values in the raw and ground truth information. Using the comparison result, the loss function (326) is configured to generate a parameter updates. The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model.)
training a first trained projection network model by iterating, until convergence, (Rimchala − [0033] [0044] When the models are deemed trained, such as by satisfying a convergence criterion, the system may be deployed to a production system. The production system may be the same or different computer system than the training system (300).)
receiving the training data, performing the sub-method, generating the loss function, adjusting the at least one parameter, (Rimchala − [0033] The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model. [0034] In one or more embodiments, the loss function (326) further includes functionality to calculate the loss (e.g., the respective model parameter updates) using a difference (353) calculated by a value-in-OCR comparator (351).)
wherein upon convergence the projection network model is transformed into the first trained projection network model. (Rimchala − [0033] [0044] When the models are deemed trained, such as by satisfying a convergence criterion, the system may be deployed to a production system. The production system may be the same or different computer system than the training system (300).)
Rimchala does not explicitly teach: combining, into the prompt, the projected text vector, a system message, and a task instruction;
However, ACHIWA teaches: combining, into the prompt, the projected text vector, a system message, (ACHIWA − [0109] In S802, the CPU 261 obtains an instruction message template from the storage 265. The instruction message template, which has been prepared in advance, may be a template prepared as a preset template by the engineer or the user or such a preset template to which a correction or an addition has been made by the system or the user.) Examiner NOTE: the instruction message template can be a system message prepared in advance.
and a task instruction; (ACHIWA − [0165-0168] The message “output only the data and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image.)
and generating, using a large language model that takes the prompt as input, an output comprising a sequence of next tokens in an optical character recognition text determined for the text in the image; (ACHIWA − [0045] The display control unit 158 performs control for displaying the item values extracted by the document image analysis unit 154 and the reply to the instruction message obtained from the large language model 116 to the user. [0061] The information processing server 103 and execute an information processing program stored in the storage 265 to execute information processing such as character recognition (OCR) and information extraction. [0165-0168] The message “output only the data and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction. [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image. A reply 1411 to the instruction message 1410 from the large language model 116 indicates that the large language model 116 has returned “June 2, 2023” as the character string of the item “date” and “¥13,000” as the character string of the item “total amount”.)
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala and ACHIWA, as each invention relates to character recognition of documents. Adding the teaching of ACHIWA provides Rimchala with instruction messages (prompts) for generating replies to inputs. One of ordinary skill in the art would have been motivated to correct and update extraction rules through prompts for misrecognized characters, thereby improving character recognition of document images.
Regarding dependent claim 18, which depends on claim 17, Rimchala teaches: wherein receiving, performing the sub-method, generating the loss function,
adjusting, and training comprise a first training operation, (Rimchala − [0045] FIG. 5, like reference numbers are used for the same elements as shown in FIG. 3. Continual training may optionally be performed resulting in additional components which are not shown in FIG. 5.) Examiner Note: continual training is a second training operation.
and wherein the training controller is further programmed to perform a second training operation comprising: (Rimchala − [0045] FIG. 5, like reference numbers are used for the same elements as shown in FIG. 3. Continual training may optionally be performed resulting in additional components which are not shown in FIG. 5.) Examiner Note: continual training is a second training operation.
re-receiving the training data comprising: a reference digital image comprising reference text; (Rimchala − [0021, 0023] obtained by the system is a document image, the document (100) is a financial document. [0027] Training documents (302) are documents that are each associated with corresponding ground truth information (304).)
wherein a reference key of the reference key-value pair represents a reference type of the text and a reference value of the reference key-value pair represents a reference value of the reference type; (Rimchala − Fig. 7 [0060-0061] The extracted content (714) is structure formatted text containing key-value pairs that relates each entity label with an entity identifier. As shown in the example, the output of the decoder model is entity label “Tax Year” related to entity value “2016”, entity label “Wages Amount” related to entity value “9168.26”, entity label “Employee SSN” related to entity value “777-66-9999”, and entity label “Employer EIN” related to entity value “32-1726411”.)
perform a second sub-method comprising: generating, by an optical character recognition model, (Rimchala − [0028] the document image (306) may be transmitted to an Optical Character Recognition (OCR) engine (308). The OCR engine is a visual based model that is trained to detect text from the image formatted document (e.g., electronic/scanned copy or a hard copy format or mobile captured images) and generate output including machine encoded text.)
a layout text vector that encodes at least one word in the reference text of the reference digital image and also encodes a position of the at least one word in the layout of the reference digital image; (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding.)
generating, by a visual encoder model, a visual representation vector embedding a content of the reference digital image; (Rimchala − [0031] the document image gets converted to an image embedding vectors using visual feature extractors specific to the choice of encoder models. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0047] A structural representation is a data structure in which each entity is separated in the data structure with a separated entity value and entity label. For example, rather than a single string of entity values, entity labels and corresponding delimiters, the structural representation has a data structure for the document and is composed of individual separate data structures for entities. An example of the structural representation is a JavaScript Object Notation (JSON) document or eXtensible Markup Language (XML) document.) Examiner NOTE: a separate encoder model is a visual encoder model, used for generating a structural representation of the document.
converting, using a projection network model, both the layout text vector and the visual representation vector into a projected text vector, wherein the projected text vector comprises a digital format suitable for input to a large language model; (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence word or subword tokens by a tokenizer specific to the encoder model, each token with corresponding token embedding vector and one dimensional (1D) positional encoding and two dimensional (2D) layout positional embedding. [0037] a BERT (Bidirectional Encoder Representations from Transformers) encoder model may be combined with a BERT for Language Modeling (BERTLM) decoder model. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing image “token”. [0061] The OCR output may be converted to a vector representation (708), which is the OCR text divided into tokens.) Examiner Note: BERT (Bidirectional Encoder Representations from Transformers) is considered a large language model (LLM).
generating an output comprising a key-value pair, wherein: a key of the key-value pair represents a type of the text and a value of the key-value pair represents a value of the type, (Rimchala − Fig. 7 [0060-0061] The extracted content (714) is structure formatted text containing key-value pairs that relates each entity label with an entity identifier. As shown in the example, the output of the decoder model is entity label “Tax Year” related to entity value “2016”, entity label “Wages Amount” related to entity value “9168.26”, entity label “Employee SSN” related to entity value “777-66-9999”, and entity label “Employer EIN” related to entity value “32-1726411”.)
and generating a second loss function by comparing the second output to the reference key-value pair of the reference output; (Rimchala − [0033] The raw text (321) may be compared with a similar raw text version of the ground truth information by a loss function (326) to obtain a comparison result.) Repeating a second iteration generates a second loss function during training.
adjusting, based on the second loss function, both the first trained projection network model and the large language model; (Rimchala − [0033] The comparison result indicates whether a match or no match exists between particular entities and entity values in the raw and ground truth information. Using the comparison result, the loss function (326) is configured to generate parameter updates. The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model.)
and training a second trained projection network model and a trained large language model by iterating, until convergence, (Rimchala − [0033] [0044] When the models are deemed trained, such as by satisfying a convergence criterion, the system may be deployed to a production system. The production system may be the same or different computer system than the training system (300).)
the receiving, the performing, the generating, and the adjusting of the second training operation, (Rimchala − [0033] The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model. [0034] In one or more embodiments, the loss function (326) further includes functionality to calculate the loss (e.g., the respective model parameter updates) using a difference (353) calculated by a value-in-OCR comparator (351). )
wherein upon convergence the first trained projection network model is transformed into the second trained projection network model and the large language model is transformed into the trained large language model. (Rimchala − [0033] The comparison result indicates whether a match or no match exists between particular entities and entity values in the raw and ground truth information. Using the comparison result, the loss function (326) is configured to generate parameter updates. The loss function (326) is used to compute updates to the model parameters by backpropagation through the entire encoder-decoder model.)
Rimchala does not explicitly teach: a reference prompt comprising a reference digital image, a reference system message, and a reference task instruction, and a reference output comprising a reference key-value pair, combining, into a prompt, the projected text vector, a system message, and a task instruction;
However, ACHIWA teaches: a reference prompt comprising a reference digital image, a reference system message, and a reference task instruction, and a reference output comprising a reference key-value pair, (ACHIWA − [0109] In S802, the CPU 261 obtains an instruction message template from the storage 265. The instruction message template, which has been prepared in advance, may be a template prepared as a preset template by the engineer or the user or such a preset template to which a correction or an addition has been made by the system or the user. [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction; [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image. Examiner NOTE: the instruction message template can be a system message prepared in advance)
combining, into a prompt, the projected text vector, a system message, (ACHIWA − [0109] In S802, the CPU 261 obtains an instruction message template from the storage 265. The instruction message template, which has been prepared in advance, may be a template prepared as a preset template by the engineer or the user or such a preset template to which a correction or an addition has been made by the system or the user. Examiner NOTE: instruction message template can be system message prepared in advance)
and a task instruction; (ACHIWA − [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction; [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image.)
and the output is generated by the large language model which takes, as input, the prompt; (ACHIWA − [0165-0168] The message “output only the date and total amount from the following text in the JSON format”, as shown in Fig. 14B element 1410, is a task instruction; [0168] For example, the instruction message 1410 in FIG. 14B includes an instruction to answer the item values corresponding to the items determined to have been unextracted or erroneously extracted from among the group of character strings recognized from the document image. A reply 1411 to the instruction message 1410 from the large language model 116 indicates that the large language model 116 has returned “June 2, 2023” as the character string of the item “date” and “¥13,000” as the character string of the item “total amount”.)
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala and ACHIWA, as each invention relates to character recognition of documents. Adding the teaching of ACHIWA provides Rimchala with instruction messages (prompts) for generating replies to inputs. One of ordinary skill in the art would have been motivated to correct and update extraction rules through prompts for misrecognized characters, thereby improving character recognition of document images.
Regarding dependent claim 19, which depends on claim 14, Rimchala teaches: wherein the visual representation vector comprises a hidden representation vector output by a plurality of inner layers of the visual encoder model. (Rimchala − [0031] The encoder model (316) generates an encoded hidden state vector (320) as output. In one or more embodiments, the raw OCR text is first converted to a sequence of word or subword tokens by a tokenizer specific to the encoder model, each token with a corresponding token embedding vector and one-dimensional (1D) positional encoding and two-dimensional (2D) layout positional embedding. [0038] The ViT encoder uses an attention mechanism among pixels in each image patch to generate the encoder hidden state vector representing an image “token”.)
Claim(s) 3 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Rimchala and ACHIWA as applied to claims 2 and 19 above, and further in view of Baker (US PGPUB: 20120163707 A1, Filed Date: 28, 2010).
Regarding dependent claim 3, which depends on claim 2, Rimchala does not explicitly teach: wherein the visual representation vector excludes a caption text for the digital image.
However, BAKER teaches: wherein the visual representation vector excludes a caption text for the digital image. (BAKER − [0070] Some embodiments may use various filters or heuristics to eliminate positive examples that may be noise. For example, a caption or replacement text for an image that is not descriptive may be removed as a positive example.)
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala, ACHIWA and BAKER, as each invention relates to character recognition of documents. One of ordinary skill in the art would have been motivated to correct and update misrecognized characters, thereby improving character recognition of document images.
Regarding dependent claim 20, which depends on claim 19, Rimchala does not explicitly teach: wherein the visual representation vector excludes a caption text for the digital image.
However, BAKER teaches: wherein the visual representation vector excludes a caption text for the digital image. (BAKER − [0070] Some embodiments may use various filters or heuristics to eliminate positive examples that may be noise. For example, a caption or replacement text for an image that is not descriptive may be removed as a positive example.)
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala, ACHIWA and BAKER, as each invention relates to character recognition of documents. One of ordinary skill in the art would have been motivated to correct and update misrecognized characters, thereby improving character recognition of document images.
Claim(s) 9-10 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Rimchala and ACHIWA as applied to claims 8 and 11 above, and further in view of Lester (US PGPUB: 20230325725 A1, Filed Date: Apr. 12, 2022).
Regarding dependent claim 9, which depends on claim 8, Rimchala teaches the visual encoder model but does not explicitly teach: wherein the visual encoder model and the large language model are frozen.
However, Lester teaches: wherein the visual encoder model and the large language model are frozen such that only the one or more parameters of the projection network model are trained. (Lester − [0030] The plurality of pre-trained parameters for the pre-trained machine-learned model can be fixed during prompt tuning (e.g., the pre-trained machine-learned model can be frozen such that the parameters are not adjusted during training of the prompt parameters) [0034] prompt tuning can involve inputting parameters with the input data into the frozen model such that only those parameters are updated.)
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala, ACHIWA and Lester, as each invention relates to character recognition of documents. One of ordinary skill in the art would have been motivated to tailor training to a specific task and to reduce the computational resources used during training.
Regarding dependent claim 10, which depends on claim 8, Rimchala teaches the visual encoder model but does not explicitly teach: wherein the visual encoder model is frozen, and wherein adjusting further comprises adjusting both the projection network model and the large language model.
However, Lester teaches: wherein the visual encoder model is frozen, and wherein adjusting further comprises adjusting both the projection network model and the large language model. (Lester − [0030] The plurality of pre-trained parameters for the pre-trained machine-learned model can be fixed during prompt tuning (e.g., the pre-trained machine-learned model can be frozen such that the parameters are not adjusted during training of the prompt parameters) [0034] prompt tuning can involve inputting parameters with the input data into the frozen model such that only those parameters are updated. [0113] prompt parameter training can involve a plurality of iterations of output generation and comparison. During such training, the parameters of the pre-trained machine-learned model 906 can remain unadjusted, or “frozen.”) Examiner NOTE: parameters of one model are adjusted while the other model (906) is frozen.
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala, ACHIWA and Lester, as each invention relates to character recognition of documents. One of ordinary skill in the art would have been motivated to tailor training to a specific task and to reduce the computational resources used during training.
Regarding dependent claim 12, which depends on claim 11, Rimchala teaches the visual encoder model but does not explicitly teach: wherein: the visual encoder model and the large language model are frozen during the first training operation such that the projection network model is trained during the first training operation, and the visual encoder model is frozen during the second training operation such that both the first trained projection network model and the large language model are trained during the second training operation.
However, Lester teaches: wherein: the visual encoder model and the large language model are frozen during the first training operation such that the projection network model is trained during the first training operation, and the visual encoder model is frozen during the second training operation such that both the first trained projection network model and the large language model are trained during the second training operation. (Lester − [0030] The plurality of pre-trained parameters for the pre-trained machine-learned model can be fixed during prompt tuning (e.g., the pre-trained machine-learned model can be frozen such that the parameters are not adjusted during training of the prompt parameters) [0034] prompt tuning can involve inputting parameters with the input data into the frozen model such that only those parameters are updated. [0101] As depicted, model tuning 202 can include retraining a machine-learned model for each task.)
Accordingly, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the teachings of Rimchala, ACHIWA and Lester, as each invention relates to character recognition of documents. One of ordinary skill in the art would have been motivated to tailor training to a specific task and to reduce the computational resources used during training.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
DENK, US 20210150202 A1, analyzing contextual information for document processing.
TORRES, US 20210209513 A1, Providing NLP models for document processing.
ROSSI, US 20220350968 A1, Generating extraction models for document processing.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CARL E BARNES JR whose telephone number is (571)270-3395. The examiner can normally be reached Monday-Friday 9am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Hong can be reached at (571) 272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CARL E BARNES JR/Examiner, Art Unit 2178
/STEPHEN S HONG/Supervisory Patent Examiner, Art Unit 2178