Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting, provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1-17 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-6, 8-14, and 16-19 of U.S. Patent No. 12,033,408 (see the table below). Although the claims at issue are not identical, they are not patentably distinct from each other because:
-- Claims 1, 8, and 15 of the instant application recite common subject matter with patent claims 1, 9, and 17;
-- Claims 1, 8, and 15 of the instant application, which recite the open-ended transitional phrase “comprising”, do not preclude the additional elements recited by patent claims 1, 9, and 17; and
-- The elements of claims 1, 8, and 15 of the instant application are fully anticipated by patent claims 1, 9, and 17.
Claim comparison: each claim of the instant application is listed below, followed by the corresponding claim of US-Patent 12,033,408.
1. At least one non-transitory, computer-readable storage medium comprising instructions recorded thereon, the instructions, when executed by at least one processor of a text recognition system, causing the text recognition system to: using an image file that includes a visual representation of alphanumeric characters, cause a trained region encoder to determine a region of interest in the image file; generate a data augmentation entity that comprises a modified image associated with an image extracted from the region of interest; using a trained instance encoder, generate a first set of visual instances corresponding to the image and a second set of visual instances corresponding to the data augmentation entity; generate a first sequence associated with the first set of visual instances and a second sequence associated with the second set of visual instances; and based on a comparison of the first sequence and the second sequence, perform operations comprising: generate additional training data for the trained region encoder; and cause an instance decoder to generate an indication of recognized alphanumeric data that corresponds to the region of interest.
1. At least one non-transitory, computer-readable storage medium comprising instructions recorded thereon, the instructions, when executed by at least one processor of a text recognition system, causing the text recognition system to: obtain an image file comprising a visual representation of alphanumeric characters; receive a prompt relating to the image file, wherein the prompt is associated with a query regarding a region of the image file; using the prompt and the image file, cause a trained region encoder to determine a first region of interest in the image file, wherein the trained region encoder includes an attention-based continual knowledge distillation model; modify a first image associated with the first region of interest of the image file to generate a data augmentation entity, wherein the data augmentation entity comprises a modified image; using a trained instance encoder, generate a first set of visual instances corresponding to the first image associated with the first region of interest and a second set of visual instances corresponding to the data augmentation entity of the first region of interest, wherein the trained instance encoder is trained using self-supervised gradient recursion; generate a first ordered sequence associated with the first set of visual instances and a second ordered sequence associated with the second set of visual instances; and using an output of executing a self-supervised contrastive loss function on the first ordered sequence and the second ordered sequence, perform operations comprising: automatically further train the attention-based continual knowledge distillation model of the trained region encoder; and provide the first ordered sequence to an instance decoder to generate, for display on a user interface, an output item in response to the prompt.
2. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instructions to further train the trained region encoder cause the text recognition system to: provide a representation of a prompt and the image file to a first feature extractor to generate a first feature set associated with a teacher model; provide the representation of the prompt and the image file to a second feature extractor to generate a second feature set associated with a student model; based on providing the first feature set and the prompt to the teacher model, generate a set of region proposals; using the second feature set, the prompt, and the set of region proposals provided to the student model, generate a cross-entropy loss metric; and update the student model based on the cross-entropy loss metric to train the trained region encoder.
2. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instructions to further train the trained region encoder cause the text recognition system to: provide a representation of the prompt and the image file to a first feature extractor to generate a first feature set associated with a teacher model; provide the representation of the prompt and the image file to a second feature extractor to generate a second feature set associated with a student model; based on providing the first feature set and the prompt to the teacher model, generate a set of region proposals; using the second feature set, the prompt, and the set of region proposals provided to the student model, generate a cross-entropy loss metric; and update the student model based on the cross-entropy loss metric to train the trained region encoder.
3. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instructions for generating the region of interest cause the text recognition system to: generate a prompt vector representing the prompt in a vector format; provide the prompt vector to a global contextual attention engine to generate a set of attention indicators associated with elements of the prompt vector, the set of attention indicators comprising a set of attention weights; and determine the region of interest using the set of attention indicators and the prompt vector.
3. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instructions for generating the first region cause the text recognition system to: generate a prompt vector representing the prompt in a vector format; provide the prompt vector to a global contextual attention engine to generate a set of attention indicators associated with elements of the prompt vector, the set of attention indicators comprising a set of attention weights; and determine the first region using the set of attention indicators and the prompt vector.
4. The at least one non-transitory, computer-readable storage medium of claim 3, further comprising operations to, based on providing first image and an output to the global contextual attention engine, update the global contextual attention engine to generate a set of updated region determinations based on input prompts.
4. The at least one non-transitory, computer-readable storage medium of claim 3, wherein the instructions for further training the attention-based continual knowledge distillation model cause the text recognition system to, based on providing the first image and the output to the global contextual attention engine, update the global contextual attention engine to generate a set of updated region determinations based on input prompts.
5. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instructions for generating the data augmentation entity cause the text recognition system to perform a first operation on the image to generate the modified image, wherein the first operation comprises at least one of: a rotation, a translation, a scaling, a noise addition, a color variation, a linear contrast operation, a shear operation, or a skew operation.
5. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instructions for generating the data augmentation entity cause the text recognition system to perform a first operation on the first image to generate the modified image, wherein the first operation comprises at least one of: a rotation, a translation, a scaling, a noise addition, a color variation, a linear contrast operation, a shear operation, or a skew operation.
6. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instructions cause the text recognition system to: using gradient recursion on the first set of visual instances and the second set of visual instances of the trained instance encoder, automatically generate updated model parameters for the trained instance encoder; and using the updated model parameters, retrain the trained instance encoder to generate sets of instances.
6. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instructions cause the text recognition system to: using gradient recursion on the first set of visual instances and the second set of visual instances of the trained instance encoder, automatically generate updated model parameters for the trained instance encoder; and using the updated model parameters, retrain the trained instance encoder to generate sets of instances.
7. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instance decoder comprises a transformer model, an attention decoder, or a connectionist temporal classification model.
8. The at least one non-transitory, computer-readable storage medium of claim 1, wherein the instance decoder comprises a transformer model, an attention decoder, or a connectionist temporal classification model.
8. A text recognition system comprising at least one processor and at least one non-transitory, computer-readable storage medium comprising instructions recorded thereon, the instructions, when executed by the at least one processor, causing the text recognition system to: using an image file that includes a visual representation of alphanumeric characters, cause a trained region encoder to determine a region of interest in the image file; generate a data augmentation entity that comprises a modified image associated with an image extracted from the region of interest; using a trained instance encoder, generate a first set of visual instances corresponding to the image and a second set of visual instances corresponding to the data augmentation entity; generate a first sequence associated with the first set of visual instances and a second sequence associated with the second set of visual instances; and based on a comparison of the first sequence and the second sequence, perform operations comprising: generate additional training data for the trained region encoder; and cause an instance decoder to generate an indication of recognized alphanumeric data that corresponds to the region of interest.
9. A text recognition system comprising: at least one hardware processor; and at least one non-transitory memory storing instructions, which, when executed by the at least one hardware processor, cause the system to: obtain an image file comprising a visual representation of alphanumeric characters; receive a prompt relating to the image file, wherein the prompt is associated with a query regarding a region of the image file; using the prompt and the image file, cause a trained region encoder to determine a first region of the image file, wherein the trained region encoder includes an attention-based continual knowledge distillation model; modify a first image extracted from the first region to generate a data augmentation entity, wherein the data augmentation entity comprises a modified image; using a trained instance encoder, generate a first set of visual instances corresponding to the first image and a second set of visual instances corresponding to the data augmentation entity, wherein the trained instance encoder is trained using self-supervised gradient recursion; generate a first ordered sequence associated with the first set of visual instances and a second ordered sequence associated with the second set of visual instances; using an output of executing a self-supervised contrastive loss function on the first ordered sequence and the second ordered sequence, perform operations comprising: automatically further train the attention-based continual knowledge distillation model of the trained region encoder; and provide the first ordered sequence to an instance decoder to generate, for display on a user interface, an output item in response to the prompt.
9. The text recognition system of claim 8, wherein the instructions to further train the trained region encoder cause the text recognition system to: provide a representation of a prompt and the image file to a first feature extractor to generate a first feature set associated with a teacher model; provide the representation of the prompt and the image file to a second feature extractor to generate a second feature set associated with a student model; based on providing the first feature set and the prompt to the teacher model, generate a set of region proposals; using the second feature set, the prompt, and the set of region proposals provided to the student model, generate a cross-entropy loss metric; and update the student model based on the cross-entropy loss metric to train the trained region encoder.
10. The system of claim 9, wherein the instructions for further training the trained region encoder cause the system to: provide a representation of the prompt and the image file to a first feature extractor to generate a first feature set associated with a teacher model; provide the representation of the prompt and the image file to a second feature extractor to generate a second feature set associated with a student model; based on providing the first feature set and the prompt to the teacher model, generate a set of region proposals; using the second feature set, the prompt, and the set of region proposals provided to the student model, generate a cross-entropy loss metric; and update the student model based on the cross-entropy loss metric to retrain the trained region encoder.
10. The text recognition system of claim 8, wherein the instructions for generating the region of interest cause the text recognition system to: generate a prompt vector representing the prompt in a vector format; provide the prompt vector to a global contextual attention engine to generate a set of attention indicators associated with elements of the prompt vector, the set of attention indicators comprising a set of attention weights; and determine the region of interest using the set of attention indicators and the prompt vector.
11. The system of claim 9, wherein the instructions for generating the first region cause the system to: generate a prompt vector representing the prompt in a vector format; provide the prompt vector to a global contextual attention engine to generate a set of attention indicators associated with elements of the prompt vector, the set of attention indicators comprising a set of attention weights; and determine the first region using the set of attention indicators and the prompt vector.
11. The text recognition system of claim 10, the instructions further causing operations to, based on providing first image and an output to the global contextual attention engine, update the global contextual attention engine to generate a set of updated region determinations based on input prompts.
12. The system of claim 11, wherein the instructions for updating the attention-based continual knowledge distillation model cause the system to, based on providing the first image and the output item to the global contextual attention engine, update the global contextual attention engine to generate a set of updated region determinations based on input prompts.
12. The text recognition system of claim 8, wherein the instructions for generating the data augmentation entity cause the text recognition system to perform a first operation on the image to generate the modified image, wherein the first operation comprises at least one of: a rotation, a translation, a scaling, a noise addition, a color variation, a linear contrast operation, a shear operation, or a skew operation.
13. The system of claim 9, wherein the instructions for generating the data augmentation entity cause the system to perform a first operation on the first image to generate the modified image, wherein the first operation comprises at least one of a rotation, a translation, a scaling, a noise addition, a color variation, a linear contrast operation, a shear operation, or a skew operation.
13. The text recognition system of claim 8, wherein the instructions cause the text recognition system to: using gradient recursion on the first set of visual instances and the second set of visual instances of the trained instance encoder, automatically generate updated model parameters for the trained instance encoder; and using the updated model parameters, retrain the trained instance encoder to generate sets of instances.
14. The system of claim 9, wherein the instructions cause the system to: using gradient recursion on the first set of visual instances and the second set of visual instances of the trained instance encoder, automatically generate updated model parameters for the trained instance encoder; and using the updated model parameters, further retrain the trained instance encoder to generate sets of instances.
14. The text recognition system of claim 8, wherein the instance decoder comprises a transformer model, an attention decoder, or a connectionist temporal classification model.
16. The system of claim 9, wherein the instance decoder comprises a transformer model, an attention decoder, or a connectionist temporal classification model.
15. A computer-implemented method, comprising: using an image file that includes a visual representation of alphanumeric characters, causing a trained region encoder of a text recognition system to determine a region of interest in the image file; generating a data augmentation entity that comprises a modified image associated with an image extracted from the region of interest; using a trained instance encoder, generating a first set of visual instances corresponding to the image and a second set of visual instances corresponding to the data augmentation entity; generating a first sequence associated with the first set of visual instances and a second sequence associated with the second set of visual instances; and based on a comparison of the first sequence and the second sequence, performing operations comprising: generating additional training data for the trained region encoder; and causing an instance decoder to generate an indication of recognized alphanumeric data that corresponds to the region of interest.
17. A method performed by a text recognition system, the method comprising: obtaining an image file comprising a set of visual data; receiving a prompt relating to the image file, wherein the prompt is associated with a query regarding a region of the image file; using the prompt and the image file, causing a trained region encoder to determine a first region of the image file; modifying a first image extracted from the first region to generate a data augmentation entity, wherein the data augmentation entity comprises a modified image; using a trained instance encoder, generating a first set of visual instances corresponding to the first image and a second set of visual instances corresponding to the data augmentation entity; generating a first ordered sequence associated with the first set of visual instances and a second ordered sequence associated with the second set of visual instances; using an output of executing a self-supervised contrastive loss function on the first ordered sequence and the second ordered sequence, performing operations comprising: automatically further training the trained region encoder; and providing the first ordered sequence to an instance decoder to generate, for display on a user interface, an output item in response to the prompt.
16. The method of claim 15, further comprising: providing a representation of a prompt and the image file to a first feature extractor to generate a first feature set associated with a teacher model; providing the representation of the prompt and the image file to a second feature extractor to generate a second feature set associated with a student model; based on providing the first feature set and the prompt to the teacher model, generating a set of region proposals; using the second feature set, the prompt, and the set of region proposals provided to the student model, generating a cross-entropy loss metric; and updating the student model based on the cross-entropy loss metric to train the trained region encoder.
18. The method of claim 17, the method comprising further training the trained region encoder by: providing a representation of the prompt and the image file to a first feature extractor to generate a first feature set associated with a teacher model; providing the representation of the prompt and the image file to a second feature extractor to generate a second feature set associated with a student model; based on providing the first feature set and the prompt to the teacher model, generating a set of region proposals; using the second feature set, the prompt, and the set of region proposals provided to the student model, generating a cross-entropy loss metric; and updating the student model based on the cross-entropy loss metric to retrain the trained region encoder.
17. The method of claim 15, further comprising: generating a prompt vector representing the prompt in a vector format; providing the prompt vector to a global contextual attention engine to generate a set of attention indicators associated with elements of the prompt vector, the set of attention indicators comprising a set of attention weights; and determining the region of interest using the set of attention indicators and the prompt vector.
19. The method of claim 17, comprising: generating a prompt vector representing the prompt in a vector format; providing the prompt vector to a global contextual attention engine to generate a set of attention indicators associated with elements of the prompt vector, the set of attention indicators comprising a set of attention weights; and determining the first region using the set of attention indicators and the prompt vector.
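As background on the claimed training mechanism: the teacher-student steps recited in patent claim 18 above (and in instant claim 16) follow the conventional pattern of knowledge distillation. The sketch below is a minimal, hypothetical PyTorch-style illustration of that general pattern only; the extractor, teacher, and student modules are placeholder callables, and nothing here is taken from the '408 patent's actual implementation.

```python
import torch
import torch.nn.functional as F

def distill_region_encoder(image, prompt_vec, teacher_extractor,
                           student_extractor, teacher, student, optimizer):
    # First feature set, associated with the (frozen) teacher model.
    with torch.no_grad():
        teacher_feats = teacher_extractor(image, prompt_vec)
        # The teacher emits a set of region proposals, used here as soft targets.
        region_proposals = F.softmax(teacher(teacher_feats, prompt_vec), dim=-1)

    # Second feature set, associated with the student model.
    student_feats = student_extractor(image, prompt_vec)
    student_logits = student(student_feats, prompt_vec)

    # Cross-entropy loss metric between the student's predictions and the
    # teacher's region proposals; only the student's parameters are updated.
    loss = F.cross_entropy(student_logits, region_proposals)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```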
Allowable Subject Matter
Claims 1-20 would be allowable if amended, or if a terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) is timely filed, to overcome the nonstatutory double patenting rejection set forth in this Office action.
The following is a statement of reasons for the indication of allowable subject matter:
-- Claims 1, 8, and 15 are allowable over the prior art of record.
-- Claims 2-7 are allowable in view of their dependency from claim 1.
-- Claims 9-14 are allowable in view of their dependency from claim 8.
-- Claims 16-20 are allowable in view of their dependency from claim 15.
With respect to claim 1, the prior art of record, alone or in reasonable combination, does not teach or suggest the following limitation(s), in consideration of the claim as a whole:
“using a trained instance encoder, generate a first set of visual instances corresponding to the image and a second set of visual instances corresponding to the data augmentation entity; generate a first sequence associated with the first set of visual instances and a second sequence associated with the second set of visual instances; and based on a comparison of the first sequence and the second sequence, perform operations comprising: generate additional training data for the trained region encoder; and cause an instance decoder to generate an indication of recognized alphanumeric data that corresponds to the region of interest.”
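As context for this limitation: comparing an ordered sequence derived from an extracted image with one derived from its augmented (modified) counterpart has the shape of a self-supervised contrastive objective, e.g., the NT-Xent loss popularized by SimCLR. The following is a generic sketch of that loss family, with hypothetical embedding inputs; it is offered for orientation only and is not asserted to be the applicant's method.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(seq1_emb, seq2_emb, temperature=0.1):
    """SimCLR-style contrastive loss between embeddings of a first sequence
    (from the extracted image) and a second sequence (from the modified,
    data-augmented image). Both inputs have shape (batch, dim)."""
    z1 = F.normalize(seq1_emb, dim=1)
    z2 = F.normalize(seq2_emb, dim=1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)            # (2n, dim)
    sim = z @ z.t() / temperature             # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))         # a view is never its own positive
    # The positive for row i is its augmented counterpart at i + n (and back).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

A low loss indicates the two sequences agree; in the claim language, the output of such a comparison both drives further training of the region encoder and precedes decoding of the recognized text.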
A prior art of record, Gehrmann et al. (US-PGPUB 20210192126), discloses at least one non-transitory, computer-readable storage medium comprising instructions recorded thereon, the instructions, when executed by at least one processor of a text recognition system, (see at least: Par. 0120, “one or more instructions stored on a computer-readable storage medium”), causing the text recognition system to:
using an image file that includes a visual representation of alphanumeric characters, (see at least: Par. 0044-0045, displaying a digital document within an interactive graphical user interface, and the text summary system determines document tags corresponding to topics within the digital document, [i.e., using an image file, “digital document 102”, that includes a visual representation of alphanumeric characters, “digital document 102 implicitly includes text”]);
cause a trained region encoder to determine a region of interest in the image file, (see at least: Fig. 1, Par. 0027, the text summary system utilizes a machine-learning model, “trained region encoder”; and from Fig. 6, Par. 0092, the text summary system identifies a document segment in the digital document, [i.e., causing a trained region encoder, “machine-learning model”, to determine a region of interest in the image file, “identifying a document segment in the digital document”]);
generate a data augmentation entity that comprises a modified image associated with an image extracted from the region of interest, (see at least: Par. 0046, the text summary system determines which document segments relate to which document tags; and from Par. 0048, the text summary system automatically generates structured text summaries for each of the document segments that correspond to a document tag, [i.e., generating a data augmentation entity, “left image in block 110 in Fig. 1”, that comprises a modified image, “structured text summaries region image”, associated with an image extracted from the region of interest, “document segments image region”]).
However, Gehrmann et al. fails to teach or suggest, either alone or in combination with the other cited references, using a trained instance encoder, generate a first set of visual instances corresponding to the image and a second set of visual instances corresponding to the data augmentation entity; generate a first sequence associated with the first set of visual instances and a second sequence associated with the second set of visual instances; and based on a comparison of the first sequence and the second sequence, perform operations comprising: generate additional training data for the trained region encoder; and cause an instance decoder to generate an indication of recognized alphanumeric data that corresponds to the region of interest.
A further prior art of record, Bhotika et al. (US-PGPUB 20200160050), discloses using an image file that includes a visual representation of alphanumeric characters, (Par. 0017-0018, 0025, implicitly receiving electronic (or “digital”) documents comprising text; and from Par. 0026, the term “text” (or the like) may be used to refer to alphanumeric data, “visual representation of alphanumeric characters”);
cause a trained region encoder to determine a region of interest in the image file, (see at least: Par. 0017, the document processing service can detect and comprehend various segments of different types within a document);
using a trained instance encoder, to generate a first set of visual instances corresponding to the image, (see at least: Par. 0080-0083, the text element encoder 408 may comprise one or more units (e.g., layers of a neural network) that can learn and generate font embeddings 434 for the visual aspects of the text elements 426 (e.g., the visual representation of the text elements within the electronic document 310A, such as a portion of an image including the text element(s)), [i.e., using a trained instance encoder, “text element encoder 408”, to generate a first set of visual instances corresponding to the image, “visual representation of the text elements within the electronic document 310A”]);
generate a first sequence associated with the first set of visual instances, (see at least: Par. 0084, each of the per-pixel feature vectors for a text element may then be “grouped” by a grouping unit 314, which may create a single grouped feature vector 432 for a text element; and from Par. 0063, the text recognition/localization unit 314 can also identify the text itself within the electronic document 310, [i.e., generating a first sequence associated with the first set of visual instances, “implicit by creating a single grouped feature vector 432 for a text element”]);
cause an instance decoder to generate an indication of recognized alphanumeric data that corresponds to the region of interest, (see at least: Par. 0094, machine learning model(s) 406 operate upon this data for each text element to generate feature vectors 418A-418N, [i.e., feature vectors 418A-418N are implicitly an indication of recognized alphanumeric data that corresponds to the region of interest]).
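For orientation, the “grouping” of per-pixel feature vectors into a single grouped feature vector that Bhotika describes can be pictured as masked pooling. The snippet below is a hypothetical illustration of that generic operation, not Bhotika's disclosed grouping unit 314:

```python
import torch

def group_text_element_features(pixel_feats, element_mask):
    # pixel_feats: (H, W, D) per-pixel feature vectors for a document image.
    # element_mask: (H, W) bool tensor, True where a pixel belongs to one
    # text element (e.g., from text detection/localization).
    selected = pixel_feats[element_mask]      # (num_pixels_in_element, D)
    return selected.mean(dim=0)               # (D,) single grouped feature vector
```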
However, Bhotika et al. fails to teach or suggest, either alone or in combination with the other cited references, using a trained instance encoder, generate a first set of visual instances corresponding to the image and a second set of visual instances corresponding to the data augmentation entity; generate a first sequence associated with the first set of visual instances and a second sequence associated with the second set of visual instances; and based on a comparison of the first sequence and the second sequence, perform operations comprising: generate additional training data for the trained region encoder; and cause an instance decoder to generate an indication of recognized alphanumeric data that corresponds to the region of interest.
Another prior art of record, Hu et al. (US-PGPUB 20210303939), discloses at least one non-transitory, computer-readable storage medium comprising instructions recorded thereon, the instructions, when executed by at least one processor of a text recognition system, (see at least: Par. 0099, “computer-readable storage media 1606”), causing the text recognition system to:
using an image file that includes a visual representation of alphanumeric characters, (see at least: Fig. 1, and Par. 0035, the input image 104 appears as part of an electronic document 110 and includes intra-image text with alphanumeric information, while external text 112 can generally appear as a title or caption associated with the input image 104, [i.e., implicitly using the electronic document 110 comprising a visual representation of alphanumeric characters]);
cause a trained region encoder to determine a region of interest in the image file, (see at least: Fig. 1, Par. 0041, a convolutional neural network (CNN) maps the annotated image 124 to an output result that identifies one or more target regions; in the specific example of FIG. 1, the CNN 134 indicates that the candidate region 130 corresponds to a target region, [i.e., determining a region of interest in the image file, “using CNN to identify one or more target regions”]);
generate a data augmentation entity that comprises a modified image associated with an image extracted from the region of interest, (see at least: Fig. 10, and Par. 0082-0088, the text encoder neural network transforms external text into encoded context information by using the transformed counterpart of the [CLS] token (in the final output layer) as the encoded context information; in the special case in which the electronic document 110 contains no external text 112, the text encoder 140 can provide default context information that conveys that fact, [i.e., generating a data augmentation entity, “encoded context information”, that comprises a modified image associated with an image extracted from the region of interest, “implicit by transforming the output results”]).
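Hu's use of the transformed [CLS] token as the encoded context information is a standard transformer idiom. For illustration only, the snippet below shows that idiom using the Hugging Face transformers library (an assumed, generic setup; Hu's actual text encoder 140 is not reproduced here):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

external_text = "Figure 1: quarterly revenue by region"  # e.g., a caption
inputs = tokenizer(external_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The final-layer hidden state at position 0 is the transformed [CLS] token;
# it serves as a single fixed-size embedding of the external text, playing
# the "encoded context information" role described above.
cls_embedding = outputs.last_hidden_state[:, 0, :]   # shape: (1, 768)
```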
However, Hu et al. fails to teach or suggest, either alone or in combination with the other cited references, using a trained instance encoder, generate a first set of visual instances corresponding to the image and a second set of visual instances corresponding to the data augmentation entity; generate a first sequence associated with the first set of visual instances and a second sequence associated with the second set of visual instances; and based on a comparison of the first sequence and the second sequence, perform operations comprising: generate additional training data for the trained region encoder; and cause an instance decoder to generate an indication of recognized alphanumeric data that corresponds to the region of interest.
The prior art of record, Salacinski et al. (US-PGPUB 20230343126), discloses using an image file that includes a visual representation of alphanumeric characters, (see at least: Fig. 2, implicitly receiving PDF image 204); cause a trained region encoder to determine a region of interest in the image file, (Par. 0078, page layout analysis model 208 may use a faster R-CNN using image segmentation, which detects and localizes the ROI in image 204); and generating an indication of recognized alphanumeric data that corresponds to the region of interest, (Par. 0079, read PDF image model 210 may detect and extract the text in image 204); but fails to teach or suggest, either alone or in combination with the other cited references, the above limitations (as combined with the other claimed limitations).
The prior art of record, Berestovsky et al. (US-PGPUB 20230065915) (from IDS), discloses that the algorithm begins with the Detect computation 303 for predicting the bounding box that encompasses the words that collectively define the table, (i.e., detecting a region of interest or table region of the document); and a Line-Item Extractor 315, which uses the OCR 311 enclosed by the predicted table region, (i.e., detecting the alphanumeric characters within the region of interest); but fails to teach or suggest, either alone or in combination with the other cited references, the above limitations (as combined with the other claimed limitations).
The prior art of record, Becker et al. (US-PGPUB 20230121351) (from IDS), discloses that processor 115 can detect the tables, using bounding box coordinates for each table found in each page of the digitized document, (i.e., implicitly detecting a region of interest in the digital document), and then extract data by using machine learning and/or heuristics, and/or pattern matching, etc., to identify relevant table elements and their respective relationships, e.g., relationships between headers and cell values, (i.e., detecting alphanumeric data within the region of interest); but fails to teach or suggest, either alone or in combination with the other cited references, the above limitations (as combined with the other claimed limitations).
Regarding claim 8, claim 8 recites substantially similar limitations to those set forth in claim 1. As such, claim 8 is in condition for allowance for at least similar reasons as stated above.
Regarding claim 15, claim 15 recites substantially similar limitations to those set forth in claim 1. As such, claim 15 is in condition for allowance for at least similar reasons as stated above.
Other prior art listed on the attached form PTO-892 shows various aspects of the invention, but none, either alone or in combination, teaches or suggests all the claimed limitations.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMARA ABDI whose telephone number is (571)272-0273. The examiner can normally be reached 9:00am-5:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le can be reached at (571) 272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AMARA ABDI/
Primary Examiner, Art Unit 2668
04/02/2026