Last updated: July 17, 2026

Application No. 18/604,902

ONE-SHOT MULTIMODAL LEARNING FOR DOCUMENT IDENTIFICATION

Final Rejection §103

Filed

Mar 14, 2024

Priority

Mar 27, 2023 — provisional 63/454,830

Examiner

BEZUAYEHU, SOLOMON G

Art Unit

2674

Tech Center

2600 — Communications

Assignee

Iron Mountain Incorporated

OA Round

2 (Final)

Interview Optional

Examiner has a relatively high allowance rate (75%) and a +30.2% interview lift. A written response may suffice.

Based on 627 resolved cases, 2023–2026

Examiner Intelligence

BEZUAYEHU, SOLOMON G

Grants 75% — above average

Career Allowance Rate

473 granted / 627 resolved

+13.4% vs TC avg

Strong +30% interview lift

+30.2%

Interview Lift

resolved cases with interview

Typical timeline

3y 3m

Avg Prosecution

42 currently pending

Career history

663

Total Applications

across all art units

Statute-Specific Performance

§101

4.1%

-35.9% vs TC avg

§103

86.9%

+46.9% vs TC avg

§102

2.6%

-37.4% vs TC avg

§112

1.8%

-38.2% vs TC avg

Based on career data from 627 resolved cases

Office Action

§103

DETAILED ACTION
Response to Arguments
Applicant's arguments with respect to the claims, filed 4/24/2026, have been fully considered but are moot in view of the new ground(s) of rejection. The rejections are necessitated by the claim amendments.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, 10-12, 14-15, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kletter et al. (Pub. No. US 2009/0324100) in view of Zhang et al. (Pub. No. US 2023/022285).
Regarding claim 1, Kletter teaches a computer-implemented method of document image processing, the method comprising: for each template document image of a plurality of template document images [Para. 43 “In this flow, target images 510 are processed sequentially, one at a time, to extract their visual fingerprint information based on keypoint identification”]:
generating a corresponding digital fingerprint (fingerprint information) of a plurality of digital fingerprints by: determining a plurality of regions (connected components) within the template document image, wherein the plurality of regions comprises a plurality of text regions [Para. 40 “For each document in the collection of target images 310, keypoints are identified 320 and for each keypoint, fingerprint information is computed from local groups of keypoints by performing fingerprinting operations 330.”; Para. 44 “Finally, a duplicate removal module removes any duplicate connected components having nearly the same centroid location. The resulting word centroids locations are selected as candidate image keypoints”]; and
 filtering (sorted by relative strength) the plurality of regions (connected components) to determine at least one region of interest (smaller subset of connected components) [Para. 54 “In a second embodiment, the available connected components are sorted by relative strength, for example, giving weight to optimum of the connected component dimensions, pixel count, aspect ratio, and/or proximity to other connected components, and only the smaller subset of connected components are outputted”], wherein the digital fingerprint (fingerprint information) is determined using the at least one region of interest (smaller subset of connected components) [Para. 54 “In a second embodiment, the available connected components are sorted by relative strength, for example, giving weight to optimum of the connected component dimensions, pixel count, aspect ratio, and/or proximity to other connected components, and only the smaller subset of connected components are outputted”; Para. 44 “The resulting word centroids locations are selected as candidate image keypoints.”; and Para. 40 “For each document in the collection of target images 310, keypoints are identified 320 and for each keypoint, fingerprint information is computed from local groups of keypoints by performing fingerprinting operations 330.”];  
Kletter teaches extracting/generating visual fingerprint information based on keypoint identification [Para. 43 “extract their visual fingerprint information based on keypoint identification”]. 
However, Kletter doesn’t explicitly teach generating augmented data that is based on information from the template document image. 
Zhang teaches generating augmented data (masked training block) that is based on information from the template document image [Para. 66 “For example, a method for pretraining can include obtaining a set of media training blocks from a set of training documents. In some implementations, the set of media training blocks can include images from the training documents. The method can include masking one or more images of a media training block to obtain a masked training block”]. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter’s target image pre-processing by applying Zhang’s image masking operation to each target image to generate augmented data from the target image content before model training. This modification improves Kletter’s identification reliability for degraded query images.
Kletter teaches generating candidate fingerprint combinations [Para. 41].
However, Kletter doesn’t explicitly teach generating at least one training sample using the augmented data. 
Zhang teaches generating at least one training sample (Media block representations) using the augmented data (masked training block) [Para. 66].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter’s document-fingerprint preprocessing system by applying Zhang’s masking operation to generate augmented data (masked training block) from each document image and processing the augmented data with the block-level encoding model to generate at least one training sample for model training. This modification improves Kletter by training on representations derived from intentionally masked document content, thereby increasing identification robustness when a query document is degraded, transformed, or contains missing visual features.
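For illustration only, the proposed combination (masking a document image to obtain augmented data, then deriving a training sample from it) can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of Kletter or Zhang: the names `mask_random_patch` and `make_training_samples`, the patch size, and the representation of an image as a nested list of pixel values are all hypothetical.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # hypothetical grayscale image: rows of pixel values 0-255


def mask_random_patch(image: Image, patch: int, rng: random.Random, fill: int = 0) -> Image:
    """Return a copy of `image` with one random patch-by-patch square set to `fill`."""
    height, width = len(image), len(image[0])
    top = rng.randrange(0, max(1, height - patch + 1))
    left = rng.randrange(0, max(1, width - patch + 1))
    masked = [row[:] for row in image]  # copy so the template image is untouched
    for y in range(top, min(top + patch, height)):
        for x in range(left, min(left + patch, width)):
            masked[y][x] = fill
    return masked


def make_training_samples(
    image: Image, image_id: str, count: int, patch: int = 2, seed: int = 0
) -> List[Tuple[Image, str]]:
    """Pair each masked copy (augmented data) with the template's ID as its label."""
    rng = random.Random(seed)
    return [(mask_random_patch(image, patch, rng), image_id) for _ in range(count)]
```

Each returned pair holds one masked copy of the template and the template's identifier, which is the form a supervised identification model would typically consume.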
Kletter teaches using the plurality of digital fingerprints [Para. 42 “At query time, FIG. 4 illustrates performing a real-time image query 400 for a particular query image 410, by identifying keypoint locations 420 in the particular query image 410 and computing fingerprint information 430 for each query keypoint from local groups of query keypoints, matching the query fingerprints 440 to the existing Fan Tree fingerprint data 480 to determine the best matching document or set of documents within the collection”]. 
However, Kletter doesn’t explicitly teach using the plurality of digital fingerprints and the at least one training sample, training a multimodal model.
Zhang teaches using the plurality of digital fingerprints and the at least one training sample (media block representation), training a multimodal model (multimodal transformer) [Para. 37 “The encoder model can encode each block with a multimodal transformer in the lower-level and aggregates the block-level representations and connections utilizing a specifically designed transformer at the higher-level.”; and Para. 66 “the media block representation for the masked training block can include a prediction output that can include a predicted similarity between the masked training block and each of a plurality of additional masked training blocks from the training batch.”].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter’s document fingerprint matching system by converting each digital fingerprint (fingerprint information) into a fingerprint-derived feature vector and providing it, together with Zhang’s training sample (media block representation), as input to the multimodal model during training. The modification improves discrimination among visually similar documents and robustness to degraded queries.
Kletter teaches for each template document image of the plurality of template document images to train the multimodal model to generate an identification prediction based on a query document image (query image) [Para. 43 “in this flow, target images 510 are processed sequentially, one at a time, to extract their visual fingerprint information based on keypoint identification.”; Para. 45 “The resulting fingerprints are stored in the Fingerprint Database 550 where each image will have a unique image ID 540 corresponding to the target images 510” and Para. 42 “Finally, a Fingerprint score analysis module 490 examines the resulting list of accumulated scores and counts of matching fingerprints 470 for each document to determine the best matching document or set of documents 495 within the collection that best matches the query image 410”].
However, Kletter doesn’t explicitly teach the rest of the claim limitations.
Zhang teaches wherein training the multimodal model comprises using the at least one training sample (media block representations) for each template document image of the plurality of template document images to train the multimodal model to generate an identification prediction (classification) based on a query document image [Para. 37 “the encoder model can encode each block with a multimodal transformer in the lower-level and aggregates the block-level representations and connections utilizing a specifically designed transformer at the higher-level.”; Para. 66 “the media block representation for the masked training block can include a prediction output that can include a predicted similarity between the masked training block and each of a plurality of additional masked training blocks from the training batch”; and Para. 47 “the model can be trained for document classification with ground truth data that may include a document and a pre-determined classification for the document”]. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter’s pre-target image identification workflow by generating a training sample (media block representation) from each target image, pairing it with that image’s unique image ID as the predetermined identification prediction, and using the resulting pairs to train Zhang’s multimodal model before applying the trained model to a query document image. This modification improves query document identification under occlusion, distortion, or content variation.
Regarding claims 4 and 17, Kletter teaches wherein, for each template document image of the plurality of template document images, filtering the plurality of regions comprises: applying a first filter to select, from among the plurality of regions, a first plurality of selected regions (smaller subset of connected components); and applying a second filter (remove duplicates module 680) to omit, from among the first plurality of selected regions, at least one selected region to obtain a second plurality of selected regions (list of key points 685) [Para. 54, 160, and 56].
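As a rough illustration of the two-stage filtering mapped above, the sketch below keeps the strongest regions and then omits regions whose centroids nearly coincide with one already kept. It is a sketch under assumptions rather than Kletter's implementation: the `Region` type, the single `strength` score, and the `keep` and `min_dist` parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    x: float         # centroid x
    y: float         # centroid y
    strength: float  # hypothetical score, e.g. from size, pixel count, aspect ratio


def first_filter(regions: List[Region], keep: int) -> List[Region]:
    """Sort by relative strength and keep only the strongest `keep` regions."""
    return sorted(regions, key=lambda r: r.strength, reverse=True)[:keep]


def second_filter(regions: List[Region], min_dist: float) -> List[Region]:
    """Omit any region whose centroid nearly coincides with one already kept."""
    kept: List[Region] = []
    for r in regions:
        if all((r.x - k.x) ** 2 + (r.y - k.y) ** 2 >= min_dist ** 2 for k in kept):
            kept.append(r)
    return kept
```

Applying `second_filter(first_filter(regions, keep=3), min_dist=2.0)` to a list of regions yields the second plurality of selected regions in the sense used in the claim mapping.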
Regarding claims 10 and 19, Kletter doesn’t explicitly teach the claim limitation. 
 Zhang teaches wherein, for each template document image of the plurality of template document images, each text region of the plurality of text regions indicates: a text string detected within the text region, and a boundary of the text region within a corresponding template document image [Para. 7, and Para. 8 “The layout data can include spatial layout data descriptive of spatial positions of the plurality of blocks within the document”].  
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter to include the claimed feature as taught by Zhang, because the modification enables the system to improve automatic recognition and classification of document images by building richer multimodal fingerprints from selected text/image regions and using them to train a multimodal model so the system can accurately identify document types with very few templates.
Regarding claims 11 and 20, Kletter doesn’t explicitly teach the claim limitation.
Zhang teaches wherein, for each template document image of the plurality of template document images, each text region of the plurality of text regions indicates: a text string detected within the text region, and a location and image patch of the text region within a corresponding template document image [Para. 63, and 66-67].  
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter to include the claimed feature as taught by Zhang, because the modification enables the system to improve automatic recognition and classification of document images by building richer multimodal fingerprints from selected text/image regions and using them to train a multimodal model so the system can accurately identify document types with very few templates.
Regarding claim 12, Kletter doesn’t explicitly teach the claim limitation.
Zhang teaches wherein, for each template document image of the plurality of template document images, the plurality of regions includes at least one image region, and each image region of the at least one image region indicates: a boundary of the image region within a corresponding template document image, and image content of the image region [fig. 2, 4, and related description]. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter to include the claimed feature as taught by Zhang, because the modification enables the system to improve automatic recognition and classification of document images by building richer multimodal fingerprints from selected text/image regions and using them to train a multimodal model so the system can accurately identify document types with very few templates.
Regarding claim 14, Kletter doesn’t explicitly teach the claim limitation.
Zhang teaches wherein the multimodal model includes a multimodal transformer model [Para. 37].  
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter to include the claimed feature as taught by Zhang, because the modification enables the system to improve automatic recognition and classification of document images by building richer multimodal fingerprints from selected text/image regions and using them to train a multimodal model so the system can accurately identify document types with very few templates.
Claims 15 and 18 are rejected for the same reason as claim 1. Furthermore, Kletter teaches a processor and non-transitory readable media to perform the claim limitations [fig 1, 12 and related description].

Claims 2, 3, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Kletter et al. (Pub. No. US 2009/0324100) in view of Zhang et al. (Pub. No. US 2023/022285) further in view of Yiheng XU (“LayoutLM: Pre-training of Text and Layout for Document Image Understanding”).
Regarding claim 2, Kletter in view of Zhang doesn’t explicitly teach the claim limitations. 
However, XU teaches wherein each template document image of the plurality of template document images is unique among the plurality of template document images [fig. 1 and related description. It is clear that each scanned document has a different template].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of Zhang to teach training the model using document signatures, a feature taught by XU, because the modification enables the system to accurately understand scanned documents by training the model using 2D layout and visual structures.

Regarding claims 3 and 16, Kletter in view of Zhang doesn’t explicitly teach the claim limitations. 
 However, XU teaches a first template document image that is an image of a first edition of a form document, and a second template document image that is an image of a second edition of the form document that is different than the first edition [fig. 1 and related description].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of Zhang to teach training the model using document signatures, a feature taught by XU, because the modification enables the system to accurately understand scanned documents by training the model using 2D layout and visual structures.

Claims 5-9 are rejected under 35 U.S.C. 103 as being unpatentable over Kletter et al. (Pub. No. US 2009/0324100) in view of Zhang et al. (Pub. No. US 2023/022285) further in view of King et al. (Patent No. US 8953886).
Regarding claim 5, Kletter teaches wherein, for each template document image of the plurality of template document images, applying the first filter comprises selecting at least one text region from among the plurality of text regions [Para. 50 “The Estimate CC module 630 processes the binary image 625 to gather connected-component elements, and proceeds to histogram the connected-component height, because character height is less variable and more indicative of the font size than character width in most Roman languages”].
Kletter in view of Zhang doesn’t explicitly teach that the selection is based at least on a number of characters in the text region.
King teaches that the selection is based at least on a number of characters in the text region [Col. 4, line 65 to Col. 5, line 5].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of Zhang to teach training the model using document signatures, a feature taught by King, because the modification enables the system to improve document template identification by generating more discriminative fingerprints that focus on informative, text-based regions so similar looking forms can be reliably told apart even with very little training data.
Regarding claim 6, Kletter teaches wherein, for each template document image of the plurality of template document images, applying the first filter comprises selecting at least one text region (connected component centroids; keypoints) from among the plurality of text regions [Para. 50 “The Estimate CC module 630 processes the binary image 625 to gather connected-component elements, and proceeds to histogram the connected-component height, because character height is less variable and more indicative of the font size than character width in most Roman languages”; Para. 57 “The list of remaining connected component centroids at the output of the Remove Duplicates module 680 becomes the final candidate query keypoints list 695”].
Kletter in view of Zhang doesn’t explicitly teach that the selection is based at least on a number of characters in the text region.
King teaches selection based at least on a number of characters in the text region and natural language processing (NLP) non-stop words [Col. 4, line 65 to Col. 5, line 7].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of Zhang to include the claimed feature as taught by King, because the modification enables the system to improve document template identification by generating more discriminative fingerprints that focus on informative, text-based regions so similar looking forms can be reliably told apart even with very little training data.
Regarding claim 7, Kletter in view of Zhang doesn’t explicitly teach the claim limitation.
However, King teaches wherein, for each template document image of the plurality of template document images, applying the second filter comprises omitting at least one region of the at least one selected region based on a number of occurrences of the region among the plurality of template document images [fig. 5, 6 and related description].  
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of Zhang to include the claimed feature as taught by King, because the modification enables the system to improve document template identification by generating more discriminative fingerprints that focus on informative, text-based regions so similar looking forms can be reliably told apart even with very little training data.
Regarding claim 8, Kletter teaches wherein the plurality of regions of interest (candidate query key point list) includes the second plurality of selected regions [Para. 57].  
Regarding claim 9, Kletter teaches wherein each fingerprint of the plurality of fingerprints includes a feature vector (string of quantized integers, sequence of about 35 quantized integers in the range of [0,7], and 35-dimensional vector space) that is based on a corresponding plurality of regions of interest (local neighborhoods of key points) [Para. 5 “Form "fingerprints" that may represent the two-dimensional spatial arrangements of local neighborhoods of keypoints. A fingerprint is a string of quantized integers that encode certain distortion-invariant triangle area ratios among the keypoints in each neighborhood.” and Para. 6 “The fingerprints are of high dimension which may be composed of a sequence of about 35 quantized integers in the range of [0,7], which can be interpreted as a 35-dimensional vector space”].
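To make the quoted description concrete, the sketch below builds a fingerprint as a list of quantized triangle-area ratios. It is illustrative only and not Kletter's implementation. Two choices are assumptions made here: taking every 4-point subset of a 7-keypoint neighborhood (which happens to give C(7,4) = 35 values) and the quantization boundaries in `THRESHOLDS`. The ratio of two triangle areas is unchanged by scaling and translation, which is what makes such values useful as distortion-tolerant features.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Hypothetical quantization boundaries: 7 thresholds give 8 levels, 0 through 7.
THRESHOLDS = (0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0)


def triangle_area(a: Point, b: Point, c: Point) -> float:
    """Unsigned area of triangle abc from the 2D cross product."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def quantize(ratio: float) -> int:
    """Map a ratio to an integer in [0, 7] by counting the thresholds it reaches."""
    return sum(ratio >= t for t in THRESHOLDS)


def fingerprint(keypoints: Sequence[Point]) -> List[int]:
    """One quantized triangle-area ratio per 4-point subset of the neighborhood."""
    digits = []
    for a, b, c, d in combinations(keypoints, 4):
        denominator = triangle_area(a, c, d)
        # Degenerate (collinear) triangles get the top level instead of dividing by zero.
        ratio = triangle_area(a, b, c) / denominator if denominator > 0 else float("inf")
        digits.append(quantize(ratio))
    return digits
```

Calling `fingerprint` on seven keypoints returns 35 integers, each between 0 and 7, and the result stays the same if all keypoints are uniformly scaled and shifted.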

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOLOMON G BEZUAYEHU whose telephone number is (571)270-7452.  The examiner can normally be reached on Monday-Friday 10 AM-8 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Oneal Mistry, can be reached at 313-446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 888-786-0101 (IN USA OR CANADA) or 571-272-4000.
/SOLOMON G BEZUAYEHU/
Primary Examiner, Art Unit 2666

Prosecution Timeline

Mar 14, 2024

Application Filed

Jan 27, 2026

Non-Final Rejection mailed — §103

Apr 24, 2026

Response Filed

Jun 24, 2026

Final Rejection mailed — §103 (current)
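
The reply periods stated in the Conclusion of the office action can be counted from the Jun 24, 2026 mailing date with simple calendar arithmetic. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and it ignores any carry-over when a date falls on a weekend or federal holiday as well as the advisory-action exception described in the action.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the last day of the target month."""
    index = start.month - 1 + months
    year, month = start.year + index // 12, index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_window(mailing_date: date) -> dict:
    """Key dates counted from the mailing date of a final action."""
    return {
        "two_month_mark": add_months(mailing_date, 2),
        "shortened_period_ends": add_months(mailing_date, 3),
        "statutory_maximum": add_months(mailing_date, 6),
    }


window = reply_window(date(2026, 6, 24))  # mailing date taken from the timeline
for name, when in window.items():
    print(name, when.isoformat())
```

Under these assumptions the two-month mark is 2026-08-24, the shortened statutory period ends 2026-09-24, and the six-month maximum is 2026-12-24.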

Precedent Cases

Applications granted by this same examiner with similar technology

18/125,164

Patent 12682658

WHITE LINE RECOGNITION DEVICE, MOBILE OBJECT CONTROL SYSTEM, AND WHITE LINE RECOGNITION METHOD

3y 3m to grant Granted Jul 14, 2026

18/471,712

Patent 12676012

SYSTEM AND METHOD FOR LANE GRAPH ESTIMATION

2y 9m to grant Granted Jul 07, 2026

18/081,252

Patent 12670758

AUTHENTICATION DEVICE AND VEHICLE HAVING THE SAME

3y 6m to grant Granted Jun 30, 2026

18/538,833

Patent 12671766

SYSTEM AND METHOD FOR ELECTRONIC NOTIFICATION IN INSTITUTIONAL COMMUNICATIONS

2y 6m to grant Granted Jun 30, 2026

17/924,706

Patent 12664760

METHOD OF ANALYZING A COMPONENT, METHOD OF TRAINING A SYSTEM, APPARATUS, COMPUTER PROGRAM, AND COMPUTER-READABLE STORAGE MEDIUM

3y 7m to grant Granted Jun 23, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4

Expected OA Rounds

75%

Grant Probability

99%

With Interview (+30.2%)

3y 3m (~11m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 627 resolved cases by this examiner. Grant probability derived from career allowance rate.