Prosecution Insights
Last updated: April 19, 2026
Application No. 18/604,902

ONE-SHOT MULTIMODAL LEARNING FOR DOCUMENT IDENTIFICATION

Non-Final OA §101 §103
Filed
Mar 14, 2024
Examiner
BEZUAYEHU, SOLOMON G
Art Unit
2674
Tech Center
2600 — Communications
Assignee
Iron Mountain Incorporated
OA Round
1 (Non-Final)
75%
Grant Probability
Favorable
1-2
OA Rounds
3y 4m
To Grant
99%
With Interview

Examiner Intelligence

Grants 75% — above average
75%
Career Allow Rate
464 granted / 618 resolved
+13.1% vs TC avg
Strong +31% interview lift
+30.9%
Interview Lift
resolved cases with interview
Typical timeline
3y 4m
Avg Prosecution
30 currently pending
Career history
648
Total Applications
across all art units

Statute-Specific Performance

§101: 16.0% (-24.0% vs TC avg)
§103: 49.7% (+9.7% vs TC avg)
§102: 13.4% (-26.6% vs TC avg)
§112: 11.7% (-28.3% vs TC avg)
Tech Center averages are estimates. Based on career data from 618 resolved cases.

Office Action

§101 §103
DETAILED ACTION
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter because the claim(s) as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than an abstract idea. The claims are directed to the abstract idea (i.e. generating/creating a fingerprint for each template document image, detecting/observing a plurality of text regions in the template document image and filtering/selecting a plurality of regions to obtain regions of interest), which is a method of mental process because each limitation can be performed by a human being and (i.e. training multimodal model with signature) is nothing more than mathematical concept. The additional element(s) or combination of elements in the claim(s) other than the abstract idea per se amount(s) to no more than: recitation of generic computer structure that serves to perform generic computer functions that are well-understood, routine, and conventional activities previously known to the pertinent industry. Viewed as a whole, these additional claim element(s) do not provide meaningful limitation(s) to transform the abstract idea into a patent eligible application of the abstract idea such that the claim(s) amounts to significantly more than the abstract idea itself. Therefore, the claim(s) are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter.
Claims 15 and 18 are rejected for the same reason as claim 1. None of the dependent claims has a limitation that amounts to a significantly more than abstract idea. Therefore, claims 2-14, 16, 17, 19, and 20 are also rejected.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4 and 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Kletter et al. (Pub. No. US 2009/0324100) in view of Yiheng XU (“LayoutLM: Pre-training of Text and Layout for Document Image understanding”).
Regarding claim 1, Kletter teaches a computer-implemented method of document image processing, the method comprising: for each template document image (target images 510) of a plurality/collection of template document images (target images), generating a corresponding fingerprint of a plurality of fingerprints [Para. 43 “target images 510 are processed sequentially, one at a time, to extract their visual fingerprint information based on keypoint identification”; Para. 45; 737, 202, 214, 208, fig. 5, 6 and related description]; wherein, for each template document image of the plurality of template document images, generating the corresponding fingerprint comprises: detecting a plurality of regions (connected components) within the template document image, wherein the plurality of regions comprises a plurality of text regions (text lines) [Para. 55 “connected components tend to be situated in text lines” fig. 5, 6 and related description]; and filtering the plurality of regions (connected components) to obtain a plurality of regions of interest (smaller subset of connected components), wherein the fingerprint is based on the plurality of regions of interest (smaller subset of connected components) [Para. 160 “only the smaller subset of connected components are outputted”].
However, Kletter doesn’t explicitly teach based on the plurality of fingerprints, training a multimodal model. XU teaches based on the plurality of fingerprints, training a multimodal model [Abstract “we also leverage image features to incorporate words’ visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training”. Introduction “We add these two input embeddings because the 2-D position embedding can capture the relationship among tokens within a document, meanwhile the image embedding can capture some appearance features such as font directions, types, and colors. In addition, we adopt a multi-task learning objective for LayoutLM, including a Masked Visual-Language Model (MVLM) loss and a Multi-label Document Classification (MDC) loss, which further enforces joint pre-training for text and layout. In this work, our focus is the document pre-training based on scanned document images, while digital-born documents are less challenging because they can be considered as a special case where OCR is not required, thus they are out of the scope of this paper”. Fig. 1, 2 and related description].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter to teach training the model using document signatures, feature as taught by XU; because the modification enables the system to accurately understand scanned documents by training the model using 2D layout and visual structures.
Claims 15 and 18 are rejected for the same reason as claim 1.
Furthermore, Kletter teaches a processor and non-transitory readable media to perform the claim limitations [fig 1, 12 and related description].
Regarding claim 2, Kletter in view of XU teaches all claim limitations above. Furthermore, XU teaches wherein each template document image of the plurality of template document images is unique among the plurality of template document images [fig. 1 and related description. It’s clear to see each scanned document has different template].
Regarding claims 3 and 16, Kletter in view of XU teaches all claim limitations above. Furthermore, XU teaches a first template document image that is an image of a first edition of a form document, and a second template document image that is an image of a second edition of the form document that is different than the first edition [fig. 1 and related description].
Regarding claims 4 and 17, Kletter teaches wherein, for each template document image of the plurality of template document images, filtering the plurality of regions comprises: applying a first filter to select, from among the plurality of regions, a first plurality of selected regions (smaller subset of connected components); and applying a second filter (remove duplicates module 680) to omit, from among the first plurality of selected regions, at least one selected region to obtain a second plurality of selected regions (list of key points 685) [Para. 54, 160, and 56].
Claims 5-9 are rejected under 35 U.S.C. 103 as being unpatentable over Kletter et al. (Pub. No. US 2009/0324100) in view of Yiheng XU (“LayoutLM: Pre-training of Text and Layout for Document Image understanding”) further in view of King et al. (Patent No. US 8953886).
Regarding claim 5, Kletter teaches wherein, for each template document image of the plurality of template document images, applying the first filter comprises selecting at least one text region from among the plurality of text regions [Para. 50 “The Estimate CC module 630 processes the binary image 625 to gather connected-component elements, and proceeds to histogram the connected-component height, because character height is less variable and more indicative of the font size than character width in most Roman languages”]. Kletter further in view of XU doesn’t explicitly teach that the selection is based at least on a number of characters in the text region. King teaches it is based at least on a number of characters in the text region [Col. 4 line 65-Col. 5 line 5]. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of XU to teach training the model using document signatures, feature as taught by King; because the modification enables the system to improve document template identification by generating more discriminative fingerprints that focus on informative, text-based regions so similar looking forms can be reliably told apart even with very little training data.
Regarding claim 6, Kletter teaches wherein, for each template document image of the plurality of template document images, applying the first filter comprises selecting at least one text region (connected component centroids; keypoints) from among the plurality of text regions [Para. 50 “The Estimate CC module 630 processes the binary image 625 to gather connected-component elements, and proceeds to histogram the connected-component height, because character height is less variable and more indicative of the font size than character width in most Roman languages”; Para. 57 “The list of remaining connected component centroids at the output of the Remove Duplicates module 680 becomes the final candidate query keypoints list 695”]. Kletter further in view of XU doesn’t explicitly teach that the selection is based at least on a number of characters in the text region.
King teaches based at least on a number of characters in the text region and natural language processing (NLP) non-stop words [Col. 4 line 65-Col. 5 line 7]. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of XU to teach the claim limitation, feature as taught by King; because the modification enables the system to improve document template identification by generating more discriminative fingerprints that focus on informative, text-based regions so similar looking forms can be reliably told apart even with very little training data.
Regarding claim 7, Kletter further in view of XU doesn’t explicitly teach the claim limitation. However, King teaches wherein, for each template document image of the plurality of template document images, applying the second filter comprises omitting at least one region of the at least one selected region based on a number of occurrences of the region among the plurality of template document images [fig. 5, 6 and related description]. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of XU to teach the claim limitation, feature as taught by King; because the modification enables the system to improve document template identification by generating more discriminative fingerprints that focus on informative, text-based regions so similar looking forms can be reliably told apart even with very little training data.
Regarding claim 8, Kletter teaches wherein the plurality of regions of interest (candidate query key point list) includes the second plurality of selected regions [Para. 57].
Regarding claim 9, Kletter teaches wherein each fingerprint of the plurality of fingerprints includes a feature vector (string of quantized integers, sequence of about 35 quantized integers in the range of [0,7] and 35-dimensional vector space) that is based on a corresponding plurality of regions of interest (local neighborhoods of key points) [Para. 5 “Form "fingerprints" that may represent the two-dimensional spatial arrangements of local neighborhoods of keypoints. A fingerprint is a string of quantized integers that encode certain distortion-invariant triangle area ratios among the keypoints in each neighborhood.” and Para. 6 “The fingerprints are of high dimension which may be composed of a sequence of about 35 quantized integers in the range of [0,7], which can be interpreted as a 35-dimensional vector space”].
Claims 10-14 are rejected under 35 U.S.C. 103 as being unpatentable over Kletter et al. (Pub. No. US 2009/0324100) in view of Yiheng XU (“LayoutLM: Pre-training of Text and Layout for Document Image understanding”) further in view of Zhang et al. (Pub. No. US 2023/0222285).
Regarding claims 10 and 19, Kletter in view of XU doesn’t explicitly teach the claim limitation. Zhang teaches wherein, for each template document image of the plurality of template document images, each text region of the plurality of text regions indicates: a text string detected within the text region, and a boundary of the text region within a corresponding template document image [Para. 7, and Para. 8 “The layout data can include spatial layout data descriptive of spatial positions of the plurality of blocks within the document”].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of XU to teach the claim limitation, feature as taught by Zhang; because the modification enables the system to improve automatic recognition and classification of document images by building richer multimodal fingerprints from selected text/image regions and using them to train a multimodal model so the system can accurately identify document types with very few templates.
Regarding claims 11 and 20, Kletter in view of XU doesn’t explicitly teach the claim limitation. Zhang teaches wherein, for each template document image of the plurality of template document images, each text region of the plurality of text regions indicates: a text string detected within the text region, and a location and image patch of the text region within a corresponding template document image [Para. 63, and 66-67]. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of XU to teach the claim limitation, feature as taught by Zhang; because the modification enables the system to improve automatic recognition and classification of document images by building richer multimodal fingerprints from selected text/image regions and using them to train a multimodal model so the system can accurately identify document types with very few templates.
Regarding claim 12, Kletter in view of XU doesn’t explicitly teach the claim limitation. Zhang teaches wherein, for each template document image of the plurality of template document images, the plurality of regions includes at least one image region, and each image region of the at least one image region indicates: a boundary of the image region within a corresponding template document image, and image content of the image region [fig. 2, 4, and related description].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of XU to teach the claim limitation, feature as taught by Zhang; because the modification enables the system to improve automatic recognition and classification of document images by building richer multimodal fingerprints from selected text/image regions and using them to train a multimodal model so the system can accurately identify document types with very few templates.
Regarding claim 13, Kletter in view of XU doesn’t explicitly teach the claim limitation. Zhang teaches for each template document image of the plurality of template document images: generating augmented data that is based on information from the template document image, and generating a plurality of training samples that are based on the augmented data, wherein training the multimodal model comprises using the plurality of training samples for each template document image of the plurality of template document images to train the multimodal model [Para. 89-92, 135, fig. 2, and related description]. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of XU to teach the claim limitation, feature as taught by Zhang; because the modification enables the system to improve automatic recognition and classification of document images by building richer multimodal fingerprints from selected text/image regions and using them to train a multimodal model so the system can accurately identify document types with very few templates.
Regarding claim 14, Kletter in view of XU doesn’t explicitly teach the claim limitation. Zhang teaches wherein the multimodal model includes a multimodal transformer model [Para. 37].
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Kletter in view of XU to teach the claim limitation, feature as taught by Zhang; because the modification enables the system to improve automatic recognition and classification of document images by building richer multimodal fingerprints from selected text/image regions and using them to train a multimodal model so the system can accurately identify document types with very few templates.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOLOMON G BEZUAYEHU whose telephone number is (571)270-7452. The examiner can normally be reached on Monday-Friday 10 AM-7 PM. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, O’Neal Mistry can be reached on 313-446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-0101 (IN USA OR CANADA) or 571-272-1000.
/SOLOMON G BEZUAYEHU/ Primary Examiner, Art Unit 2666
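The two-stage region filtering that the rejection attributes to claims 4 through 7 (a first filter that selects text regions by character count and non-stop words, then a second filter that omits regions by how often they occur across the templates) can be illustrated with a minimal Python sketch. This is not the applicant's implementation, nor that of Kletter or King. The `TextRegion` type, the stop-word list, the function names, and the `min_chars` and `max_templates` thresholds are all hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical stop-word list; a real system would likely use a fuller NLP list.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "for", "on"}


@dataclass(frozen=True)
class TextRegion:
    text: str    # text string detected within the region
    box: tuple   # (x0, y0, x1, y1) boundary, assumed to be pixel coordinates


def first_filter(regions, min_chars=4):
    """Select regions with enough characters and at least one non-stop word."""
    kept = []
    for region in regions:
        words = region.text.lower().split()
        has_content_word = any(w not in STOP_WORDS for w in words)
        if len(region.text.strip()) >= min_chars and has_content_word:
            kept.append(region)
    return kept


def second_filter(per_template, max_templates=1):
    """Omit regions whose text occurs in more than max_templates templates."""
    counts = Counter()
    for regions in per_template.values():
        # Count each distinct text once per template.
        counts.update({r.text.lower() for r in regions})
    return {
        name: [r for r in regions if counts[r.text.lower()] <= max_templates]
        for name, regions in per_template.items()
    }


def regions_of_interest(templates, min_chars=4, max_templates=1):
    """Apply the first filter per template, then the second across templates."""
    selected = {n: first_filter(rs, min_chars) for n, rs in templates.items()}
    return second_filter(selected, max_templates)


# Made-up example data: two templates sharing a generic page footer.
templates = {
    "form_a": [
        TextRegion("Page 1 of 2", (10, 900, 120, 920)),
        TextRegion("Vendor Invoice", (40, 30, 300, 60)),
        TextRegion("of", (5, 5, 20, 15)),
    ],
    "form_b": [
        TextRegion("Page 1 of 2", (10, 900, 120, 920)),
        TextRegion("Employee Timesheet", (40, 30, 340, 60)),
    ],
}
roi = regions_of_interest(templates)
```

In this toy data the footer shared by both templates is dropped by the second filter and the lone stop word is dropped by the first, so each template keeps only its distinguishing title region.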

Prosecution Timeline

Mar 14, 2024
Application Filed
Jan 22, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602717
APPARATUS, METHOD, AND COMPUTER-READABLE STORAGE MEDIUM FOR CONTEXTUALIZED EQUIPMENT RECOMMENDATION
2y 5m to grant Granted Apr 14, 2026
Patent 12602946
DOCUMENT CLASSIFICATION USING UNSUPERVISED TEXT ANALYSIS WITH CONCEPT EXTRACTION
2y 5m to grant Granted Apr 14, 2026
Patent 12591350
TECHNIQUES FOR POSITIONING SPEAKERS WITHIN A VENUE
2y 5m to grant Granted Mar 31, 2026
Patent 12586355
ROAD AND INFRASTRUCTURE ANALYSIS TOOL
2y 5m to grant Granted Mar 24, 2026
Patent 12561852
Cross-Modal Contrastive Learning for Text-to-Image Generation based on Machine Learning Models
2y 5m to grant Granted Feb 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
75%
Grant Probability
99%
With Interview (+30.9%)
3y 4m
Median Time to Grant
Low
PTA Risk
Based on 618 resolved cases by this examiner. Grant probability derived from career allow rate.
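As a quick check on how the headline figures relate, the Python sketch below recomputes the career allow rate from the reported counts and backs out the Tech Center baseline implied by each "vs TC avg" delta. It assumes the deltas are percentage-point differences, which the page does not state; the variable names are illustrative only.

```python
# Counts reported on this page for the examiner.
granted, resolved = 464, 618

# Career allow rate as a percentage; the page rounds this to 75%.
allow_rate = 100 * granted / resolved

# (examiner rate, delta vs TC avg) as reported in the statute-specific section.
statute_rates = {
    "§101": (16.0, -24.0),
    "§103": (49.7, 9.7),
    "§102": (13.4, -26.6),
    "§112": (11.7, -28.3),
}

# Assumption: delta = examiner rate - Tech Center average, in percentage points.
implied_tc_avg = {
    statute: round(rate - delta, 1)
    for statute, (rate, delta) in statute_rates.items()
}

# Same assumption applied to the career allow rate (+13.1% vs TC avg).
implied_tc_allow_rate = round(round(allow_rate, 1) - 13.1, 1)

print(f"Career allow rate: {allow_rate:.1f}%")
print(f"Implied TC averages by statute: {implied_tc_avg}")
print(f"Implied TC allow rate: {implied_tc_allow_rate}%")
```

Under that assumption every statute-specific delta points to the same baseline, which is consistent with the page describing the Tech Center average as an estimate.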
