DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending in the application. Claims 1, 14 and 17-20 have been amended.
The amendment filed 3/30/2026 overcomes objections to claims 1 (partially), 14 and 17.
The amendment filed 3/30/2026 overcomes the rejection of claims 18-20 under 35 USC 101.
Response to Arguments
Applicant’s arguments filed 3/30/2026, with respect to 35 USC 102 rejection applied to claim(s) 18-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant's arguments filed 3/30/2026, with respect to the 35 USC 103 rejection applied to claim 13, have been fully considered but they are not persuasive. A response is presented below.
Arguments (REMARKS pages 9-10, annotation added)
[Image: media_image1.png, annotated excerpt of Applicant's arguments from REMARKS pages 9-10]
Response
During patent examination, the pending claims must be "given their broadest reasonable interpretation consistent with the specification." Phillips v. AWH Corp., 415 F.3d 1303, 1316, 75 USPQ2d 1321, 1329 (Fed. Cir. 2005). Under a broadest reasonable interpretation (BRI), words of the claim must be given their plain meaning, unless such meaning is inconsistent with the specification. "Though understanding the claim language may be aided by explanations contained in the written description, it is important not to import into a claim limitations that are not part of the claim. For example, a particular embodiment appearing in the written description may not be read into a claim when the claim language is broader than the embodiment." Superguide Corp. v. DirecTV Enterprises, Inc., 358 F.3d 870, 875, 69 USPQ2d 1865, 1868 (Fed. Cir. 2004). See MPEP 2111.01.
Claim 13 recites “generating a first image description that the multi-modal transformer-based LLM determines is conveyed by the image”, “generating a second image description based on the activated multi-modal transformer-based LLM using the accessed text as an input”, and “generating an image description based on the first image description and the second image description”. This claim language broadly recites generating first and second image descriptions using a language model and generating an overall description using the first and second descriptions. Examiner applied the broadest reasonable interpretation to these limitations. Applicant’s interpretation (see the annotated parts above, especially (B) and (C)), however, imports limitations from the specification into the claim. For example, for part (C), the claim language merely recites “based on the first image description and the second image description”. Nothing regarding “comparison, reconciliation or synthesis” is reflected in the claim language.
For the argument part (A), Zhang discloses in para. [0047] “In the one or more embodiments, the natural language MLM (144) takes the input (102) and generates the text (108) as output. Again, the text (108) may be keywords, as described above”. The input 102, “may be related to one or more images, one or more texts, or both” (para. [0023]).
Claim Objections
The following informality issues are identified.
--- Claim 1, line 15: “the extracted image” lacks antecedent basis.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 13-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (US 20240256597 A1), in view of Amthor et al. (US 20250102788 A1, hereafter Amthor).
As per claim 13, Zhang teaches a method (Abstract), comprising:
accessing an image (FIG. 1A #102 “INPUT”, #104 “FIRST DATA STRUCTURE” and #108 “IMAGE”; FIG. 2 #200; para. [0021] “The data repository (100) stores an input (102). The input (102) is data in the form of a data structure that serves as input to one or more machine learning models. In particular, the 102 is one or more images, one or more texts, or both, that will be used to select images of interest using an index”; Therefore there are 3 choices for the input: text, or image, or both. Note Zhang further discloses a method for building an INDEX (FIG. 1A #112), which is a second data structure (#114) with which the first data structure compares, the INDEX including pre-determined image #118. Therefore either the image in the input or the pre-determined image associated with the INDEX can be interpreted as the recited image. Since Zhang uses similar algorithms to generate the embedding of INPUT and build the INDEX, in the following the citation may come from either path or both.);
activating a multi-modal […] 142), the natural language MLM (144), and the image processing MLM (146) (FIG. 1A). At least the combination of the natural language MLM and the image processing MLM can be regarded as a multi-modal language model. Zhang further discloses that image features may be encoded by a vision portion of the CLIP model, and the CLIP model may return text features encoded by a language portion of the CLIP model (para. [0082]). That is, Zhang trains and applies a CLIP (Contrastive Language-Image Pre-training) model as the multi-modal language model (para. [0067]; FIG. 1B)),
generating a first image description that the multi-modal […] 302 includes outputting, by the image processing machine learning model, an image vector that represents the images”; FIG. 4A #400);
accessing text that describes the image (Zhang FIG. 1A #116 PRE-DETERMINED TEXT; para. [0030] “In particular, with respect to the index (112), the pre-determined text (116) has known relationships to the pre-determined images (118). In other words, each pre-determined image in the pre-determined images (118) is related to one or more instances of pre-determined text (118) that describes the corresponding pre-determined image (118)”);
activating the multi-modal […]
generating a second image description based on the activated multi-modal […] (Zhang para. [0100] “The text input (416) is the corpus. The corpus is provided as input to a KeyBERT natural language processing (NLP) machine learning model at step (420). The output of the NLP machine learning model is a series of keywords (422)”); and
generating an image description based on the first image description and the second image description (Zhang para. [0022] “The input (102) may take the form of a first data structure (104). The first data structure (104) is a vector. A vector is a type of data structure that a machine learning model may take as input when a processor executes the machine learning model. A vector may take form of a matrix …”; para. [0027] “For example, the first data structure (104) may be an M by N matrix, where each instance of text is a feature that is described by rows of the matrix, and each instance of an image is a feature that is described by columns of the matrix. Thus, each cell in the matrix is a pair of values, such as “I,” representing the value of the corresponding image feature, and “T,” representing the value of the corresponding text feature. In this specific example, the relationship between I and T in the cell is not defined explicitly”; para. [0028] “Nevertheless, it is possible that the explicit relationship between a given image and a given text is stored in the first data structure (104). For example, each cell of the M by N matrix described above may include a third entry describing a degree of match. In yet another alternative, the matrix may be a M by N by O matrix, where the third dimension of “O” values define the relationships between the images in row M and the texts in column N”. That is, the model compares the first image description (image feature) and the second image description (text feature), and generates a composite vector (matrix) comprising the image feature, the text feature and the relationship between them).
Zhang teaches every limitation recited in claim 13 except that the multi-modal language model is a transformer-based Large Language Model (LLM).
Amthor, in an analogous field, discloses a method for retrieving a microscope image from a database by means of a large language model (para. [0125]). A textual input describing a desired microscope image is received for loading or retrieving a microscope image (para. [0126]). The textual input is input into the large language model, which is trained to calculate desired microscope image properties from the textual input. A microscope image is subsequently loaded from a database that contains microscope images as a function of the desired microscope image properties (para. [0127]). In particular, the large language model can comprise a text encoder for mapping the microscope image features to a point in the feature space, and an image encoder for mapping the microscope images into the feature space. Both encoders can be transformer-based (para. [0130]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Zhang to incorporate the teaching of Amthor to use a transformer-based multi-modal Large Language Model (LLM) for feature embedding and image retrieval. The motivation for applying a transformer-based LLM is that a transformer comprises attention blocks (self-attention blocks for transformer encoders) which respectively process an input sequence as a whole and not sequentially, a feature that enables faster processing through parallel computation, as recognized by Amthor (para. [0009]).
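For illustration only, the property relied upon in the motivation above (an attention block processes the input sequence as a whole rather than sequentially) can be sketched as a single-head self-attention step. This sketch is not taken from Zhang or Amthor. The function names `softmax` and `self_attention`, and the use of identity query/key/value projections, are simplifying assumptions made here.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention over a list of equal-length token vectors.

    Assumption: queries, keys and values are the token vectors themselves
    (identity projections); a trained transformer would use learned ones.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    # Row i of the score matrix depends only on token i and the full input,
    # never on another row's result, so all rows could be computed in parallel.
    scores = [
        [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        for query in tokens
    ]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all token vectors.
    outputs = [
        [sum(w * value[c] for w, value in zip(row, tokens)) for c in range(dim)]
        for row in weights
    ]
    return weights, outputs


weights, outputs = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because no row of the computation waits on another row, the work maps naturally onto parallel hardware, in contrast to a recurrent model that must consume tokens one after another.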
As per claim 14, dependent upon claim 13, Zhang in view of Amthor teaches that generating the image description comprises:
activating the multi-modal transformer-based LLM based on an input instruction to compare the first image description and the second image description (See below); and
generating the image description as an output of the activated multi-modal transformer-based LLM based on the instruction to compare the first image description and the second image description
(Zhang para. [0022] “The input (102) may take the form of a first data structure (104). The first data structure (104) is a vector. A vector is a type of data structure that a machine learning model may take as input when a processor executes the machine learning model. A vector may take form of a matrix …”; para. [0027] “For example, the first data structure (104) may be an M by N matrix, where each instance of text is a feature that is described by rows of the matrix, and each instance of an image is a feature that is described by columns of the matrix. Thus, each cell in the matrix is a pair of values, such as “I,” representing the value of the corresponding image feature, and “T,” representing the value of the corresponding text feature. In this specific example, the relationship between I and T in the cell is not defined explicitly”; para. [0028] “Nevertheless, it is possible that the explicit relationship between a given image and a given text is stored in the first data structure (104). For example, each cell of the M by N matrix described above may include a third entry describing a degree of match. In yet another alternative, the matrix may be a M by N by O matrix, where the third dimension of “O” values define the relationships between the images in row M and the texts in column N”. That is, the model compares the first image description (image feature) and the second image description (text feature), and generates a composite vector (matrix) comprising the image feature, the text feature and the relationship between them).
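For illustration only, the matrix described in the cited paragraphs [0027]-[0028] of Zhang can be sketched as below: rows correspond to text features, columns to image features, and each cell holds the pair (I, T) with an optional third entry for a degree of match. The names `Cell` and `build_matrix` and the formula used for the degree of match are hypothetical and are not disclosed by Zhang; the quoted passages do not specify how the third entry is computed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    image_value: float             # "I": value of the corresponding image feature
    text_value: float              # "T": value of the corresponding text feature
    match: Optional[float] = None  # optional third entry: degree of match


def build_matrix(image_features: List[float],
                 text_features: List[float],
                 with_match: bool = False) -> List[List[Cell]]:
    """Build an M-by-N matrix with one row per text feature and one column
    per image feature, pairing every image feature with every text feature."""
    matrix = []
    for t in text_features:
        row = []
        for i in image_features:
            # Hypothetical degree of match: 1 minus the absolute difference.
            match = 1.0 - abs(i - t) if with_match else None
            row.append(Cell(image_value=i, text_value=t, match=match))
        matrix.append(row)
    return matrix


# Three text features (rows) against two image features (columns).
matrix = build_matrix([0.2, 0.9], [0.5, 0.1, 0.7], with_match=True)
```

With `with_match=False` the relationship between I and T is left implicit, as in the first quoted example; with `with_match=True` each cell carries an explicit relationship entry, as in the second.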
As per claim 15, dependent upon claim 13, Zhang in view of Amthor teaches the method further comprising:
generating, based on the image description, a vector for the image (Zhang para. [0022] “The input (102) may take the form of a first data structure (104). The first data structure (104) is a vector”; para. [0029] “The second data structure (114) may have a similar structure as the first data structure (104)”; para. [0024] “The image (106), the text (108), or both, may be represented as a vector, as described above with respect to the first data structure (104)”; para. [0048] “the image processing MLM (146) may embed an image as a vector”); and
storing the vector in a semantically searchable image database to make the image semantically searchable in the semantically searchable image database based on the image description (Zhang FIG. 1A #118 PRE-DETERMINED IMAGES; para. [0047]; para. [0093] “The nearest neighbor model indexes the image embeddings (i.e., the image vector) so that the images may be searched more easily. The function of the nearest neighbor machine learning model is to act as a search algorithm so that relevant results of images may be returned quickly, relative to other search techniques. Thus, the output of the execution of the nearest neighbor machine learning model is the index”; FIG. 4B #428, 430, 432 showing retrieved images).
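For illustration only, storing image vectors and searching them by similarity can be sketched as below. This is an exhaustive cosine-similarity search, swapped in here for simplicity; it is not the nearest neighbor machine learning model of Zhang. The class name `ImageIndex`, its methods, and the hand-written three-dimensional vectors are hypothetical; in practice the vectors would come from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ImageIndex:
    """In-memory store of (image id, vector) pairs with brute-force search."""

    def __init__(self):
        self._entries = []

    def add(self, image_id, vector):
        # Storing the vector is what makes the image searchable later.
        self._entries.append((image_id, list(vector)))

    def search(self, query, k=3):
        # Score every stored vector against the query, then keep the top k.
        scored = [(cosine(query, vec), image_id) for image_id, vec in self._entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(image_id, score) for score, image_id in scored[:k]]


index = ImageIndex()
index.add("cat", [1.0, 0.0, 0.0])
index.add("dog", [0.0, 1.0, 0.0])
index.add("kitten", [0.9, 0.1, 0.0])
results = index.search([1.0, 0.0, 0.0], k=2)
```

An exhaustive scan grows linearly with the number of stored images, which is the cost that a dedicated nearest-neighbor index is intended to reduce.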
As per claim 16, dependent upon claim 15, Zhang in view of Amthor teaches the method further comprising:
accessing an input query comprising an input image (Zhang FIG. 4C showing the input includes a text input #434 and an image input #436; para. [0021] “The data repository (100) stores an input (102). The input (102) is data in the form of a data structure that serves as input to one or more machine learning models. In particular, the 102 is one or more images, one or more texts, or both, that will be used to select images of interest using an index”; Therefore the input has 3 choices: text, or image, or both);
determining a description of the input image (Zhang FIG. 1A #108, #110; para. [0027]);
generating an input vector based on the description of the input image (Zhang FIG. 1A shows “INPUT” #102 being a first data structure #104, including image (106), which is a digital image that represents at least a portion of the input (102), and text (108), which is one or more keywords (including a phrase) that represent at least a portion of the input (102). The image (106), the text (108), or both, may be represented as a vector, as described above with respect to the first data structure (104) (para. [0024]). “The first data structure (104) also describes first relationships (110) among the image (106) and the text (108). The first relationships (110) take the form of data that describes how the image (106) and the text (108) are related” (para. [0026]). “For example, the first data structure (104) may be an M by N matrix, where each instance of text is a feature that is described by rows of the matrix, and each instance of an image is a feature that is described by columns of the matrix. Thus, each cell in the matrix is a pair of values, such as “I,” representing the value of the corresponding image feature, and “T,” representing the value of the corresponding text feature. In this specific example, the relationship between I and T in the cell is not defined explicitly. However, because every image feature in the matrix is associated with every text feature in the matrix, the matrix can still be used to identify the relationships of texts to images when compared to another matrix that also relates texts to images” (para. [0027]). Therefore, the input comprises an image input, and there is an associated text description. The image feature, the text feature and the relationship can be embedded into a vector (matrix)); and
identifying one or more semantically similar images in the semantically searchable image database based on semantic similarity between the input vector and a plurality of vectors in the semantically searchable image database, wherein each of the plurality of vectors corresponds to a respective image that was previously vectorized for semantic search (Zhang FIG. 1A #122 being retrieved matching images; FIG. 2 #204; para. [0081] “Step 204 includes comparing the first data structure to an index including a second data structure that defines second relationships among pre-determined texts and pre-determined images. The first and second data structures may be compared by a number of different techniques”; para. [0096] “Using the index includes converting a raw input to a first data structure that defines relationships among input images and input texts. The resulting data structure is compared to the index. Then, a subset of the images is returned. The subset is those images of the images that correspond to first entries in the index which satisfy a matching criterion when compared to second entries in the index”).
As per claim 17, dependent upon claim 15, Zhang in view of Amthor teaches the method further comprising:
accessing an input query comprising an input text (Zhang FIG. 4B #416, #418);
generating an input vector based on the input text (Zhang FIG. 4B #420; para. [0102] “Step 420 also includes embedding the keywords using a CLIP machine learning model. The output of the CLIP machine learning model is a vector which may be compared to the index (414) shown in FIG. 4A”); and
identifying one or more semantically similar images in the semantically searchable image database based on semantic similarity between the input vector and a plurality of vectors in the semantically searchable image database, wherein each of the plurality of vectors corresponds to a respective image that was previously vectorized for semantic search (Zhang FIG. 4B #428, #430 and #432 being retrieved images; para. [0047]).
As per claim 18, an independent claim, Zhang teaches a non-transitory computer readable medium that stores instructions, the instructions when executed by a processor programs the processor to (Zhang FIG. 1A #138; para. [0041]-[0042]):
access an input query comprising an input to search for images in an image database (FIG. 1A #102 “INPUT”, #118 being an image database in the second data structure “INDEX”; FIG. 2 #200; FIG. 4B #416 “Text Input”; FIG. 4C #434 “TEXT INPUT”);
generate, based on an activated multi-modal language model, a text description to search based on the input (FIG. 4B #420 generating keywords based on text input, the generated keywords being text description; para. [0100] “The text input (416) is the corpus. The corpus is provided as input to a KeyBERT natural language processing (NLP) machine learning model at step (420). The output of the NLP machine learning model is a series of keywords (422), including “Halloween cats witches ghosts”, “Halloween cats,” and “ways Halloween cats witches.””; para. [0102] “Step 420 also includes embedding the keywords using a CLIP machine learning model. The output of the CLIP machine learning model is a vector which may be compared to the index (414) shown in FIG. 4A”; The combination of KeyBERT and CLIP is a multi-modal language model);
generate an input vector based on the text description (FIG. 2 #202; para. [0078] “Step 202 includes embedding the input into a first data structure that defines first relationships among images and texts. As described above, embedding is the process of converting one form of data into a vector suitable for use as input to a machine learning model”; para. [0079] “For example, when the input is text, the input may be embedded by entering one or more values in a vector for a feature that corresponds to the text”);
compare the input vector against a plurality of vectors in the image database (FIG. 2 #204; para. [0081] “Step 204 includes comparing the first data structure to an index including a second data structure that defines second relationships among pre-determined texts and pre-determined images”), each vector from among the plurality of vectors in the image database being based on a text description of a corresponding image in the image database (FIG. 1A, #112 “INDEX”; FIG. 3; para. [0022] “The input (102) may take the form of a first data structure (104). The first data structure (104) is a vector”; para. [0029] “The data repository (100) also store an index (112). The index (112) is similar to the input (102), in that the index (112) is a second data structure (114) that defines the relationships between pre-determined text (116) and pre-determined images (118). The second data structure (114) may have a similar structure as the first data structure (104)”);
identify one or more images in the image database based on the comparison, wherein each of the one or more images has a corresponding text description that is semantically similar to the text description (FIG. 2 #206; FIG. 4B output images #428, #430, and #432; para. [0047] “In the one or more embodiments, the natural language MLM (144) takes the input (102) and generates the text (108) as output. Again, the text (108) may be keywords, as described above. In other words, a corpus of text may serve as input to the second data structure (114), which predicts which words in the corpus of text are keywords that represent the semantic meaning of the corpus of text”; para. [0089] “… a ranked subset of images may be returned to the user in order to show those pre-determined images predicted to be most likely to be similar to the input. Still other variations are possible”).
Zhang teaches every limitation recited in claim 18 except that the multi-modal language model is a transformer-based Large Language Model (LLM).
Amthor, in an analogous field, discloses a method for retrieving a microscope image from a database by means of a large language model (para. [0125]). A textual input describing a desired microscope image is received for loading or retrieving a microscope image (para. [0126]). The textual input is input into the large language model, which is trained to calculate desired microscope image properties from the textual input. A microscope image is subsequently loaded from a database that contains microscope images as a function of the desired microscope image properties (para. [0127]). In particular, the large language model can comprise a text encoder for mapping the microscope image features to a point in the feature space, and an image encoder for mapping the microscope images into the feature space. Both encoders can be transformer-based (para. [0130]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Zhang to incorporate the teaching of Amthor to use a transformer-based multi-modal Large Language Model (LLM) for feature embedding and image retrieval. The motivation for applying a transformer-based LLM is that a transformer comprises attention blocks (self-attention blocks for transformer encoders) which respectively process an input sequence as a whole and not sequentially, a feature that enables faster processing through parallel computation, as recognized by Amthor (para. [0009]).
As per claim 19, dependent upon claim 18, Zhang in view of Amthor teaches wherein the input comprises an image input (Zhang FIG. 4C showing the input includes a text input #434 and an image input #436; para. [0021] “The data repository (100) stores an input (102). The input (102) is data in the form of a data structure that serves as input to one or more machine learning models. In particular, the 102 is one or more images, one or more texts, or both, that will be used to select images of interest using an index”; Therefore the input has 3 choices: text, or image, or both.), and wherein the instructions when executed further programs the processor to:
determine the text description based on the input image (See below); and
generating an input vector based on the description of the input image (Zhang FIG. 1A shows “INPUT” #102 being a first data structure #104, including image (106), which is a digital image that represents at least a portion of the input (102), and text (108), which is one or more keywords (including a phrase) that represent at least a portion of the input (102). The image (106), the text (108), or both, may be represented as a vector, as described above with respect to the first data structure (104) (para. [0024]). “The first data structure (104) also describes first relationships (110) among the image (106) and the text (108). The first relationships (110) take the form of data that describes how the image (106) and the text (108) are related” (para. [0026]). “For example, the first data structure (104) may be an M by N matrix, where each instance of text is a feature that is described by rows of the matrix, and each instance of an image is a feature that is described by columns of the matrix. Thus, each cell in the matrix is a pair of values, such as “I,” representing the value of the corresponding image feature, and “T,” representing the value of the corresponding text feature. In this specific example, the relationship between I and T in the cell is not defined explicitly. However, because every image feature in the matrix is associated with every text feature in the matrix, the matrix can still be used to identify the relationships of texts to images when compared to another matrix that also relates texts to images” (para. [0027]). Therefore, the input comprises an image input, and there is an associated text description. The image feature, the text feature and the relationship can be embedded into a vector (matrix).)
As per claim 20, dependent upon claim 18, Zhang in view of Amthor teaches the input comprises a text input comprising the text description (Zhang FIG. 4B #416, #418; para. [0100]).
Allowable Subject Matter
Claims 1-12 are allowed. The following is an examiner’s statement of reasons for identification of allowable subject matter.
Claim 1 recites a system comprising one or more processors to:
identify an image in an electronic document and identify a location of the extracted image in the electronic document;
recognize text in the image based on optical character recognition and store the recognized text in association with the image and the location of the image in the electronic document;
execute one or more document layout models to extract: an image header in the electronic document that labels the image, a figure description that provides descriptive context about the image, and document text that from the electronic document in a location other than the location of the image in the electronic document;
activate a multi-modal transformer-based Large Language Model (LLM), using the document text as an input to the multi-modal transformer-based LLM, to identify relevant text, from among the document text, that the multi-modal transformer-based LLM deems to be descriptive of the image;
generate an image description based on the extracted image, the location, the image header, the figure description, and the relevant text; and
generate a vector for the image that is semantically searchable based on the image description.
The following prior art is considered related to the subject matter recited in claim 1.
Wells et al. (US 20240221411 A1) discloses a document evaluator comprising one or more of an OCR engine and one or more of an object detection engine (FIG. 3). When an image of a document is input into the document evaluator, the OCR engine detects text and textual characters, and the object detection engine detects other objects including holograms, seals, watermarks, laser perforations, etc. These objects include images, such as the facial image 510 and ghost image 520 in FIG. 5. Bounding boxes associated with detected objects are generated and information describing the bounding boxes is derived (FIG. 6-7 showing bounding boxes for detected objects; FIG. 8 showing location information with respect to the bounding boxes; para. [0072], [0075], [0077]-[0078]).
Walker et al. (US 20120240039 A1) teaches using a layout model to extract layout information in a document image. The layout model segments the document image into various areas, such as paragraphs of words, photo or image captions, headers, headlines and title, tags or identifiers in charts or graphs, etc. (para. [0115]). The model then determines a layout of the document (e.g., a positioning of each of the areas relative to one another and relative to any areas which do not include text, such as images, graphs, charts, etc.) and/or any aesthetic or layout characteristics of each area (e.g., font size, font style, font color, line numbering, line spacing, etc.) (FIG. 4; para. [0115]).
Zhang (US 20240256597 A1) discloses a multi-modal language model for generating image description based on an image and associated text (FIG. 1A, 4A; para. [0026]-[0029]).
Amthor et al. (US 20250102788 A1) discloses a transformer-based large language model (LLM) which comprises a text encoder for mapping image features to a point in the feature space, and an image encoder for mapping the images into the feature space (para. [0130]).
None of Wells, Walker, Zhang, or Amthor, alone or in combination, discloses the limitations recited in claim 1. Claims 2-12 contain allowable subject matter by virtue of their dependency on claim 1.
Prior art searched but not cited is recorded in PTO-892.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Contact
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XUEMEI G CHEN whose telephone number is (571)270-3480. The examiner can normally be reached Monday-Friday 9am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John M Villecco can be reached on (571) 272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/XUEMEI G CHEN/Primary Examiner, Art Unit 2661