Last updated: May 29, 2026
Application No. 18/345,990
ONE-SHOT DOCUMENT SNIPPET SEARCH

Non-Final OA §103
Filed
Jun 30, 2023
Examiner
DWIVEDI, MAHESH H
Art Unit
2168
Tech Center
2100 — Computer Architecture & Software
Assignee
Adobe Inc.
OA Round
3 (Non-Final)
Interview Optional

— +4.5% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 69% grant rate with +4.5% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.
Based on 754 resolved cases, 2023–2026
Examiner Intelligence

DWIVEDI, MAHESH H View full profile →
Grants 69% — above average
Career Allowance Rate
523 granted / 754 resolved
+14.4% vs TC avg
Minimal +4% lift
Without
With
+4.5%
Interview Lift
resolved cases with interview
Typical timeline
3y 7m
Avg Prosecution
20 currently pending
Career history
774
Total Applications
across all art units
Statute-Specific Performance

§101
5.9%
-34.1% vs TC avg
§103
76.0%
+36.0% vs TC avg
§102
11.2%
-28.8% vs TC avg
§112
4.4%
-35.6% vs TC avg
Black line = Tech Center average estimate • Based on career data from 754 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Continued Examination Under 37 CFR 1.114
2.	A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 01/21/2026 has been entered.
Response to Amendment
3.         Receipt of Applicant’s Amendment filed on 01/21/2026 is acknowledged.  The amendment includes the amending of claims 1, 8, and 15.
Claim Rejections - 35 USC § 103
4.	In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
5.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
6.	This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
7.	Claims 1-4, 8-11, 15, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Gkoumas et al. (Article entitled “Investigating Non-Classical Correlations Between Decision Fused Multi-Modal Documents”, dated 26 October 2018), in view of Chen et al. (Article entitled “Adaptive Image Transformer for One-Shot Object Detection”, dated 2021), and further in view of Jose et al. (Article entitled “A Retrieval Mechanism for Semi-Structured Photograph Collections”, dated 1997) and further in view of Chae (U.S. PGPUB 2020/0410397).
8.	Regarding claims 1 and 8, Gkoumas teaches a method and non-transitory computer-readable medium comprising: 
A)  obtaining a query snippet and a target document (Page 4, Figure 5).
	The examiner notes that Gkoumas teaches “obtaining a query snippet and a target document” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4).  The examiner further notes that the example received multi-modal query and multi-modal doc 1 (See Figure 5) teaches the claimed query snippet and target document respectively. 
	Gkoumas does not explicitly teach:
B)  combining, by a multi-modal snippet detection model, first multi-modal features from the query snippet and second multi-modal features from the target document to create a feature volume; 
C)  wherein the first multi-modal features and the second multi-modal features are multi-dimensional representations of the query snippet and the target document;
F)  identifying, by the multi-modal snippet detection model, one or more matching snippets from the target document that match the query snippet based on the feature volume;
H)  generating an augmented target document;
I)  the augmented target document including the one or more matching snippets extracted from the target document.
	Chen, however, teaches “combining, by a multi-modal snippet detection model, first multi-modal features from the query snippet and second multi-modal features from the target document to create a feature volume” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1), “wherein the first multi-modal features and the second multi-modal features are multi-dimensional representations of the query snippet and the target document” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Following the Transformer [31], we define a basic attention function f by… where v is an embedded feature vectors of dimension dv, and k, q are embedded feature vectors of dimension dk… we express the attention function by… In this work, we use the default number of attention heads, i.e., h = 8, and dv = dk = dm/h = 64.  Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1), “identifying, by the multi-modal snippet detection model, one or more matching snippets from the target document that match the query snippet based on the feature volume” as “we propose the Adaptive Image Transformer (AIT) module that deploys an attention based encoder-decoder architecture to simultaneously explore intra-coder and inter-coder (i.e., each proposal-query pair) attention. The adaptive nature of our design turns out to be flexible and effective in addressing the one-shot learning scenario. With the informative attention cues, the proposed model excels in predicting the class-similarity between the target image proposals and the query image patch” (Abstract) and “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q… Observe that, owing to the attended target feature F I involving the weighted features between I and Q, it is expected that RPN could generate proposals more relevant to the query Q and hence more suitable for the one-shot object detection task” (Section 3.1), “generating an augmented target document” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1), “Figure 5 visualizes the transformed deep visual feature maps in specific channels to realize the advantage of our adaptive image transformer. Each feature map means the specific feature channel selected from the ResNet-50 stage-4, and our AIT module translates it concerning the query feature and hence generates the translated feature. The results show that the correct proposal (red box in the target image) has a better translated feature quality, i.e., more similar to the query feature. As a result, our AIT module helps learn metrics for ranking target proposals. Figure 6 shows the usage of our one-shot object detection model. Our model is able to detect the correct query-class object even with the problematic query-patch covering the partial object regions” (Section 4.3), “The left-most column from top to bottom shows the query patch and target image; the right three columns of the top row show the 254th feature channel corresponding to the low-quality proposal (cyan box in the target image)” (Figure 5), and “Our one-shot object detection model is able to make a target image result in the different detected regions with respect to the different query image patches” (Figure 6), and “the augmented target document including the one or more matching snippets extracted from the target document” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1), “Figure 5 visualizes the transformed deep visual feature maps in specific channels to realize the advantage of our adaptive image transformer. Each feature map means the specific feature channel selected from the ResNet-50 stage-4, and our AIT module translates it concerning the query feature and hence generates the translated feature. The results show that the correct proposal (red box in the target image) has a better translated feature quality, i.e., more similar to the query feature. As a result, our AIT module helps learn metrics for ranking target proposals. Figure 6 shows the usage of our one-shot object detection model. Our model is able to detect the correct query-class object even with the problematic query-patch covering the partial object regions” (Section 4.3), “The left-most column from top to bottom shows the query patch and target image; the right three columns of the top row show the 254th feature channel corresponding to the low-quality proposal (cyan box in the target image)” (Figure 5), and “Our one-shot object detection model is able to make a target image result in the different detected regions with respect to the different query image patches” (Figure 6).
	The examiner further notes that the secondary reference of Chen teaches the concept of using symmetric attention to concatenate (i.e. combine) query and target features.  Such concatenated features are used to generate a multi-dimensional “feature volume” that is used to subsequently determine query matches at a target.  Moreover, Chen teaches an augmented (See the example bounded box in Figures 5-6) target document that depicts one or more matching snippets.  The combination would result in concatenating the multi-modal query and document features of Gkoumas via the use of symmetric attention to generate a “feature volume” to perform query matching resulting in displayed target documents that depict matching one or more snippets.  
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Chen’s would have allowed Gkoumas’s to provide a method for predicting query relevance, as noted by Chen (Section 5.3).
	Gkmoumas and Chen do not explicitly teach:
D)  including text features and spatial features from the query snippet and the target document;
E)  the spatial features representing a layout of the query snippet and the target document.
	Jose, however, teaches “including text features and spatial features from the query snippet and the target document” as “A novel indexing scheme for photographic materials is described. We use spatial features, which are objects and their location, as photographic features” (Abstract), “The various querying mechanisms are explained in the next section and for the sake of this discussion we assume that there are two types of querying: text and picture querying. Searchers can use either of these or both. Thus, a query can also be seen as a document: it also has two components, a picture query component and a text query component” (Section 4, Page 282), and “The upper left part of the figure is a multi-modal query interface and the bottom left is a result viewer. A document viewer is at the right hand side of the interface. The query interface supports various sorts of query mechanisms and the result viewer provides a thumb view of the retrieved documents. A searcher can select documents from the result viewer and view them in the document viewer” (Section 5, Page 288) and “One component of the query interface is a query canvas where searchers can sketch spatial queries. It also has a text-based query interface where searchers can issue text queries. Searchers can also specify their confidence in each query component” (Section 5, Pages 288-289) and “the spatial features representing a layout of the query snippet and the target document” as “A novel indexing scheme for photographic materials is described. We use spatial features, which are objects and their location, as photographic features” (Abstract), “The various querying mechanisms are explained in the next section and for the sake of this discussion we assume that there are two types of querying: text and picture querying. Searchers can use either of these or both. Thus, a query can also be seen as a document: it also has two components, a picture query component and a text query component” (Section 4, Page 282), and “The upper left part of the figure is a multi-modal query interface and the bottom left is a result viewer. A document viewer is at the right hand side of the interface. The query interface supports various sorts of query mechanisms and the result viewer provides a thumb view of the retrieved documents. A searcher can select documents from the result viewer and view them in the document viewer” (Section 5, Page 288) and “One component of the query interface is a query canvas where searchers can sketch spatial queries. It also has a text-based query interface where searchers can issue text queries. Searchers can also specify their confidence in each query component” (Section 5, Pages 288-289).
	The examiner further notes that although Gkmoumas clearly teaches textual features in its multi-modal querying system, there is no explicit teaching of its features including textual features and spatial features.  Nevertheless, the secondary reference of Jose teaches the concept of a multi-modal querying system including textual features and spatial features (which teaches the undefined claimed layout in the broadest reasonable interpretation).  The combination would result in expanding the multi-modal querying systems of Gkmoumas and Chen to expand its features.   
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Jose’s would have allowed Gkoumas’s and Chen’s to provide a method for improving the effectiveness of query systems, as noted by Jose (Section 1, Page 277).
Gkmoumas, Chen, and Jose do not explicitly teach:
J)  displaying the augmented target document.
	Chae, however, teaches “displaying the augmented target document” as “FIG. 7 is a diagram illustrating how a result of similar image search is displayed in the user terminal 20. As shown in FIG. 7, an image selection screen G1 for a user to select an input image is displayed on the display unit 25 of the user terminal 20. When the user selects an input image from the input form F10 and selects a button B11, the input image is uploaded to the server 10, and the similar image search is performed by the search unit 104.  Subsequently, the display control unit 105 sends the box information of the input image, the image data of the image to be searched, and the box information of the image to be searched to the user terminal 20. Upon receiving the data and information, the user terminal 20 displays a search result screen G2 for displaying the result of the similar image search on the display unit 25. A bounding box B22A is displayed on the input image selected by the user in a display area A20 of the search result screen G2, and bounding boxes B22B and B22C are displayed on each image to be searched a display area A21” (Paragraphs 130-131).
	The examiner further notes that although Chen clearly “augments” its target document via bounded boxes (See Figures 5-6), there is no explicit teaching that such augmented target documents are ever actually displayed to a querying user.  Nevertheless, the secondary reference of Chae teaches the concept of displaying an augmented document (See the example bounded boxes B22B and B22C) as shown in Figure 7.  The combination would result in displaying the augmented target documents of Chen.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Chae’s would have allowed Gkoumas’s and Chen’s to provide a method for readily recognizing similarities in search results, as noted by Chae (Paragraph 155).

Regarding claims 2 and 9, Gkoumas teaches a method and non-transitory computer-readable medium comprising:
A)  extracting, by a plurality of encoders, the first multi-modal features from the query snippet and the second multi-modal features from the target document (Pages 4 and 11).
	The examiner notes that Gkoumas teaches “extracting, by a plurality of encoders, the first multi-modal features from the query snippet and the second multi-modal features from the target document” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual features ((i.e. first modal features) that are subsequently vectorized) and image features ((i.e. second modal features) that are subsequently vectorized) of a multi-modal query and a multi-modal document teaches the claimed extracting.

Regarding claims 3 and 10, Gkoumas teaches a method and non-transitory computer-readable medium comprising:
A)  wherein the plurality of encoders includes one or more of a text encoder, an image encoder, and a layout encoder (Pages 4 and 11).
	The examiner notes that Gkoumas teaches “wherein the plurality of encoders includes one or more of a text encoder, an image encoder, and a layout encoder” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual features and image features of a multi-modal query and a multi-modal document teaches are vectorized via the use of textual and image encoders respectively.

	Regarding claims 4 and 11, Gkoumas further teaches a method and non-transitory computer-readable medium comprising:
A)  wherein combining, by a multi-modal snippet detection model, first multi-modal features from the query snippet and second multi-modal features from the target document to create a feature volume further comprises:  obtaining a first plurality of feature vectors from the first multi-modal features, wherein each feature vector from the first plurality of feature vectors is associated with a different feature type (Pages 4 and 11); 
B)  obtaining a second plurality of feature vectors from the second multi-modal features, wherein the second plurality of feature vectors include feature vectors corresponding to feature types of the first plurality of feature vectors (Pages 4 and 11). 
	The examiner notes that Gkoumas teaches “wherein combining, by a multi-modal snippet detection model, first multi-modal features from the query snippet and second multi-modal features from the target document to create a feature volume further comprises:  obtaining a first plurality of feature vectors from the first multi-modal features, wherein each feature vector from the first plurality of feature vectors is associated with a different feature type” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual & image features of a multi-modal query are vectorized (i.e. a first plurality of feature vectors are obtained).  The examiner further notes that Gkoumas teaches “obtaining a second plurality of feature vectors from the second multi-modal features, wherein the second plurality of feature vectors include feature vectors corresponding to feature types of the first plurality of feature vectors” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual & image features of a multi-modal document are vectorized (i.e. a second plurality of feature vectors are obtained).
Gkoumas does not explicitly teach:
C)  generating, by a co-attention module, a plurality of co-attention feature sets by combining feature vectors of like feature types from the first plurality of feature vectors and the second plurality of feature vectors.
	Chen, however, teaches “generating, by a co-attention module, a plurality of co-attention feature sets by combining feature vectors of like feature types from the first plurality of feature vectors and the second plurality of feature vectors” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1).
	The examiner further notes that the secondary reference of Chen teaches the concept of using symmetric attention to concatenate (i.e. combine) query and target features via co-attention.  The combination would result in concatenating the multi-modal query and document feature vectors (of different types) of Gkoumas.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Chen’s would have allowed Gkoumas’s to provide a method for predicting query relevance, as noted by Chen (Section 5.3).

Regarding claim 15, Gkoumas teaches a system comprising: 
A)  a memory component (Sections 1 and 5.1); and 
B)  a processing device coupled to the memory component, the processing device to perform operations comprising: generating, by a plurality of encoders, first multi-modal features for a query snippet and second multi-modal features for a target document (Pages 4 and 11).
The examiner notes that Gkoumas teaches “a memory component” as the Web surrounding us often involves multiple modalities - we read texts, watch images and videos, and listen to sounds. In general terms, modality refers to a certain type of information and/or the representation format in which information is stored. A research problem is characterized as multi-modal when it includes multiple such modalities… representation, called multi-modal fusion, offers a possibility of understanding in-depth real world problems. For instance, in information retrieval, suppose a user types in a text query to retrieve multi-modal documents consisting of an image and a caption as shown in Fig. 1” (Pages 1-2) and “The proposed model is tested on the ImageCLEF2007 data collection [12], the purpose of which is to investigate the effectiveness of combining image and text for retrieval tasks. Out of 60 test queries we randomly picked up 30 ones, together with the ground truth data. Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description. For every query, we created a subset of 300 relevant and irrelevant documents, which includes firstly all the relevant documents for the query, and the rest being irrelevant documents. The dataset is used for investigating both the Bell states (Equations (10) and (11)). The number of relevant documents per query ranges from 11 to 98” (Page 11).  The examiner further notes that a user typing in queries (which can be multi-modal) entails the use of a computer (which includes memory).  The examiner further notes that Gkoumas teaches “a processing device coupled to the memory component, the processing device to perform operations comprising: generating, by a plurality of encoders, first multi-modal features for a query snippet and second multi-modal features for a target document” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual features and image features of a multi-modal query and a multi-modal document teaches are vectorized via the use of textual and image encoders respectively.
	Gkoumas does not explicitly teach:
C)  wherein the first multi-modal features and the second multi-modal features are multi-dimensional representations of the query snippet and the target document;
G)  generating, by a multi-modal snippet detection model, a multi-dimensional feature volume based on the first multi-modal features from the query snippet and the second multi-modal features from the target document;
H)  predicting, by the multi-modal snippet detection model, one or more bounding boxes corresponding to matching snippets from the target document that match the query snippet based on the multi-dimensional feature volume;
I)  generating an augmented target document;
J)  the augmented target document including matching snippets extracted from the target document.
	Chen, however, teaches “wherein the first multi-modal features and the second multi-modal features are multi-dimensional representations of the query snippet and the target document” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Following the Transformer [31], we define a basic attention function f by… where v is an embedded feature vectors of dimension dv, and k, q are embedded feature vectors of dimension dk… we express the attention function by… In this work, we use the default number of attention heads, i.e., h = 8, and dv = dk = dm/h = 64.  Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1), “generating, by a multi-modal snippet detection model, a multi-dimensional feature volume based on the first multi-modal features from the query snippet and the second multi-modal features from the target document” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Following the Transformer [31], we define a basic attention function f by… where v is an embedded feature vectors of dimension dv, and k, q are embedded feature vectors of dimension dk… we express the attention function by… In this work, we use the default number of attention heads, i.e., h = 8, and dv = dk = dm/h = 64.  Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1), “predicting, by the multi-modal snippet detection model, one or more bounding boxes corresponding to matching snippets from the target document that match the query snippet based on the multi-dimensional feature volume” as “we propose the Adaptive Image Transformer (AIT) module that deploys an attention based encoder-decoder architecture to simultaneously explore intra-coder and inter-coder (i.e., each proposal-query pair) attention. The adaptive nature of our design turns out to be flexible and effective in addressing the one-shot learning scenario. With the informative attention cues, the proposed model excels in predicting the class-similarity between the target image proposals and the query image patch” (Abstract), “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Following the Transformer [31], we define a basic attention function f by… where v is an embedded feature vectors of dimension dv, and k, q are embedded feature vectors of dimension dk… we express the attention function by… In this work, we use the default number of attention heads, i.e., h = 8, and dv = dk = dm/h = 64.  Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q… Observe that, owing to the attended target feature F I involving the weighted features between I and Q, it is expected that RPN could generate proposals more relevant to the query Q and hence more suitable for the one-shot object detection task” (Section 3.1), and “Each feature map means the specific feature channel selected from the ResNet-50 stage-4, and our AIT module translates it concerning the query feature and hence generates the translated feature. The results show that the correct proposal (red box in the target image) has a better translated feature quality, i.e., more similar to the query feature” (Section 4.3), “generating an augmented target document” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1), “Figure 5 visualizes the transformed deep visual feature maps in specific channels to realize the advantage of our adaptive image transformer. Each feature map means the specific feature channel selected from the ResNet-50 stage-4, and our AIT module translates it concerning the query feature and hence generates the translated feature. The results show that the correct proposal (red box in the target image) has a better translated feature quality, i.e., more similar to the query feature. As a result, our AIT module helps learn metrics for ranking target proposals. Figure 6 shows the usage of our one-shot object detection model. Our model is able to detect the correct query-class object even with the problematic query-patch covering the partial object regions” (Section 4.3), “The left-most column from top to bottom shows the query patch and target image; the right three columns of the top row show the 254th feature channel corresponding to the low-quality proposal (cyan box in the target image)” (Figure 5), and “Our one-shot object detection model is able to make a target image result in the different detected regions with respect to the different query image patches” (Figure 6), and “the augmented target document including matching snippets extracted from the target document” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1), “Figure 5 visualizes the transformed deep visual feature maps in specific channels to realize the advantage of our adaptive image transformer. Each feature map means the specific feature channel selected from the ResNet-50 stage-4, and our AIT module translates it concerning the query feature and hence generates the translated feature. The results show that the correct proposal (red box in the target image) has a better translated feature quality, i.e., more similar to the query feature. As a result, our AIT module helps learn metrics for ranking target proposals. Figure 6 shows the usage of our one-shot object detection model. Our model is able to detect the correct query-class object even with the problematic query-patch covering the partial object regions” (Section 4.3), “The left-most column from top to bottom shows the query patch and target image; the right three columns of the top row show the 254th feature channel corresponding to the low-quality proposal (cyan box in the target image)” (Figure 5), and “Our one-shot object detection model is able to make a target image result in the different detected regions with respect to the different query image patches” (Figure 6).
	The examiner further notes that the secondary reference of Chen teaches the concept of using symmetric attention to concatenate (i.e. combine) query and target features.  Such concatenated features are used to generate a “feature volume” that is used to subsequently determine query matches at a target (that includes a predicted “bounded box”).  Moreover, Chen teaches an augmented (See the example bounded box in Figures 5-6) target document that depicts one or more matching snippets.  The combination would result in concatenating the multi-modal query and document features of Gkoumas via the use of symmetric attention to generate a “feature volume” to perform query matching resulting in displayed target documents that depict matching one or more snippets.  The combination would result in concatenating the multi-modal query and document features of Gkoumas via the use of symmetric attention to generate a “feature volume” to perform query matching.  
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Chen’s would have allowed Gkoumas’s to provide a method for predicting query relevance, as noted by Chen (Section 5.3).
Gkmoumas and Chen do not explicitly teach:
D)  including text features and spatial features from the query snippet and the target document;
E)  the spatial features representing a layout of the query snippet and the target document.
	Jose, however, teaches “including text features and spatial features from the query snippet and the target document” as “A novel indexing scheme for photographic materials is described. We use spatial features, which are objects and their location, as photographic features” (Abstract), “The various querying mechanisms are explained in the next section and for the sake of this discussion we assume that there are two types of querying: text and picture querying. Searchers can use either of these or both. Thus, a query can also be seen as a document: it also has two components, a picture query component and a text query component” (Section 4, Page 282), and “The upper left part of the figure is a multi-modal query interface and the bottom left is a result viewer. A document viewer is at the right hand side of the interface. The query interface supports various sorts of query mechanisms and the result viewer provides a thumb view of the retrieved documents. A searcher can select documents from the result viewer and view them in the document viewer” (Section 5, Page 288) and “One component of the query interface is a query canvas where searchers can sketch spatial queries. It also has a text-based query interface where searchers can issue text queries. Searchers can also specify their confidence in each query component” (Section 5, Pages 288-289) and “the spatial features representing a layout of the query snippet and the target document” as “A novel indexing scheme for photographic materials is described. We use spatial features, which are objects and their location, as photographic features” (Abstract), “The various querying mechanisms are explained in the next section and for the sake of this discussion we assume that there are two types of querying: text and picture querying. Searchers can use either of these or both. Thus, a query can also be seen as a document: it also has two components, a picture query component and a text query component” (Section 4, Page 282), and “The upper left part of the figure is a multi-modal query interface and the bottom left is a result viewer. A document viewer is at the right hand side of the interface. The query interface supports various sorts of query mechanisms and the result viewer provides a thumb view of the retrieved documents. A searcher can select documents from the result viewer and view them in the document viewer” (Section 5, Page 288) and “One component of the query interface is a query canvas where searchers can sketch spatial queries. It also has a text-based query interface where searchers can issue text queries. Searchers can also specify their confidence in each query component” (Section 5, Pages 288-289).
	The examiner further notes that although Gkmoumas clearly teaches textual features in its multi-modal querying system, there is no explicit teaching of its features including textual features and spatial features.  Nevertheless, the secondary reference of Jose teaches the concept of a multi-modal querying system including textual features and spatial features (which teaches the undefined claimed layout in the broadest reasonable interpretation).  The combination would result in expanding the multi-modal querying systems of Gkmoumas and Chen to expand its features.   
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Jose’s would have allowed Gkoumas’s and Chen’s to provide a method for improving the effectiveness of query systems, as noted by Jose (Section 1, Page 277).
	Gkmoumas, Chen, and Jose do not explicitly teach:
K)  displaying the augmented target document.
	Chae, however, teaches “displaying the augmented target document” as “FIG. 7 is a diagram illustrating how a result of similar image search is displayed in the user terminal 20. As shown in FIG. 7, an image selection screen G1 for a user to select an input image is displayed on the display unit 25 of the user terminal 20. When the user selects an input image from the input form F10 and selects a button B11, the input image is uploaded to the server 10, and the similar image search is performed by the search unit 104.  Subsequently, the display control unit 105 sends the box information of the input image, the image data of the image to be searched, and the box information of the image to be searched to the user terminal 20. Upon receiving the data and information, the user terminal 20 displays a search result screen G2 for displaying the result of the similar image search on the display unit 25. A bounding box B22A is displayed on the input image selected by the user in a display area A20 of the search result screen G2, and bounding boxes B22B and B22C are displayed on each image to be searched a display area A21” (Paragraphs 130-131).
	The examiner further notes that although Chen clearly “augments” its target document via bounded boxes (See Figures 5-6), there is no explicit teaching that such augmented target documents are ever actually displayed to a querying user.  Nevertheless, the secondary reference of Chae teaches the concept of displaying an augmented document (See the example bounded boxes B22B and B22C) as shown in Figure 7.  The combination would result in displaying the augmented target documents of Chen.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Chae’s would have allowed Gkoumas’s and Chen’s to provide a method for readily recognizing similarities in search results, as noted by Chae (Paragraph 155).

Regarding claim 17, Gkoumas further teaches a system comprising:
A)  wherein the operation of generating, by a multi-modal snippet detection model, a multi-dimensional feature volume based on first multi-modal features from the query snippet and second multi-modal features from the target document further comprises: obtaining a first plurality of feature vectors from the first multi-modal features, wherein each feature vector from the first plurality of feature vectors is associated with a different feature type (Pages 4 and 11); 
B)  obtaining a second plurality of feature vectors from the second multi-modal features, wherein the second plurality of feature vectors include feature vectors corresponding to feature types of the first plurality of feature vectors (Pages 4 and 11). 
	The examiner notes that Gkoumas teaches “wherein the operation of generating, by a multi-modal snippet detection model, a multi-dimensional feature volume based on first multi-modal features from the query snippet and second multi-modal features from the target document further comprises: obtaining a first plurality of feature vectors from the first multi-modal features, wherein each feature vector from the first plurality of feature vectors is associated with a different feature type” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual & image features of a multi-modal query are vectorized (i.e. a first plurality of feature vectors are obtained).  The examiner further notes that Gkoumas teaches “obtaining a second plurality of feature vectors from the second multi-modal features, wherein the second plurality of feature vectors include feature vectors corresponding to feature types of the first plurality of feature vectors” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual & image features of a multi-modal document are vectorized (i.e. a second plurality of feature vectors are obtained).
Gkoumas does not explicitly teach:
C)  generating, by a co-attention module, a plurality of co-attention feature sets by combining feature vectors of like feature types from the first plurality of feature vectors and the second plurality of feature vectors.
	Chen, however, teaches “generating, by a co-attention module, a plurality of co-attention feature sets by combining feature vectors of like feature types from the first plurality of feature vectors and the second plurality of feature vectors” as “As the multi-head attention [31] is defined to jointly attend to the information collected in parallel from various representation spaces, we express the attention function by… Given a target image I and a query patch Q, we can leverage (2) to establish the multi-head co-attention (MCA) between I and Q… where the superscript denotes the source of the feature. From (4) and (5), we see that unlike the self-attention mechanism, the MCA feature F I of target image I is obtained by considering feature embeddings from two different sources, I and Q, and the same applies to the other MCA feature F Q of the query patch Q” (Section 3.1).
	The examiner further notes that the secondary reference of Chen teaches the concept of using symmetric attention to concatenate (i.e. combine) query and target features via co-attention.  The combination would result in concatenating the multi-modal query and document feature vectors (of different types) of Gkoumas.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Chen’s would have allowed Gkoumas’s to provide a method for predicting query relevance, as noted by Chen (Section 5.3).
9.	Claims 5-6, 6-13, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Gkoumas et al. (Article entitled “Investigating Non-Classical Correlations Between Decision Fused Multi-Modal Documents”, dated 26 October 2018), in view of Chen et al. (Article entitled “Adaptive Image Transformer for One-Shot Object Detection”, dated 2021), and further in view of Jose et al. (Article entitled “A Retrieval Mechanism for Semi-Structured Photograph Collections”, dated 1997) and further in view of Chae (U.S. PGPUB 2020/0410397) as applied to claims 1-4, 8-11, 15, and 17 above, and further in view of He et al. (Article entitled “Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine Grained Image-Text Retrieval”, dated 15 July 2021).
10.	Regarding claims 5 and 12, Gkoumas further teaches a method and non-transitory computer-readable medium comprising:
A)  obtaining the first plurality of feature vectors from the first multi-modal features (Pages 4 and 11); 
B)  obtaining the second plurality of feature vectors from the second multi-modal features (Pages 4 and 11).
	The examiner notes that Gkoumas teaches “obtaining the first plurality of feature vectors from the first multi-modal features” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual & image features of a multi-modal query are vectorized (i.e. a first plurality of feature vectors are obtained).  The examiner further notes that Gkoumas teaches “obtaining the second plurality of feature vectors from the second multi-modal features” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual & image features of a multi-modal document are vectorized (i.e. a second plurality of feature vectors are obtained).
	Gkoumas, Chen, Jose, and Chae do not explicitly teach:
C)  generating, by a cross-attention module, a plurality of cross-attention feature sets by combining feature vectors of unlike feature types from the first plurality of feature vectors and the second plurality of feature vectors.
	He, however, teaches “generating, by a cross-attention module, a plurality of cross-attention feature sets by combining feature vectors of unlike feature types from the first plurality of feature vectors and the second plurality of feature vectors” as “we design a cross graph attention model to smooth the semantic discrepancy between image and text, and simultaneously enhance the shared shared semantic concepts to refine the feature represenations. First, we combine the feature representation of image and text…we build semantic connections in both inter modalities and intra modalities, whereby four fully connected weighted graphs are obtained: GI→I = (V1, E1), GI→T = (V2, E2), GT→I = (V3, E3) and GT→T = (V4, E4), whereV denotes the node sets of the graph, E represents the associated edge between each node pairs. For graph attention learning, it is noted that the semantic connection of node pairs is bi-directed, which can explicitly characterize the destination and source information for cross-modal matching task…the shared semantic concepts contribute significantly to the learning of commond embedding space, and therefore we combine the four sub-matrix to aggregate the attention features across different modalities” (Section 2.2).
	The examiner further notes that the secondary reference of He teaches the concept of a cross-attention model that combines features of multiple different modalities.  The combination would result in combining the multiple different modality features of Gkoumas.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching He’s would have allowed Gkoumas’s, Chen’s, Jose’s, and Chae’s to provide a method for improving the learning of embedded space of images and text, as noted by He (Section 1).

	Regarding claims 6 and 13, Gkoumas, Chen, Jose, and Chae do not explicitly teach a method and non-transitory computer-readable medium comprising:
A)  generating the feature volume by combining the plurality of co-attention feature sets with the plurality of cross-attention feature sets.
	He, however, teaches “generating the feature volume by combining the plurality of co-attention feature sets with the plurality of cross-attention feature sets” as “we design a cross graph attention model to smooth the semantic discrepancy between image and text, and simultaneously enhance the shared shared semantic concepts to refine the feature represenations. First, we combine the feature representation of image and text…we build semantic connections in both inter modalities and intra modalities, whereby four fully connected weighted graphs are obtained: GI→I = (V1, E1), GI→T = (V2, E2), GT→I = (V3, E3) and GT→T = (V4, E4), whereV denotes the node sets of the graph, E represents the associated edge between each node pairs. For graph attention learning, it is noted that the semantic connection of node pairs is bi-directed, which can explicitly characterize the destination and source information for cross-modal matching task…the shared semantic concepts contribute significantly to the learning of commond embedding space, and therefore we combine the four sub-matrix to aggregate the attention features across different modalities” (Section 2.2).
	The examiner further notes that the secondary reference of He teaches the concept of combining inter modalities (i.e. co-attention) and intra modalities (i.e. cross-attention).  The combination would result in combining the different and same modality features of Gkoumas.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching He’s would have allowed Gkoumas’s, Chen’s, Jose’s, and Chae’s to provide a method for improving the learning of embedded space of images and text, as noted by He (Section 1).

Regarding claim 18, Gkoumas further teaches a system comprising:
A)  wherein the operations further comprise: obtaining the first plurality of feature vectors from the first multi-modal features (Pages 4 and 11); 
B)  obtaining the second plurality of feature vectors from the second multi-modal features (Pages 4 and 11).
	The examiner notes that Gkoumas teaches “wherein the operations further comprise: obtaining the first plurality of feature vectors from the first multi-modal features” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual & image features of a multi-modal query are vectorized (i.e. a first plurality of feature vectors are obtained).  The examiner further notes that Gkoumas teaches “obtaining the second plurality of feature vectors from the second multi-modal features” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual & image features of a multi-modal document are vectorized (i.e. a second plurality of feature vectors are obtained).
	Gkoumas, Chen, Jose, and Chae do not explicitly teach:
C)  generating, by a cross-attention module, a plurality of cross-attention feature sets by combining feature vectors of unlike feature types from the first plurality of feature vectors and the second plurality of feature vectors.
	He, however, teaches “generating, by a cross-attention module, a plurality of cross-attention feature sets by combining feature vectors of unlike feature types from the first plurality of feature vectors and the second plurality of feature vectors” as “we design a cross graph attention model to smooth the semantic discrepancy between image and text, and simultaneously enhance the shared shared semantic concepts to refine the feature represenations. First, we combine the feature representation of image and text…we build semantic connections in both inter modalities and intra modalities, whereby four fully connected weighted graphs are obtained: GI→I = (V1, E1), GI→T = (V2, E2), GT→I = (V3, E3) and GT→T = (V4, E4), whereV denotes the node sets of the graph, E represents the associated edge between each node pairs. For graph attention learning, it is noted that the semantic connection of node pairs is bi-directed, which can explicitly characterize the destination and source information for cross-modal matching task…the shared semantic concepts contribute significantly to the learning of commond embedding space, and therefore we combine the four sub-matrix to aggregate the attention features across different modalities” (Section 2.2).
	The examiner further notes that the secondary reference of He teaches the concept of a cross-attention model that combines features of multiple different modalities.  The combination would result in combining the multiple different modality features of Gkoumas.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching He’s would have allowed Gkoumas’s, Chen’s, Jose’s, and Chae’s to provide a method for improving the learning of embedded space of images and text, as noted by He (Section 1).

	Regarding claim 19, Gkoumas, Chen, Jose, and Chae do not explicitly teach a system comprising:
A)  wherein the operations further comprise: generating the multi-dimensional feature volume by combining the plurality of co-attention feature sets with the plurality of cross-attention feature sets.
	He, however, teaches “wherein the operations further comprise: generating the multi-dimensional feature volume by combining the plurality of co-attention feature sets with the plurality of cross-attention feature sets” as “we design a cross graph attention model to smooth the semantic discrepancy between image and text, and simultaneously enhance the shared shared semantic concepts to refine the feature represenations. First, we combine the feature representation of image and text…we build semantic connections in both inter modalities and intra modalities, whereby four fully connected weighted graphs are obtained: GI→I = (V1, E1), GI→T = (V2, E2), GT→I = (V3, E3) and GT→T = (V4, E4), whereV denotes the node sets of the graph, E represents the associated edge between each node pairs. For graph attention learning, it is noted that the semantic connection of node pairs is bi-directed, which can explicitly characterize the destination and source information for cross-modal matching task…the shared semantic concepts contribute significantly to the learning of commond embedding space, and therefore we combine the four sub-matrix to aggregate the attention features across different modalities” (Section 2.2).
	The examiner further notes that the secondary reference of He teaches the concept of combining inter modalities (i.e. co-attention) and intra modalities (i.e. cross-attention).  The combination would result in combining the different and same modality features of Gkoumas to generate the multi-dimensional feature volume of Chen.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching He’s would have allowed Gkoumas’s, Chen’s, Jose’s, and Chae’s to provide a method for improving the learning of embedded space of images and text, as noted by He (Section 1).
11.	Claims 7, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Gkoumas et al. (Article entitled “Investigating Non-Classical Correlations Between Decision Fused Multi-Modal Documents”, dated 26 October 2018), in view of Chen et al. (Article entitled “Adaptive Image Transformer for One-Shot Object Detection”, dated 2021), and further in view of Jose et al. (Article entitled “A Retrieval Mechanism for Semi-Structured Photograph Collections”, dated 1997) and further in view of Chae (U.S. PGPUB 2020/0410397) as applied to claims 1-4, 8-11, 15, and 17 above, and further in view of Yang et al. (Article entitled “Balanced and Hierarchical Relation Learning for One-shot Object Detection”, dated 24 June 2022), and further in view of Ren et al. (Article entitled “Faster R-CNN:  Towards Real-Time Object Detection with Region Proposal Networks”, dated 2015).
12.	Regarding claims 7 and 14, Gkoumas, Chen, Jose, and Chae do not explicitly teach a method and non-transitory computer-readable medium comprising:
A)  wherein identifying, by the multi-modal snippet detection model, one or more matching snippets from the target document based on the feature volume further comprises:  identifying hierarchical features from the feature volume. 
	Yang, however, teaches “wherein identifying, by the multi-modal snippet detection model, one or more matching snippets from the target document based on the feature volume further comprises:  identifying hierarchical features from the feature volume” as “we first introduce a novel Instance-level Hierarchical Relation (IHR) module that can infer multilevel semantic relations for generating query-target similarity features. Specifically, we initially use region proposal network to extract instance-level feature maps. Then, the IHR module decomposes query-target feature matching into three hierarchical semantic levels, which are responsible to capture the global difference, local salient region, and local discriminative part, respectively. The global difference reveals that the target object should be described by using its contrastive characteristics when being compared with the query object” (Section 1) and “the proposed IHR module eliminates the above shortcomings and adopts a hierarchical manner to comprehensively describe the semantic relations” (Section 3.3).
	The examiner further notes that the secondary reference of Yang teaches the concept of Yang teaches the concept of ascertaining hierarchal features (which are undefined in the claims) in a query environment.  The combination would result in ascertaining the hierarchical features of the feature volume of Gkoumas.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Yang’s would have allowed Gkoumas’s, Chen’s, and Chae’s to provide a method to tie query semantics and a target in a compositional way, as noted by Yang (Section 1).
	Gkoumas, Chen, Jose, Chae, and Yang do not explicitly teach:
B)  determining, by a region proposal network, one or more regions of the target document based on the hierarchical features; and 
C)  determining, by a region of interest network, bounding data associated with the one or more regions of interest, the bounding data corresponding to the one or more matching snippets from the target document.
	Ren, however, teaches “determining, by a region proposal network, one or more regions of the target document based on the hierarchical features” as “An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look” (Abstract) and “Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions… the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared” (Section 3) and “determining, by a region of interest network, bounding data associated with the one or more regions of interest, the bounding data corresponding to the one or more matching snippets from the target document” as “An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look” (Abstract) and “our method achieves bounding-box regression by a different manner from previous RoIbased (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors” (Section 3.1.2).
	The examiner further notes that the secondary reference of Ren teaches the concept of using a Region proposal network (which is undefined in the claims) and a region of interest network (which is undefined in the claims) to determine regions and predict bounding boxes.  The combination would result in using the hierarchical features of Yang on the feature volume of Gkoumas to predict bounding boxes.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Ren’s would have allowed Gkoumas’s, Chen’s, Jose’s, Chae’s, and Yang’s to provide a more efficient and effective region proposal network, as noted by Ren (Section 5).

Regarding claim 20, Gkoumas, Chen, Jose, and Chae do not explicitly teach a system comprising:
A)  wherein the operation of predicting, by the multi-modal snippet detection model, one or more bounding boxes corresponding to matching snippets from the target document based on the multi-dimensional feature volume further comprises: identifying hierarchical features from the multi-dimensional feature volume. 
	Yang, however, teaches “wherein the operation of predicting, by the multi-modal snippet detection model, one or more bounding boxes corresponding to matching snippets from the target document based on the multi-dimensional feature volume further comprises: identifying hierarchical features from the multi-dimensional feature volume” as “we first introduce a novel Instance-level Hierarchical Relation (IHR) module that can infer multilevel semantic relations for generating query-target similarity features. Specifically, we initially use region proposal network to extract instance-level feature maps. Then, the IHR module decomposes query-target feature matching into three hierarchical semantic levels, which are responsible to capture the global difference, local salient region, and local discriminative part, respectively. The global difference reveals that the target object should be described by using its contrastive characteristics when being compared with the query object” (Section 1) and “the proposed IHR module eliminates the above shortcomings and adopts a hierarchical manner to comprehensively describe the semantic relations” (Section 3.3).
	The examiner further notes that the secondary reference of Yang teaches the concept of Yang teaches the concept of ascertaining hierarchal features (which are undefined in the claims) in a query environment.  The combination would result in ascertaining the hierarchical features of the feature volume of Gkoumas and the multi-dimensional feature volume of Chen.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Yang’s would have allowed Gkoumas’s, Chen’s, Jose’s, and Chae’s to provide a method to tie query semantics and a target in a compositional way, as noted by Yang (Section 1).
	Gkoumas, Chen, Jose, Chae, and Yang do not explicitly teach:
B)  determining, by a region proposal network; one or more regions of the target document based on the hierarchical features; and 
C)  determining, by a region of interest network, the one or more bounding boxes based on the one or more regions of interest.
	Ren, however, teaches “determining, by a region proposal network; one or more regions of the target document based on the hierarchical features” as “An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look” (Abstract) and “Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions… the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared” (Section 3) and “determining, by a region of interest network, the one or more bounding boxes based on the one or more regions of interest” as “An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look” (Abstract) and “our method achieves bounding-box regression by a different manner from previous RoIbased (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors” (Section 3.1.2).
	The examiner further notes that the secondary reference of Ren teaches the concept of using a Region proposal network (which is undefined in the claims) and a region of interest network (which is undefined in the claims) to determine regions and predict bounding boxes.  The combination would result in using the hierarchical features of Yang on the feature volume of Gkoumas to predict bounding boxes.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Ren’s would have allowed Gkoumas’s, Chen’s, Jose’s, Chae’s, and Yang’s to provide a more efficient and effective region proposal network, as noted by Ren (Section 5).
13.	Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Gkoumas et al. (Article entitled “Investigating Non-Classical Correlations Between Decision Fused Multi-Modal Documents”, dated 26 October 2018), in view of Chen et al. (Article entitled “Adaptive Image Transformer for One-Shot Object Detection”, dated 2021), and further in view of Jose et al. (Article entitled “A Retrieval Mechanism for Semi-Structured Photograph Collections”, dated 1997) and further in view of Chae (U.S. PGPUB 2020/0410397) as applied to claims 1-4, 8-11, 15, and 17 above, and further in view of Huang et al. (Article entitled “LayoutLMv3:  Pre-Training for Document AI with Unified Text and Image Masking”, dated 19 July 2022.
14.	Regarding claim 16, Gkoumas further teaches a system comprising:
A)  wherein the plurality of encoders includes a text encoder, an image encoder (Pages 4 and 11).
	The examiner notes that Gkoumas teaches “wherein the plurality of encoders includes a text encoder, an image encoder” as “At first, we calculate the probability of relevance for each document, with respect to both text-based and image-based modality concerning a multimodal query as shown in Fig. 5” (Page 4), “Each query describing user information need consists of three sample images and a text description, whereas each document consists of an image and a text description” (Page 11), and “feature extraction consists of using the representations learned by the VGG16 model [18], with weights pre-trained on ImageNet to extract features from images, resulting in a feature vector of 2048 floating values for each image. After feature vector extractions, we compute the similarity scores between a submitted visual query and images in the dataset based on Cosine function. For textual information, a query expansion approach has been applied extending the query with the ten most frequent terms according to the ground truth text-based documents. This indeed corresponds to a simulated explicit relevance feedback scenario. Then, the TF-IDF vector representation is used for calculating the text-based Cosine similarity between the a query and text documents” (Page 11).  The examiner further notes that extracted textual features and image features of a multi-modal query and a multi-modal document teaches are vectorized via the use of textual and image encoders respectively.
	Gkoumas, Chen, Jose, and Chae do not explicitly teach:
A)  wherein the plurality of encoders includes a layout encoder.
	Huang, however, teaches “wherein the plurality of encoders includes a layout encoder” as “LayoutLM incorporates layout information by adding word-level spatial embeddings to embeddings of BERT [49]…We extract text and layout information using Microsoft Read API. We fine-tune LayoutLMv3 for 20,000 steps with a batch size of 64 and a learning rate of 2𝑒 − 5” (Section 3.3) and “LayoutLMv3 is a pre-trained multimodal Transformer for Document AI with unified text and image masking objectives. Given an input document image and its corresponding text and layout position information, the model takes the linear projection of patches and word tokens as inputs and encodes them into contextualized vector representations” (Figure 3).
	The examiner further notes that the secondary reference of Huang teaches the concept of encoding layout information (i.e. the use of a layout encoder).  The combination would result in expanding Gkoumas to also encode layout information.
It would have been obvious to one of ordinary skill in the art before the effective filing date of instant invention to combine the teachings of the cited references because teaching Huang’s would have allowed Gkoumas’s, Chen’s, Jose’s, and Chae’s to provide a method to for improving QA performance, as noted by Huang (Abstract).
Response to Arguments
15.	Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument (See newly applied secondary reference of Jose).
Conclusion
16.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
U.S. PGPUB 2024/0331423 issued to Zeng et al. on 03 October 2024.  The subject matter disclosed therein is pertinent to that of claims 1-20 (e.g., methods to query target documents).
U.S. PGPUB 2007/0067345 issued to Li et al. on 22 March 2007.  The subject matter disclosed therein is pertinent to that of claims 1-20 (e.g., methods to query target documents).
Contact Information
17.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to Mahesh Dwivedi whose telephone number is (571) 272-2731.  The examiner can normally be reached on Monday to Friday 8:20 am – 4:40 pm.
	If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Charles Rones can be reached (571) 272-4085.  The fax number for the organization where this application or proceeding is assigned is (571) 273-8300.
	Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).


Mahesh Dwivedi
Primary Examiner
Art Unit 2168

March 23, 2026
/MAHESH H DWIVEDI/Primary Examiner, Art Unit 2168
Read full office action
Prosecution Timeline

Show 3 earlier events
Oct 01, 2025
Response Filed
Oct 01, 2025
Applicant Interview (Telephonic)
Oct 24, 2025
Final Rejection mailed — §103
Jan 21, 2026
Applicant Interview (Telephonic)
Jan 21, 2026
Examiner Interview Summary
Jan 21, 2026
Request for Continued Examination
Jan 28, 2026
Response after Non-Final Action
Mar 25, 2026
Non-Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/795,142
Patent 12639257
FILE SYSTEM CONTENT ARCHIVING BASED ON THIRD-PARTY APPLICATION ARCHIVING RULES AND METADATA
1y 9m to grant Granted May 26, 2026
18/459,144
Patent 12626160
MANAGING IMPACT OF POISONED INFERENCES ON DEPLOYMENTS OF HARDWARE TO DOWNSTREAM CONSUMERS
2y 8m to grant Granted May 12, 2026
18/809,378
Patent 12613837
SYSTEM AND METHOD FOR CLOUD-BASED READ-ONLY FOLDER SYNCHRONIZATION
1y 8m to grant Granted Apr 28, 2026
18/539,424
Patent 12608403
EXTRACTION MACHINE LEARNING FRAMEWORK
2y 4m to grant Granted Apr 21, 2026
18/171,704
Patent 12591818
FORECASTING AND MITIGATING CONCEPT DRIFT USING NATURAL LANGUAGE PROCESSING
3y 1m to grant Granted Mar 31, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

3-4
Expected OA Rounds
69%
Grant Probability
74%
With Interview (+4.5%)
3y 7m (~8m remaining)
Median Time to Grant
High
PTA Risk
Based on 754 resolved cases by this examiner. Grant probability derived from career allowance rate.
ONE-SHOT DOCUMENT SNIPPET SEARCH

Interview Optional

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Precedent Cases

Applications granted by this same examiner with similar technology

Strategy Recommendation AI-generated — please review before filing

Prosecution Projections

Ready to respond to this office action?

Sign in with your work email