Prosecution Insights
Last updated: April 19, 2026
Application No. 18/536,121

Dual Attention Network Using Transformers for Cross-Modal Retrieval

Non-Final OA — §102, §103
Filed
Dec 11, 2023
Examiner
JIA, XIN
Art Unit
2663
Tech Center
2600 — Communications
Assignee
Mayo Foundation for Medical Education and Research
OA Round
1 (Non-Final)
Grant Probability: 85% — Favorable
Expected OA Rounds: 1-2
Time to Grant: 2y 6m
With Interview: 98%

Examiner Intelligence

Career Allow Rate: 85% (510 granted / 601 resolved) — above average, +22.9% vs TC avg
Interview Lift: +12.8% across resolved cases with interview — moderate lift
Typical Timeline: 2y 6m average prosecution; 23 applications currently pending
Career History: 624 total applications across all art units

Statute-Specific Performance

§101: 3.2% (-36.8% vs TC avg)
§103: 73.2% (+33.2% vs TC avg)
§102: 7.8% (-32.2% vs TC avg)
§112: 6.3% (-33.7% vs TC avg)
Deltas are relative to the Tech Center average estimate • Based on career data from 601 resolved cases
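The "vs TC avg" deltas above can be reproduced mechanically: each figure is the examiner's statute-specific share minus the Tech Center average for that statute. A minimal sketch, assuming the dashboard's one-decimal rounding; the Tech Center averages are back-computed from the displayed deltas, so this is illustrative, not an official data source:

```python
# Back-compute the Tech Center average implied by each displayed delta:
# TC average = examiner share - delta.
EXAMINER_SHARE = {"101": 3.2, "103": 73.2, "102": 7.8, "112": 6.3}
DISPLAYED_DELTA = {"101": -36.8, "103": +33.2, "102": -32.2, "112": -33.7}

def implied_tc_average(statute: str) -> float:
    """Tech Center average implied by the share and its displayed delta."""
    return round(EXAMINER_SHARE[statute] - DISPLAYED_DELTA[statute], 1)

for statute in ("101", "103", "102", "112"):
    print(statute, implied_tc_average(statute))
```

Curiously, every implied Tech Center average works out to 40.0 here, which suggests the dashboard benchmarks all four statutes against a single estimate.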

Office Action

§102 §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claim(s) 1-7 and 10 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kollada (PGPUB: 20220392637 A1).

Regarding claim 1. Kollada teaches a method for extracting features from cross-modal data from an input data source, the method comprising: (a) accessing, with a computer system, first input data of a first modality (see Fig. 1B, item 101, first data modality); (b) accessing, with the computer system, second input data of a second modality (see Fig. 1B, item 103, second data modality); (c) accessing, with the computer system, a dual attention network trained on training data to extract feature data from cross-modal input data (see Figs. 2, 5A, and 6, paragraphs 111, 118, and 119-120, the processor executing the method 500 includes a trained multimodal dynamic fusion model, such as model 500 at FIG. 5A; the method 600 includes generating unimodal dynamic embeddings via unimodal dynamic encoders, which are 1D convolutional encoders for capturing timing information across all features from all input modalities. For example, upon extracting mental health features from each dataset, the features from each dataset may be passed through a corresponding trained dynamic encoding subnetwork (e.g., FIG. 3B) to generate unimodal dynamic embeddings for each dataset.
For example, a set of mental features extracted from a dataset may be input into a trained dynamic encoding neural network to generate unimodal dynamic embeddings, which are vector representations of the input features including timing information for a given modality; the transformer applies a self-attention mechanism that directly models relationships between input vectors. The transformer allows for significantly more parallelization. The transformer encodes each position and applies the attention mechanism to relate input features at each time-step, which may be parallelized for all time-steps; the method 600 proceeds to 614 to generate a low dimensional representation. In one example, a cross-attention mechanism may be utilized to generate the low dimensional representation. In other examples, any other dimensionality reduction method may be implemented. In particular, since the interactions between the different modalities are captured in the high dimensional representation, any dimensionality reduction mechanism may be used and the interacting features for mental health determination would still be preserved); (d) inputting the first input data and the second input data to the dual attention network by the computer system (see Fig. 5A, items 101, 103, …105), generating outputs as feature data comprising feature representations of the first modality and the second modality (see Fig. 10, paragraph 156, a device comprises a modality processing logic to process data output from at least two types of sensors to output a set of data representations for each of the at least two types of sensors, wherein each of the set of data representations comprising a vector comprising a set of features; modality combination logic to process the set of data representation to output a combined data representation); and (e) storing the feature data or displaying the feature data to a user with the computer system (see paragraph 71, one or more selected features (e.g., features having a relevance or importance to a current mental health evaluation greater than a threshold relevance) and corresponding time points may be displayed along with the mental health output. For example, the mental health evaluation output and one or more SHAP-based indications may be provided. The one or more SHAP indications may include selected features contributing to the output and/or plots of SHAP scores for all input features for all input data modalities (with or without selected feature highlights) may be displayed to the clinician on a clinician user interface).

Regarding claim 2. Kollada teaches the method of claim 1, wherein the first input data comprise text data and the first modality comprises textual information, and the second input data comprise image data and the second modality comprises image information (see Kollada, Fig. 1A, paragraph 53, when data from at least two sensors are acquired, the data from the at least two sensors may be acquired simultaneously. In another example, when at least two data modalities are obtained (e.g., video data, audio data, utterance (that is, text/language data from speech to text conversion)), the data modalities are acquired simultaneously).

Regarding claim 3.
Kollada teaches the method of claim 1, wherein the first input data comprise first feature representation data comprising feature representations extracted from a dataset of the first modality (see Kollada, Figs. 3 and 4, paragraph 97, a rich representation of EEG features corresponding to mental health condition may be generated from EEG data from an EEG sensor; a rich representation of text features associated with mental condition may be generated using text data corresponding to spoken language (or based on user input entered via a user input device); and so on. Feature extraction may be performed using a trained neural network model or any feature extraction method depending on the modality data and/or features extracted from the modality data, where the extracted features include markers for mental health evaluation).

Regarding claim 4. Kollada teaches the method of claim 3, wherein the first feature representation data are extracted from the dataset of the first modality using a first transformer model (see Kollada, Figs. 4 and 5, paragraph 100, the multimodal fusion model 500 comprises unimodal encoders 404, 414, 424 for generating unimodal dynamic embeddings. As discussed above, unimodal encoders include stacked convolution blocks (1D convolution layer followed by batch normalization and ReLU activation) for extracting dynamic embeddings from each of the individual modalities. The dynamic embeddings are concatenated (502) and passed on to a Transformer Encoder, which performs dynamic attention fusion (512) to learn a multimodal representation for classification).

Regarding claim 5. Kollada teaches the method of claim 3, wherein the second input data comprise second feature representation data comprising feature representations extracted from a dataset of the second modality (see Kollada, Figs. 4 and 5, paragraph 100, the multimodal fusion model 500 comprises unimodal encoders 404, 414, 424 for generating unimodal dynamic embeddings.
As discussed above, unimodal encoders include stacked convolution blocks (1D convolution layer followed by batch normalization and ReLU activation) for extracting dynamic embeddings from each of the individual modalities. The dynamic embeddings are concatenated (502) and passed on to a Transformer Encoder, which performs dynamic attention fusion (512) to learn a multimodal representation for classification).

Regarding claim 6. Kollada teaches the method of claim 5, wherein the second feature representation data are extracted from the dataset of the second modality using a second transformer model (see Kollada, Fig. 4s, paragraph 98, Audio features 402, video features 412, and text features 422 may be extracted from audio, video, and text data at a desired time resolution (e.g., 0.1 sec). Further, audio dynamic embeddings 406, video dynamic embeddings 416, and text dynamic embeddings 426 may be generated using an audio dynamic encoder 404, a video dynamic encoder 414, and a text dynamic encoder 424 respectively).

Regarding claim 7. Kollada teaches the method of claim 1, wherein the first modality comprises a text modality and the second modality comprises an image modality (see Kollada, Fig. 4B, 4C, and Fig. 5s).

Regarding claim 10. Kollada teaches the method of claim 7, wherein the second input data comprise second feature representation data comprising feature representations extracted from a dataset of the second modality using a vision transformer model (see Kollada, Fig. 4s and 5s, paragraphs 100 and 101, the multimodal fusion model 500 comprises unimodal encoders 404, 414, 424 for generating unimodal dynamic embeddings. As discussed above, unimodal encoders include stacked convolution blocks (1D convolution layer followed by batch normalization and ReLU activation) for extracting dynamic embeddings from each of the individual modalities.
The dynamic embeddings are concatenated (502) and passed on to a Transformer Encoder, which performs dynamic attention fusion (512) to learn a multimodal representation for classification; the transformer encoder 550 comprises stacked transformer encoder blocks. Each transformer encoder block includes a Multi-Head Self Attention layer 556 and a feedforward network 560. In particular, the transformer encoder comprises multi-head self-attention blocks. The self-attention mechanism encodes the context for each temporal step. This is done through learning a key, query and value for each step and the context is learnt through passing the query and key to a mathematical function (usually matrix multiplication followed by SoftMax). Rather than only computing the attention once, the multi-head mechanism runs through the attention multiple times in parallel).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 15-17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kollada (PGPUB: 20220392637 A1) in view of KOLLADA (WO 2022232382 A1).

Regarding claim 15. Kollada does not expressly teach the method of claim 1, wherein the dual attention network comprises a first attention network associated with the first modality and a second attention network associated with the second modality.
KOLLADA teaches that the first attention based subnetwork 342 receives the first modality embedding 332 as input and outputs a first modality modified embedding 352, the second attention based subnetwork 344 receives the second modality embedding 334 as input and outputs a second modality modified embedding 354, and so on until the Nth attention based subnetwork 346 receives the Nth modality embedding 356 and outputs an Nth modality modified embedding (see Fig. 3, paragraph 96). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kollada by KOLLADA to obtain the first attention based subnetwork 342 receives the first modality embedding 332 as input and the second attention based subnetwork 344 receives the second modality embedding 334 as input, in order to provide wherein the dual attention network comprises a first attention network associated with the first modality and a second attention network associated with the second modality. Therefore, combining the elements from the prior art according to known methods and techniques would yield predictable results.

Regarding claim 16. Kollada teaches the method of claim 15, wherein each of the first attention network and the second attention network include a first multi-head self-attention module that is configured to learn about the first modality and a second multi-head self-attention module that is configured to learn about the second modality (see Kollada, Fig. 4s and 5s, paragraphs 100 and 101, the multimodal fusion model 500 comprises unimodal encoders 404, 414, 424 for generating unimodal dynamic embeddings. As discussed above, unimodal encoders include stacked convolution blocks (1D convolution layer followed by batch normalization and ReLU activation) for extracting dynamic embeddings from each of the individual modalities.
The dynamic embeddings are concatenated (502) and passed on to a Transformer Encoder, which performs dynamic attention fusion (512) to learn a multimodal representation for classification; the transformer encoder 550 comprises stacked transformer encoder blocks. Each transformer encoder block includes a Multi-Head Self Attention layer 556 and a feedforward network 560. In particular, the transformer encoder comprises multi-head self-attention blocks. The self-attention mechanism encodes the context for each temporal step. This is done through learning a key, query and value for each step and the context is learnt through passing the query and key to a mathematical function (usually matrix multiplication followed by SoftMax). Rather than only computing the attention once, the multi-head mechanism runs through the attention multiple times in parallel; see Fig. 6, paragraph 118, generating unimodal dynamic embeddings via unimodal dynamic encoders, which are 1D convolutional encoders for capturing timing information across all features from all input modalities. For example, upon extracting mental health features from each dataset, the features from each dataset may be passed through a corresponding trained dynamic encoding subnetwork (e.g., FIG. 3B) to generate unimodal dynamic embeddings for each dataset).

Regarding claim 17. Kollada teaches the method of claim 16, wherein an output of the first multi-head self-attention module and an output of the second multi-head self-attention module (see Kollada, Fig. 5s, paragraph 101, the transformer encoder 550 comprises stacked transformer encoder blocks. Each transformer encoder block includes a Multi-Head Self Attention layer 556 and a feedforward network 560. In particular, the transformer encoder comprises multi-head self-attention blocks. The self-attention mechanism encodes the context for each temporal step.
This is done through learning a key, query and value for each step and the context is learnt through passing the query and key to a mathematical function) are input to a cross attention module to identify alignments between the first modality and the second modality (see KOLLADA, Fig. 3B, paragraphs 96 and 98, the first attention based subnetwork 342 receives the first modality embedding 332 as input and outputs a first modality modified embedding 352, the second attention based subnetwork 344 receives the second modality embedding 334 as input and outputs a second modality modified embedding 354, and so on until the Nth attention based subnetwork 346 receives the Nth modality embedding 356 and outputs an Nth modality modified embedding. Each modified modality embedding includes context information relevant to each modality. In this way, by passing each modality embedding through a multi-head self-attention mechanism, contextualized unimodal representations (that is, modified embeddings) may be generated; the post-fusion module 371 may be implemented by a cross attention mechanism. For example, given a number of input streams (m), where the input streams can be individual modalities).

Claim(s) 8-9, 14, and 18-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kollada (PGPUB: 20220392637 A1) in view of KUNZ (20230131675 A1).

Regarding claim 8. Kollada does not expressly teach the method of claim 7, wherein the image modality comprises histopathological images. KUNZ teaches that determining a correct treatment type and treatment amount for a patient may be challenging. In particular, determining an effective treatment for a previously untreated patient might be difficult, especially when the treatment determination is determined based on the analysis of digital medical images (e.g., histopathological slides sampled from the patient).
Techniques disclosed herein may support such determining by, for example, recommending amounts/dosages of a single or potential combination of treatments (e.g., drugs, medical interventions, etc.) for treating an untreated patient based on one or more digital medical images (see paragraph 35); according to implementations of the embedding representation module, one or more modalities may be received as input(s). A modality may refer to any type of input data received by the system such as a digital medical image, a synoptic report, and/or information relating to treatment (see Fig. 4, paragraph 80). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kollada by KUNZ to obtain the treatment determination is determined based on the analysis of digital medical images (e.g., histopathological slides sampled from the patient) and one or more modalities may be received as input(s). A modality may refer to any type of input data received by the system such as a digital medical image, in order to provide wherein the image modality comprises histopathological images. Therefore, combining the elements from the prior art according to known methods and techniques would yield predictable results.

Regarding claim 9. Kollada teaches the method of claim 8, wherein the dataset of the second modality comprises whole slide images (see KUNZ, paragraph 35, which teaches that determining a correct treatment type and treatment amount for a patient may be challenging. In particular, determining an effective treatment for a previously untreated patient might be difficult, especially when the treatment determination is determined based on the analysis of digital medical images (e.g., histopathological slides sampled from the patient).
Techniques disclosed herein may support such determining by, for example, recommending amounts/dosages of a single or potential combination of treatments (e.g., drugs, medical interventions, etc.) for treating an untreated patient based on one or more digital medical images).

Regarding claim 14. Kollada does not expressly teach the method of claim 1, wherein the first input data comprise genomic data and the first modality comprises genomic information. KUNZ teaches that the treatment recommendation module 136 may receive one or more digital medical images as input. In one example, the inputted digital medical images may include sets of images, wherein each set is for an individual patient. The sets may include digital medical images of the same one or more medical specimen at various times. The system may further receive metadata associated with the inputted digital medical image. In one example, the metadata may include time stamps and metadata to distinguish which medical digital images belong to the same individual and what time/location the sampling of the medical image took place. Further, the metadata may include the tissue type for the inputted images. In another example, the metadata may include a list of past treatments and dosages of treatment for each medical slide over time. The metadata may further include clinical data about the patient over time, such as past treatment dosage levels (if any), time between past treatments, and genomic information for the one or more individuals associated with the digital medical images inputted (see Fig. 1s, paragraph 119). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kollada by KUNZ to obtain that the metadata may include a list of past treatments and dosages of treatment for each medical slide over time.
The metadata may further include clinical data about the patient over time, such as past treatment dosage levels (if any), time between past treatments, and genomic information for the one or more individuals associated with the digital medical images inputted, in order to provide wherein the first input data comprise genomic data and the first modality comprises genomic information. Therefore, combining the elements from the prior art according to known methods and techniques would yield predictable results.

Regarding claim 18. Kollada teaches the method of claim 1, wherein the first input data comprises text data and the second input data comprises image data (see Kollada, Fig. 4s, paragraph 98, Audio features 402, video features 412, and text features 422 may be extracted from audio, video, and text data at a desired time resolution (e.g., 0.1 sec). Further, audio dynamic embeddings 406, video dynamic embeddings 416, and text dynamic embeddings 426 may be generated using an audio dynamic encoder 404, a video dynamic encoder 414, and a text dynamic encoder 424 respectively), further comprising generating a report based on the feature data using the computer system (see Kollada, Fig. 10, paragraph 156, a device comprises a modality processing logic to process data output from at least two types of sensors to output a set of data representations for each of the at least two types of sensors, wherein each of the set of data representations comprising a vector comprising a set of features; modality combination logic to process the set of data representation to output a combined data representation). However, Kollada does not expressly teach the image data comprising histopathological images. KUNZ teaches that determining a correct treatment type and treatment amount for a patient may be challenging.
In particular, determining an effective treatment for a previously untreated patient might be difficult, especially when the treatment determination is determined based on the analysis of digital medical images (e.g., histopathological slides sampled from the patient). Techniques disclosed herein may support such determining by, for example, recommending amounts/dosages of a single or potential combination of treatments (e.g., drugs, medical interventions, etc.) for treating an untreated patient based on one or more digital medical images (see paragraph 35); according to implementations of the embedding representation module, one or more modalities may be received as input(s). A modality may refer to any type of input data received by the system such as a digital medical image, a synoptic report, and/or information relating to treatment (see Fig. 4, paragraph 80). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kollada by KUNZ to obtain the treatment determination is determined based on the analysis of digital medical images (e.g., histopathological slides sampled from the patient) and one or more modalities may be received as input(s). A modality may refer to any type of input data received by the system such as a digital medical image, in order to provide wherein the image modality comprises histopathological images. Therefore, combining the elements from the prior art according to known methods and techniques would yield predictable results.

Regarding claim 19.
Kollada teaches the method of claim 18, wherein the report comprises a diagnostic report based on cross-modal features in the feature data (see Kollada, paragraph 156, a device comprises a modality processing logic to process data output from at least two types of sensors to output a set of data representations for each of the at least two types of sensors, wherein each of the set of data representations comprising a vector comprising a set of features; modality combination logic to process the set of data representation to output a combined data representation; diagnosis determination logic to determine a health diagnosis based on the relevance of the combined data representation to a mental health diagnosis; and a feature relevance determination logic to process the mental health diagnosis and determine a relevance of each of the set of features to the mental health diagnosis).

Regarding claim 20. Kollada teaches the method of claim 1, wherein the first input data comprises text data and the second input data comprises image data (see Kollada, Fig. 4s, paragraph 98, Audio features 402, video features 412, and text features 422 may be extracted from audio, video, and text data at a desired time resolution (e.g., 0.1 sec). Further, audio dynamic embeddings 406, video dynamic embeddings 416, and text dynamic embeddings 426 may be generated using an audio dynamic encoder 404, a video dynamic encoder 414, and a text dynamic encoder 424 respectively), further comprising retrieving additional image data from a database based on the feature data (see Kollada, paragraph 90, the server 234 may include a multimodal database 232 for storing the plurality of modality data for each patient. The multimodal database may also store plurality of training and/or validation datasets for training and/or validating the multimodal fusion model for performing mental health evaluation.
Further, the mental health evaluation output from the multimodal fusion model 238 may be stored at the multimodal database 232. Additionally, or alternatively, the mental health evaluation output may be transmitted from the server to the computing device, and displayed and/or stored at the computing device 212). However, Kollada does not expressly teach the image data comprising histopathological images. KUNZ teaches that determining a correct treatment type and treatment amount for a patient may be challenging. In particular, determining an effective treatment for a previously untreated patient might be difficult, especially when the treatment determination is determined based on the analysis of digital medical images (e.g., histopathological slides sampled from the patient). Techniques disclosed herein may support such determining by, for example, recommending amounts/dosages of a single or potential combination of treatments (e.g., drugs, medical interventions, etc.) for treating an untreated patient based on one or more digital medical images (see paragraph 35); according to implementations of the embedding representation module, one or more modalities may be received as input(s). A modality may refer to any type of input data received by the system such as a digital medical image, a synoptic report, and/or information relating to treatment (see Fig. 4, paragraph 80). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kollada by KUNZ to obtain the treatment determination is determined based on the analysis of digital medical images (e.g., histopathological slides sampled from the patient) and one or more modalities may be received as input(s). A modality may refer to any type of input data received by the system such as a digital medical image, in order to provide wherein the image modality comprises histopathological images.
Therefore, combining the elements from the prior art according to known methods and techniques would yield predictable results.

Allowable Subject Matter

Claims 11-13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIN JIA whose telephone number is (571)270-5536. The examiner can normally be reached 9:00 am - 7:30 pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Gregory Morse, can be reached at (571)272-3838. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/XIN JIA/Primary Examiner, Art Unit 2663
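For orientation, the attention mechanism the rejection repeatedly quotes from the cited art (a query, key, and value learned per time step; matrix multiplication followed by SoftMax; cross attention aligning two modalities) is standard scaled dot-product attention. The numpy sketch below illustrates that mechanism only; it is not the applicant's claimed dual attention network, and sharing one set of projection weights across modules is a simplification for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable SoftMax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: relate queries to keys, then take a
    SoftMax-weighted sum of the values (matrix multiplication followed by
    SoftMax, as the quoted passages describe)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) pairwise relations
    return softmax(scores) @ V                # context-weighted values

def self_attend(X, Wq, Wk, Wv):
    # Each time step of one modality learns its own query, key, and value.
    return attention(X @ Wq, X @ Wk, X @ Wv)

def cross_attend(A, B, Wq, Wk, Wv):
    # Queries come from modality A, keys/values from modality B,
    # identifying alignments between the two modalities.
    return attention(A @ Wq, B @ Wk, B @ Wv)

rng = np.random.default_rng(0)
T, d = 5, 8                                   # 5 time steps, 8-dim features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
text_feats = rng.normal(size=(T, d))          # first modality (illustrative)
image_feats = rng.normal(size=(T, d))         # second modality (illustrative)

text_ctx = self_attend(text_feats, Wq, Wk, Wv)
image_ctx = self_attend(image_feats, Wq, Wk, Wv)
fused = cross_attend(text_ctx, image_ctx, Wq, Wk, Wv)
print(fused.shape)   # (5, 8): one fused vector per time step
```

Multi-head attention, mentioned in the quoted paragraphs, simply runs several such attention computations in parallel with separate weights and concatenates the results.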

Prosecution Timeline

Dec 11, 2023
Application Filed
Feb 25, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602786
FREE FLUID ESTIMATION
2y 5m to grant • Granted Apr 14, 2026
Patent 12602782
IMAGE PROCESSING DEVICE, IMAGE PROCESSING METHOD, AND STORAGE MEDIUM
2y 5m to grant • Granted Apr 14, 2026
Patent 12602923
METHODS AND APPARATUS TO PROVIDE AN EFFICIENT SAFETY MECHANISM FOR SIGNAL PROCESSING HARDWARE
2y 5m to grant • Granted Apr 14, 2026
Patent 12597137
DIGITAL SYNTHESIS OF HISTOLOGICAL STAINS USING MULTIPLEXED IMMUNOFLUORESCENCE IMAGING
2y 5m to grant • Granted Apr 07, 2026
Patent 12592311
SYSTEMS AND METHODS FOR PERFORMING OPTIMAL ANCHOR-PRIOR MATCHING OPERATIONS
2y 5m to grant • Granted Mar 31, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 85%
With Interview: 98% (+12.8%)
Median Time to Grant: 2y 6m
PTA Risk: Low
Based on 601 resolved cases by this examiner. Grant probability derived from career allow rate.
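The headline figures follow directly from the career data quoted above: 510 grants out of 601 resolved cases gives the 85% baseline, and adding the +12.8 percentage-point interview lift gives the 98% with-interview figure. A minimal sketch of that arithmetic, assuming the dashboard rounds to whole percentages:

```python
# Reproduce the panel's grant-probability figures from the career data.
# The rounding convention (nearest whole percent) is an assumption.
granted, resolved = 510, 601
interview_lift = 12.8  # percentage points, from the examiner panel

baseline = granted / resolved * 100          # about 84.9
with_interview = baseline + interview_lift   # about 97.7

print(round(baseline))        # 85
print(round(with_interview))  # 98
```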
