Last updated: May 29, 2026
Application No. 17/188,338
NEURAL NETWORK TRAINING TECHNIQUE

Final Rejection: §101, §103, §112
Filed: Mar 01, 2021
Examiner: SPRAUL III, VINCENT ANTON
Art Unit: 2129
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Nvidia Corporation
OA Round: 4 (Final)
Interview Optional

Interview already conducted in this application's prosecution history. This examiner has a 60% grant rate with a +26.7% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 37 resolved cases, 2023–2026
Examiner Intelligence

SPRAUL III, VINCENT ANTON
Career Allowance Rate: 60% of resolved cases (22 granted / 37 resolved), +4.5% vs TC avg
Interview Lift: +26.7% (strong)
Avg Prosecution: 4y 4m
19 currently pending
Statute-Specific Performance

§101: 2.8% (-37.2% vs TC avg)
§103: 93.8% (+53.8% vs TC avg)
§102: 0.6% (-39.4% vs TC avg)
§112: 1.7% (-38.3% vs TC avg)
Tech Center average is an estimate. Based on career data from 37 resolved cases.
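The examiner figures above can be cross-checked with simple arithmetic. The sketch below assumes that each "vs TC avg" delta is a plain percentage-point difference, which the page does not define, and recovers the Tech Center baseline each statute figure would imply. The names `statute_stats` and `implied_baseline` are illustrative only.

```python
# Figures copied from the statistics above: (examiner rate %, delta vs TC avg).
statute_stats = {
    "101": (2.8, -37.2),
    "103": (93.8, 53.8),
    "102": (0.6, -39.4),
    "112": (1.7, -38.3),
}


def implied_baseline(rate, delta):
    """Baseline implied by rate = baseline + delta (assumed percentage points)."""
    return round(rate - delta, 1)


baselines = {s: implied_baseline(r, d) for s, (r, d) in statute_stats.items()}

# Allowance rate from the raw counts: granted / resolved.
allowance_rate = 22 / 37 * 100

for statute, base in baselines.items():
    print(f"§{statute}: implied TC average {base}%")
print(f"Allowance rate from counts: {allowance_rate:.1f}%")
```

Under that assumption, all four statutes imply the same 40% baseline, and 22 of 37 comes to about 59.5%, which the page presents as 60%.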
Office Action

§101 §103 §112
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Regarding the rejection of claims under 35 U.S.C. 112(b), amendments to the claims have overcome the previous rejections; however, new grounds of rejection under 35 U.S.C. 112(b) have been found, which are given below.

Regarding the rejection of claims under 35 U.S.C. 101, Applicant submits that amended claims “provide a technological solution that is more than mere mental operations being performed by a computer, but rather are concrete improvements to the classification of objects in images and generation of associated text.”
Examiner respectfully disagrees. Claim 1 recites a process of examining an image, detecting and classifying objects in that image, and further identifying locations in the image associated with the classified objects: “detect one or more objects in one or more images […] determine a classification of the one or more detected objects […] based on the one or more images and textual data corresponding to the one or more detected objects […] and generate a saliency map, indicating locations associated with the one or more detected objects, based on the classification.” This process can be performed by a person using observation, evaluation, and judgement. The remaining elements of the claim (“circuitry to use one or more neural networks to,” “using a decoder of the one or more neural networks,” and “using a cross-attention encoder of the one or more neural networks”) merely broadly recite the inclusion of machine learning techniques in implementing the steps of the mental process, without any detail that would make the claimed process particular or that would be recognized as an improvement.
The argument is therefore found unpersuasive.

Regarding the rejection of claims under 35 U.S.C. 102 and 103, Applicant’s arguments are directed towards amended claims that have not been previously examined. New grounds of rejection under 35 U.S.C. 103 are provided below.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 3 and 17-23 rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Regarding claim 3:
	Claim 3 recites “wherein a first portion and a second portion of the one or more neural networks are trained in parallel to encode features of one or more images and the textual data to a shared latent space.” Examiner considers “to encode features of one or more images” indefinite as to whether these “one or more images” are the same “one or more images” recited in claim 1. If such identification is intended, Examiner respectfully suggests amending the claim to recite “to encode features of the one or more images.”

Regarding claim 17:
	Claim 17 recites in relevant part:

infer a condition of one or more objects in one or more images using a decoder of the one or more neural networks; 

determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects

	Examiner finds the phrase “the one or more detected objects” renders the claim indefinite as no prior limitation recites “detected objects,” and therefore it is not definite whether “the one or more detected objects” are the same as the “one or more objects in one or more images” of the previous limitation.

Regarding claims 18-23:
	These claims are rejected by dependency on claim 17.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-31 rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Analysis is provided for the claims under the guidelines of MPEP 2106.

Regarding claim 1:
Step 1:
The claim recites “One or more processors, comprising.” Thus the claim is to a manufacture, which is a statutory category of invention.
Step 2A prong 1:
The element (bold only) “circuitry to use one or more neural networks to detect one or more objects in one or more images using a decoder of the one or more neural networks,” in its broadest reasonable interpretation, recites a mental process. This process could be performed using observation and judgment.
The element (bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects,” in its broadest reasonable interpretation, recites a mental process. A person could classify objects in an image based on the images and associated textual data, using observation and judgment.
The element “and generate a saliency map, indicating locations associated with the one or more detected objects, based on the classification,” in its broadest reasonable interpretation, recites a mental process. A saliency map highlights locations of interest in an image. A person could highlight locations of interest in an image based on a classification of the one or more objects in the one or more images, using both the image itself and textual data related to the image, using observation and judgment.
Thus, the claim recites an abstract idea.
Step 2A prong 2:
The element (bold only) “circuitry to use one or more neural networks to detect one or more objects in one or more images using a decoder of the one or more neural networks,” recites the use of neural networks at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element (bold only) “circuitry to use one or more neural networks to detect one or more objects in one or more images using a decoder of the one or more neural networks” recites the use of a decoder at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element (bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects” recites the use of a cross-attention encoder at a high level of generality. No particular method of cross-attention encoding or use of the resulting encoding is described. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
Thus, the additional elements merely recite the use of a computer as a tool to perform the abstract idea. Taken alone, the additional elements do not integrate the abstract idea into a practical application. Considering the elements together as an ordered combination adds nothing that is not present from examining the elements individually. The elements, individually or together, do not describe an improvement in the functioning of technology.
Step 2B:
The claim as a whole does not amount to significantly more than the recited judicial exception. 
These additional claim elements recite mere instructions to apply the abstract idea:
(bold only) “circuitry to use one or more neural networks to detect one or more objects in one or more images using a decoder of the one or more neural networks”
(bold only) “circuitry to use one or more neural networks to detect one or more objects in one or more images using a decoder of the one or more neural networks” 
(bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects” 
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.
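The step-by-step analysis above follows the eligibility flow of MPEP 2106. As a reading aid, that flow can be sketched as a small decision function. The class and field names below are hypothetical, and the boolean inputs stand in for legal findings that the Examiner reaches by argument, not by computation.

```python
from dataclasses import dataclass


@dataclass
class ClaimFindings:
    """Yes/no findings for one claim (field names are hypothetical)."""
    statutory_category: bool      # Step 1: process, machine, manufacture, or composition
    recites_exception: bool       # Step 2A prong 1: recites a judicial exception
    practical_application: bool   # Step 2A prong 2: exception integrated into a practical application
    significantly_more: bool      # Step 2B: additional elements amount to significantly more


def eligibility(f: ClaimFindings) -> str:
    """Walk the steps in order and report where the analysis ends."""
    if not f.statutory_category:
        return "ineligible (Step 1)"
    if not f.recites_exception:
        return "eligible (Step 2A prong 1)"
    if f.practical_application:
        return "eligible (Step 2A prong 2)"
    if f.significantly_more:
        return "eligible (Step 2B)"
    return "ineligible (Step 2B)"


# Findings stated by the Examiner for claim 1 in the analysis above.
claim_1 = ClaimFindings(
    statutory_category=True,
    recites_exception=True,
    practical_application=False,
    significantly_more=False,
)
print(eligibility(claim_1))  # ineligible (Step 2B)
```

A response that persuades the Examiner at either prong 2 or Step 2B changes the outcome, which is why the arguments quoted earlier focus on a claimed technological improvement.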

Regarding claim 2:
For step 2A prong 1, claim 2 further limits claim 1 and the same elements in claim 2 still recite an abstract idea.
The element (bold only) “wherein the decoder is to generate the saliency map based on a prediction of one or more portions of the textual data, wherein the prediction uses the one or more images and the textual data” in its broadest reasonable interpretation, recites a mental process. A person could highlight locations of interest in an image based on a prediction produced from the images and associated text, using observation and judgment.
Thus the element adds to the abstract idea.
For step 2A prong 2, the element (bold only) “wherein the decoder is to generate the saliency map based on a prediction of one or more portions of the textual data, wherein the prediction uses the one or more images and the textual data” recites the use of a decoder at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element (bold only) “wherein the decoder is to generate the saliency map based on a prediction of one or more portions of the textual data, wherein the prediction uses the one or more images and the textual data” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 3:
For step 2A prong 1, claim 3 further limits claim 1 and the same elements in claim 3 still recite an abstract idea.
Step 2A prong 2:
The element “wherein a first portion and a second portion of the one or more neural networks are trained in parallel” recites simultaneous training of neural networks at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element “to encode features of one or more images and the textual data to a shared latent space” recites a combined encoding at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
Step 2B:
The claim as a whole does not amount to significantly more than the recited judicial exception. 
These additional claim elements recite mere instructions to apply the abstract idea:
“wherein a first portion and a second portion of the one or more neural networks are trained in parallel”
“to encode features of one or more images and the textual data to a shared latent space”
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.
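For readers unfamiliar with the term, a "shared latent space" generally means that features from two modalities are mapped into vectors of the same dimension so they can be compared directly. The sketch below is a generic illustration under stated assumptions, not the applicant's method: the weight matrices are hypothetical fixed values (in practice they would be learned), and cosine similarity is just one common way to compare the resulting vectors.

```python
import math


def project(vec, weights):
    """Linear map: each row of `weights` produces one output coordinate."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical weights: 3-d image features and 4-d text features
# are both mapped into the same 2-d latent space.
W_image = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
W_text = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

image_latent = project([1.0, 2.0, 0.0], W_image)      # [1.0, 2.0]
text_latent = project([1.0, 1.0, 1.0, 1.0], W_text)   # [1.0, 2.0]

# Both modalities now live in the same space, so they can be compared.
similarity = cosine(image_latent, text_latent)
print(image_latent, text_latent, round(similarity, 6))
```

The claim language quoted above recites the encoding only at this level of abstraction, which is the basis for the Examiner's "high level of generality" characterization.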

Regarding claim 4:
For step 2A prong 1, claim 4 further limits claim 1 and the same elements in claim 4 still recite an abstract idea.
The element “wherein the textual data comprises textual descriptions of the one or more images” further limits the mental processes in claim 1 but they remain mental processes. 
For step 2A prong 2, and step 2B, no further elements remain to be considered. The claim as a whole does not amount to significantly more than the recited judicial exception and is ineligible under 35 U.S.C. 101.

Regarding claim 5:
For step 2A prong 1, claim 5 further limits claim 1 and the same elements in claim 5 still recite an abstract idea.
For step 2A prong 2, the further element “wherein the one or more neural networks comprise a cross-attention encoder, wherein a query input to the cross-attention encoder comprises output from a second portion of the one or more neural networks, and wherein key and value input to the cross-attention encoder comprises output from a first portion of the […]” recites the application of a cross-attention encoder at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element “wherein the one or more neural networks comprise a cross-attention encoder, wherein a query input to the cross-attention encoder comprises output from a second portion of the one or more neural networks, and wherein key and value input to the cross-attention encoder comprises output from a first portion of the […]” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.
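The query/key/value arrangement recited in claim 5 matches the general shape of scaled dot-product cross-attention, where queries come from one source and keys and values from another. The following is a minimal generic sketch, not the claimed implementation. It assumes, for illustration only, that the second portion yields text-token features and the first portion yields image-region features; the toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries and keys/values from different sources."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical toy features: 2 text tokens (queries), 3 image regions (keys and values).
text_feats = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

out, attn = cross_attention(text_feats, image_feats, image_feats)
for token_idx, w in enumerate(attn):
    print(f"token {token_idx} attends most to region {w.index(max(w))}")
```

Each row of `attn` sums to one and indicates how strongly a text token attends to each image region.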

Regarding claim 6:
For step 2A prong 1, claim 6 further limits claim 1 and the same elements in claim 6 still recite an abstract idea.
For step 2A prong 2, the further element “wherein the one or more neural networks comprise a decoder to generate a saliency map based, at least in part, on output of a cross-attention encoder” recites the application, at a high level of generality, of a cross-attention encoder to the abstract idea of creating a saliency map. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element “wherein the one or more neural networks comprise a decoder to generate a saliency map based, at least in part, on output of a cross-attention encoder” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.
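As the analysis of claim 1 notes, a saliency map highlights locations of interest in an image. One simple way to picture this, offered only as an illustration and not as the decoder recited in the claim, is to normalize per-region relevance scores onto a grid and mark the regions above a threshold. The scores, grid size, and 0.5 threshold below are all hypothetical.

```python
def saliency_map(scores, rows, cols):
    """Min-max normalize flat per-region scores into a rows x cols grid in [0, 1]."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
    norm = [(s - lo) / span for s in scores]
    return [norm[r * cols:(r + 1) * cols] for r in range(rows)]


# Hypothetical relevance scores for a 2 x 3 grid of image regions.
grid = saliency_map([0.1, 0.9, 0.1, 0.7, 0.1, 0.1], rows=2, cols=3)

# Regions whose normalized score reaches the (hypothetical) threshold.
salient = [(r, c) for r, row in enumerate(grid) for c, v in enumerate(row) if v >= 0.5]
print(salient)  # [(0, 1), (1, 0)]
```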

Regarding claim 7:
For step 2A prong 1, claim 7 further limits claim 1 and the same elements in claim 7 still recite an abstract idea.
The element “wherein the textual data represents a textual document” further limits the mental processes in claim 1 but they remain mental processes.
The element (bold only) “wherein an output of the one or more neural networks comprises a classification of a condition depicted in the one or more images and described in the textual document,” in its broadest reasonable interpretation, recites a mental process. A person could examine an image and a textual description of the image and identify a class of condition depicted in the image, using observation, evaluation, and judgement.
For step 2A prong 2, the element (bold only) “wherein an output of the one or more neural networks comprises a classification of a condition depicted in the one or more images and described in the textual document” recites the use of neural networks at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element (bold only) “wherein an output of the one or more neural networks comprises a classification of a condition depicted in the one or more images and described in the textual document” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 8:
For step 2A prong 1, claim 8 further limits claim 1 and the same elements in claim 8 still recite an abstract idea.
The element (bold only) “wherein an output of the one or more neural networks comprises information identifying a condition depicted in the one or more images,” in its broadest reasonable interpretation, recites a mental process. A person could examine an image and identify a condition depicted in the image, using observation, evaluation, and judgement.
For step 2A prong 2, the element (bold only) “wherein an output of the one or more neural networks comprises information identifying a condition depicted in the one or more images” recites the use of neural networks at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element (bold only) “wherein an output of the one or more neural networks comprises information identifying a condition depicted in the one or more images” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 9:
For step 1, the claim recites “A system, comprising: one or more processors to.” Thus the claim is to a machine, which is a statutory category of invention. The claim is otherwise analogous to claim 1 and is rejected by the same arguments.

Regarding claim 10:
For step 2A prong 1, claim 10 further limits claim 9 and the same elements in claim 10 still recite an abstract idea.
Step 2A prong 2:
The element “wherein a first portion and a second portion of the one or more neural networks are trained in parallel” recites simultaneous training of neural networks at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element “wherein the second portion of the one or more neural networks is taught during training to provide information for training the first portion of the one or more neural network” recites training a neural network with the output of another neural network at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
Step 2B:
The claim as a whole does not amount to significantly more than the recited judicial exception. 
These additional claim elements recite mere instructions to apply the abstract idea:
“wherein a first portion and a second portion of the one or more neural networks are trained in parallel”
“wherein the second portion of the one or more neural networks is taught during training to provide information for training the first portion of the one or more neural network”
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 11:
For step 1, the claim recites “A system, comprising: one or more processors to.” Thus the claim is to a machine, which is a statutory category of invention. The claim is otherwise analogous to claim 2 and is rejected by the same arguments.

Regarding claim 12:
For step 2A prong 1, claim 12 further limits claim 9 and the same elements in claim 12 still recite an abstract idea.
The element “wherein the textual data comprises textual descriptions of the one or more images” further limits the mental processes in claim 9 but they remain mental processes.
For step 2A prong 2, and step 2B, no further elements remain to be considered. The claim as a whole does not amount to significantly more than the recited judicial exception and is ineligible under 35 U.S.C. 101.

Regarding claim 13:
For step 2A prong 1, claim 13 further limits claim 9 and the same elements in claim 13 still recite an abstract idea.
For step 2A prong 2, the further element “wherein the one or more neural networks comprise a cross-attention encoder, wherein a query input to the cross-attention encoder comprises output from a second portion of the one or more neural networks, and wherein key and value input to the cross-attention encoder comprises output from a first portion of the […]” recites the application of a cross-attention encoder at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element “wherein the one or more neural networks comprise a cross-attention encoder, wherein a query input to the cross-attention encoder comprises output from a second portion of the one or more neural networks, and wherein key and value input to the cross-attention encoder comprises output from a first portion of the […]” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 14:
For step 2A prong 1, claim 14 further limits claim 9 and the same elements in claim 14 still recite an abstract idea.
The element (bold only) “wherein the one or more neural networks comprise a decoder to generate information indicative of a region of an image” in its broadest reasonable interpretation, recites a mental process. This process could be performed using observation and judgment.
For step 2A prong 2, the further element (bold only) “wherein the one or more neural networks comprise a decoder to generate information indicative of a region of an image” recites a neural network acting as a decoder at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element (bold only) “wherein the one or more neural networks comprise a decoder to generate information indicative of a region of an image” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 15:
For step 2A prong 1, claim 15 further limits claim 9 and the same elements in claim 15 still recite an abstract idea.
The element (bold only) “wherein an output of the one or more neural networks comprises a classification of a condition depicted in an image,” in its broadest reasonable interpretation, recites a mental process. A person could examine an image and identify a class of condition depicted in the image, using observation, evaluation, and judgement.
For step 2A prong 2, the element (bold only) “wherein an output of the one or more neural networks comprises a classification of a condition depicted in an image” recites the use of neural networks at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element (bold only) “wherein an output of the one or more neural networks comprises a classification of a condition depicted in an image” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 16:
For step 2A prong 1, claim 16 further limits claim 9 and the same elements in claim 16 still recite an abstract idea.
The element “wherein the one or more images comprise a diagnostic image and the textual data comprises a diagnostic report corresponding to the diagnostic image” further limits the mental processes in claim 9 but they remain mental processes.
For step 2A prong 2, and step 2B, no further elements remain to be considered. The claim as a whole does not amount to significantly more than the recited judicial exception and is ineligible under 35 U.S.C. 101.

Regarding claim 17:
Step 1:
The claim recites “One or more processors, comprising.” Thus the claim is to a manufacture, which is a statutory category of invention.
Step 2A prong 1:
The element (bold only) “circuitry to use one or more neural networks to infer a condition of one or more objects in one or more images using a decoder of the one or more neural networks,” in its broadest reasonable interpretation, recites a mental process. This process could be performed using observation and judgment.
The element (bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects,” in its broadest reasonable interpretation, recites a mental process. A person could classify objects in an image based on the images and associated textual data. This process could be performed using observation and judgment.
The element (bold only) “generate a saliency map, indicating locations associated with the one or more detected objects, based on the classification” in its broadest reasonable interpretation, recites a mental process. A saliency map highlights locations of interest in an image. A person could highlight locations of interest in an image based on a classification of the one or more objects in the one or more images, using both the image itself and textual data related to the image, using observation and judgment.
Thus, the claim recites an abstract idea.
Step 2A prong 2:
The element (bold only) “circuitry to use one or more neural networks to infer a condition of one or more objects in one or more images using a decoder of the one or more neural networks,” recites the use of neural networks at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element (bold only) “circuitry to use one or more neural networks to infer a condition of one or more objects in one or more images using a decoder of the one or more neural networks” recites the use of a decoder at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element (bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects” recites the use of a cross-attention encoder at a high level of generality. No particular method of cross-attention encoding or use of the resulting encoding is described. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
Thus, the additional elements merely recite the use of a computer as a tool to perform the abstract idea. Taken alone, the additional elements do not integrate the abstract idea into a practical application. Considering the elements together as an ordered combination adds nothing that is not present from examining the elements individually. The elements, individually or together, do not describe an improvement in the functioning of technology.
Step 2B:
The claim as a whole does not amount to significantly more than the recited judicial exception. 
These additional claim elements recite mere instructions to apply the abstract idea:
(bold only) “circuitry to use one or more neural networks to infer a condition of one or more objects in one or more images using a decoder of the one or more neural networks” 
(bold only) “circuitry to use one or more neural networks to infer a condition of one or more objects in one or more images using a decoder of the one or more neural networks” 
(bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects”
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 18:
For step 2A prong 1, claim 18 further limits claim 17 and the same elements in claim 18 still recite an abstract idea.
For step 2A prong 2, the further element “wherein a first portion of the one or more neural networks is trained to encode features of the one or more images and a second portion of the one or more neural networks is trained to encode features of the textual data” recites feature extraction at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element “wherein a first portion of the one or more neural networks is trained to encode features of the one or more images and a second portion of the one or more neural networks is trained to encode features of the textual data” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 19:
For step 2A prong 1, claim 19 further limits claim 18 and the same elements in claim 19 still recite an abstract idea.
For step 2A prong 2, the element “wherein the first portion of the one or more neural networks, and the second portion of the one or more neural networks, encode their respective inputs to a common latent space” recites a combined encoding at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element (bold only) “wherein the first portion of the one or more neural networks, and the second portion of the one or more neural networks, encode their respective inputs to a common latent space” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.
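For illustration only, the idea of two network portions encoding their respective inputs to a common latent space can be sketched as below. This is a minimal sketch under stated assumptions: each "portion" is reduced to a single fixed linear projection, and the names `project`, `cosine_similarity`, `W_image`, and `W_text`, as well as all numeric values, are hypothetical and are not taken from the claims or the cited references.

```python
import math

def project(vector, matrix):
    """Apply a linear map; `matrix` has one row per latent dimension."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical inputs of different dimensionality (assumed, for illustration).
image_features = [1.0, 2.0, 0.0]   # stand-in for image features (dimension 3)
text_features = [3.0, 1.0]         # stand-in for textual features (dimension 2)

# Hypothetical projection weights; in practice these would be learned.
W_image = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]        # maps dimension 3 to latent dimension 2
W_text = [[0.0, 1.0],
          [1.0, 0.0]]              # maps dimension 2 to latent dimension 2

# Both modalities now live in the same 2-dimensional latent space,
# so they can be compared directly.
z_image = project(image_features, W_image)
z_text = project(text_features, W_text)
similarity = cosine_similarity(z_image, z_text)
```

Because both outputs have the same dimensionality, a single similarity measure can compare an image with a text, which is the practical consequence of a shared latent space.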

Regarding claim 20:
For step 2A prong 1, claim 20 further limits claim 17 and the same elements in claim 20 still recite an abstract idea.
For step 2A prong 2, the further element “wherein the one or more neural networks are is trained based, at least in part, on output of a cross-attention encoder using, as input to the cross-attention encoder, output of an image encoder and output of a language encoder” recites the application, at a high level of generality, of a cross-attention encoder to multi-modal input. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element “wherein the one or more neural networks are is trained based, at least in part, on output of a cross-attention encoder using, as input to the cross-attention encoder, output of an image encoder and output of a language encoder” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 21:
For step 2A prong 1, claim 21 further limits claim 17 and the same elements in claim 21 still recite an abstract idea.
The element “wherein the one or more images comprises diagnostic images and the textual data comprises diagnostic reports corresponding to the diagnostic images” further limits the mental processes in claim 1 but they remain mental processes.
For step 2A prong 2, and step 2B, no further elements remain to be considered. The claim as a whole does not amount to significantly more than the recited judicial exception and is ineligible under 35 U.S.C. 101.

Regarding claim 22:
For step 2A prong 1, claim 22 further limits claim 17 and the same elements in claim 22 still recite an abstract idea.
The element “wherein the inferred condition comprises information indicative of an area of interest in the one or more images” further limits the mental processes in claim 1 but they remain mental processes.
For step 2A prong 2, and step 2B, no further elements remain to be considered. The claim as a whole does not amount to significantly more than the recited judicial exception and is ineligible under 35 U.S.C. 101.

Regarding claim 23:
For step 2A prong 1, claim 23 further limits claim 17 and the same elements in claim 23 still recite an abstract idea.
Step 2A prong 2:
The element “wherein a first portion of the one or more neural networks is trained to encode features of the one or more images and a second portion of the one or more neural networks is trained to encode features of the textual data” recites feature extraction at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element “wherein the first portion of the neural network one or more neural networks, after training, is capable of inferring the information condition independently of the second portion” recites independent neural network inference at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
Step 2B:
The claim as a whole does not amount to significantly more than the recited judicial exception. 
These additional claim elements recite mere instructions to apply the abstract idea:
“wherein a first portion of the one or more neural networks is trained to encode features of the one or more images and a second portion of the one or more neural networks is trained to encode features of the textual data” 
“wherein the first portion of the neural network one or more neural networks, after training, is capable of inferring the information condition independently of the second portion”
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 24:
Step 1:
The claim recites “A method, comprising.” Thus the claim is to a process, which is a statutory category of invention.
Step 2A prong 1:
The element (bold only) “using one or more neural networks to diagnose a condition depicted in a diagnostic image using a decoder of the one or more neural networks,” in its broadest reasonable interpretation, recites a mental process. This process could be performed using observation and judgment.
The element (bold only) “determine a classification of the condition using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the condition,” in its broadest reasonable interpretation, recites a mental process. A person could classify a condition depicted in an image based on the images and associated textual data, using observation and judgment.
The element “and generate a saliency map, indicating locations associated with the condition, based on the classification and a prediction of one or more portions of a set of diagnostic reports, wherein the prediction uses the one or more images and the textual data” in its broadest reasonable interpretation, recites a mental process. A saliency map highlights locations of interest in an image. A person could highlight locations of interest in an image based on a classification of the one or more objects in the one or more images, using the image itself, the textual data related to the image, and a prediction produced from a set of diagnostic reports, using observation and judgment.
Thus, the claim recites an abstract idea.
Step 2A prong 2:
The element (bold only) “using one or more neural networks to diagnose a condition depicted in a diagnostic image using a decoder of the one or more neural networks” recites the use of neural networks at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element (bold only) “using one or more neural networks to diagnose a condition depicted in a diagnostic image using a decoder of the one or more neural networks” recites the use of a decoder at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element (bold only) “determine a classification of the condition using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the condition” recites the use of a cross-attention encoder at a high level of generality. No particular method of cross-attention encoding or use of the resulting encoding is described. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
Thus, the additional elements merely recite the use of a computer as a tool to perform the abstract idea. Taken alone, the additional elements do not integrate the abstract idea into a practical application. Considering the elements together as an ordered combination adds nothing that is not present from examining the elements individually. The elements, individually or together, do not describe an improvement in the functioning of technology.
Step 2B:
The claim as a whole does not amount to significantly more than the recited judicial exception. 
These additional claim elements recite mere instructions to apply the abstract idea:
(bold only) “using one or more neural networks to diagnose a condition depicted in a diagnostic image using a decoder of the one or more neural networks” 
(bold only) “using one or more neural networks to diagnose a condition depicted in a diagnostic image using a decoder of the one or more neural networks”
(bold only) “determine a classification of the condition using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the condition”
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 25:
For step 2A prong 1, claim 25 further limits claim 24 and the same elements in claim 25 still recite an abstract idea.
Step 2A prong 2:
The element “wherein a first portion of the one or more neural networks is trained in parallel with a second portion of the one or more neural networks” recites simultaneous training of neural networks at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
The element “wherein the second portion of the one or more neural networks is trained to encode features of the set of diagnostic reports” recites feature extraction at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
Step 2B:
The claim as a whole does not amount to significantly more than the recited judicial exception. 
These additional claim elements recite mere instructions to apply the abstract idea:
“wherein a first portion of the one or more neural networks is trained in parallel with a second portion of the one or more neural networks” 
“wherein the second portion of the one or more neural networks is trained to encode features of the set of diagnostic reports”
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 26:
For step 2A prong 1, claim 26 further limits claim 25 and the same elements in claim 26 still recite an abstract idea.
For step 2A prong 2, the element “wherein the first and second portions of the one or more neural networks are trained to encode features of the one or more images and the textual to a shared latent space” recites a combined encoding at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element “wherein the first and second portions of the one or more neural networks are trained to encode features of the one or more images and the textual to a shared latent space” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 27:
For step 2A prong 1, claim 27 further limits claim 24 and the same elements in claim 27 still recite an abstract idea.
For step 2A prong 2, the further element “providing, as input to a cross-attention encoder, a query input comprising output from a language encoder, and key and value input comprising output from an image encode” recites the application of a cross-attention encoder at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element “providing, as input to a cross-attention encoder, a query input comprising output from a language encoder, and key and value input comprising output from an image encode” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.
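For illustration only, the arrangement recited in this element (query input taken from a language encoder, key and value input taken from an image encoder) can be sketched as single-head scaled dot-product attention. This is a minimal sketch under stated assumptions: learned projection matrices are omitted, and the function `cross_attention` and the toy vectors `text_tokens` and `image_regions` are hypothetical and are not taken from the claims or the cited references.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values.

    Returns the attended outputs (one per query) and the attention weights.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        attn = softmax(scores)
        out = [sum(a * v[j] for a, v in zip(attn, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights

# Hypothetical encoder outputs (assumed, for illustration).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]                 # language side: 2 tokens
image_regions = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]   # image side: 3 regions

# Query comes from the language side; key and value come from the image side.
attended, attn_weights = cross_attention(text_tokens, image_regions, image_regions)
```

Each row of `attn_weights` indicates how strongly one text token attends to each image region; in this toy input the first token attends mostly to the first region.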

Regarding claim 28:
For step 2A prong 1, claim 28 further limits claim 24 and the same elements in claim 28 still recite an abstract idea.
For step 2A prong 2, the element “training a language encoder of the one or more neural networks to encode features of the diagnostic reports to a latent space shared with output of an image encoder” recites a combined encoding at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element “training a language encoder of the one or more neural networks to encode features of the diagnostic reports to a latent space shared with output of an image encoder” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 29:
For step 2A prong 1, claim 29 further limits claim 24 and the same elements in claim 29 still recite an abstract idea.
The element (bold only) “decoding output of an encoder to generate information summarizing the condition,” in its broadest reasonable interpretation, recites a mental process. This process could be performed using observation and judgment.
For step 2A prong 2, the element (bold only) “decoding output of an encoder to generate information summarizing the condition,” recites a combined encoding at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element (bold only) “decoding output of an encoder to generate information summarizing the condition” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 30:
For step 2A prong 1, claim 30 further limits claim 24 and the same elements in claim 30 still recite an abstract idea.
The element (bold only) “wherein the one or more neural networks comprises a decoder to generate information indicative of a region in the diagnostic image that depicts the condition,” in its broadest reasonable interpretation, recites a mental process. This process could be performed using observation and judgment.
For step 2A prong 2, the element (bold only) “wherein the one or more neural networks comprises a decoder to generate information indicative of a region in the diagnostic image that depicts the condition,” recites a neural network acting as a decoder at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element (bold only) “wherein the one or more neural networks comprises a decoder to generate information indicative of a region in the diagnostic image that depicts the condition” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Regarding claim 31:
For step 2A prong 1, claim 31 further limits claim 24 and the same elements in claim 31 still recite an abstract idea.
The element (bold only) “wherein diagnoses of the condition comprises identifying one or more categories of conditions determined, by the one or more neural networks, to be associated with a region of the diagnostic image,” in its broadest reasonable interpretation, recites a mental process. This process could be performed using observation and judgment.
For step 2A prong 2, the element (bold only) “wherein diagnoses of the condition comprises identifying one or more categories of conditions determined, by the one or more neural networks, to be associated with a region of the diagnostic image,” recites the use of a neural network at a high level of generality. The element thus merely recites the use of a computer as a tool to perform the abstract idea, and is equivalent to adding the words “apply it” or the equivalent to the judicial exception (MPEP 2106.05(f)).
For step 2B, the claim as a whole does not amount to significantly more than the recited judicial exception. The additional element (bold only) “wherein diagnoses of the condition comprises identifying one or more categories of conditions determined, by the one or more neural networks, to be associated with a region of the diagnostic image” recites mere instructions to apply the abstract idea.
Even when considered in combination, the additional elements represent mere instructions to apply the abstract idea to a computer, which do not provide an inventive concept. The claim is not eligible under 35 U.S.C. 101.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-6, 9-10, and 11-14 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al., “Learning Deep Features for Discriminative Localization,” 2015, arXiv:1512.04150v1 (hereafter Zhou) in view of Wei et al., “Multi-Modality Cross Attention Network for Image and Sentence Matching,” 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, doi: 10.1109/CVPR42600.2020.01095 (hereafter Wei).

Regarding claim 1 and analogous claim 9:
Zhou teaches:
(bold only) “One or more processors, comprising circuitry to use one or more neural networks detect one or more objects in one or more images using a decoder of the one or more neural networks”: Zhou, section 1.1, paragraph 5, “There has been a number of recent works [29, 14, 4, 33] that visualize the internal representation learned by CNNs in an attempt to better understand their properties. Zeiler et al [29] use deconvolutional networks to visualize what patterns activate each unit. Zhou et al. [33] show that CNNs learn object detectors [detect one or more objects in one or more images using a decoder of the one or more neural networks] while being trained to recognize scenes, and demonstrate that the same network can perform both scene recognition and object localization in a single forward-pass.”
(bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects”: Zhou, section 2., paragraph 3, “For a given image, let fk(x, y) represent the activation of unit k in the last convolutional layer at spatial location (x, y). Then, for unit k, the result of performing global average pooling, Fk is Σx, y fk(x, y). Thus, for a given class c, the input to the softmax, Sc, is Σk wck Fk where wck is the weight corresponding to class c for unit k. Essentially, wck indicates the importance of Fk for class c [determine a classification of the one or more detected objects using … one or more neural networks and based on the one or more images … corresponding to the one or more detected objects].”
“generate a saliency map, indicating locations associated with the one or more detected objects, based on the classification”: Zhou, section 2, paragraph 1, “In this section, we describe the procedure for generating class activation maps (CAM) using global average pooling (GAP) in CNNs. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category (e.g., Fig. 3) [generate a saliency map, indicating locations associated with the one or more detected objects, based on the classification].”
Zhou does not explicitly teach:
(bold only) “One or more processors, comprising circuitry to use one or more neural networks detect one or more objects in one or more images using a decoder of the one or more neural networks”
(bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects”
Wei teaches:
(bold only) “One or more processors, comprising circuitry to use one or more neural networks detect one or more objects in one or more images using a decoder of the one or more neural networks”: Wei, section 4.2, paragraph 1, “The proposed Multi-Modality Cross-Attention Network is implemented in PyTorch framework [27] with a NVIDIA GeForce GTX 2080Ti GPU [One or more processors, comprising circuitry].”
(bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects”: Wei, section 3.1, paragraph 1, “As shown in Figure 2, our Multi-Modality Cross Attention Network mainly consists of two modules, the self-attention module and the cross-attention module, demonstrated in the green dashed blocks and red dash dashed block in Figure 2, respectively. Given an image and sentence pair, we first feed the image into the bottom-up attention model [32] pre-trained on Visual Genome [20] to extract features for image regions. Meanwhile, we use Word-Piece tokens of each sentence as the fragments in the textual modality. Based on these extracted fine-grained representations for image regions and sentence words [one or more images and textual data], we model the intra-modality relationship with the Self-Attention Module, and adopt the Cross-Attention Module [cross-attention] to model the intermodality and intra-modality relationships for image regions and sentence words. […] As shown in Figure 2, we get two pairs of embeddings for the given image-sentence pair (i0, c0) and (i1, c1), which are used for image and sentence matching [a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data].”
Wei and Zhou are analogous arts as they are both related to image analysis. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the multi-modal cross-attention of Wei with the teachings of Zhou to arrive at the present invention, in order to improve model performance, as stated in Wei, Abstract, “In the proposed MM-CA, we design a novel cross-attention mechanism, which is able to exploit not only the intra-modality relationship within each modality, but also the inter-modality relationship between image regions and sentence words to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks including Flickr30K and MS-COCO demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.”

Regarding claim 2 and analogous claim 11:
Zhou as modified by Wei teaches “The one or more processors of claim 1.”
Zhou further teaches (bold only) “wherein the decoder is to generate the saliency map based on a prediction of one or more portions of the textual data, wherein the prediction uses the one or more images and the textual data”: Zhou, section 2., paragraph 3, “For a given image, let fk(x, y) represent the activation of unit k in the last convolutional layer at spatial location (x, y). Then, for unit k, the result of performing global average pooling, Fk is Σx, y fk(x, y). Thus, for a given class c, the input to the softmax, Sc, is Σk wck Fk where wck is the weight corresponding to class c for unit k. Essentially, wck indicates the importance of Fk for class c. Finally the output of the softmax for class c, Pc is given by
$$P_c = \frac{\exp(S_c)}{\sum_{c} \exp(S_c)}.$$
Here we ignore the bias term: we explicitly set the input bias of the softmax to 0 as it has little to no impact on the classification performance. By plugging Fk = Σx, y fk(x, y) into the class score, Sc, we obtain

$$S_c = \sum_{k} w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y} \sum_{k} w_k^c f_k(x, y).$$

We define Mc as the class activation map for class c, where each spatial element is given by

$$M_c(x, y) = \sum_{k} w_k^c f_k(x, y).$$

Thus, Sc = Σx, y Mc(x, y), and hence Mc(x, y) directly indicates the importance of the activation at spatial grid (x, y) leading to the classification of an image to class c.”
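For illustration only, the class activation map computation described in the quoted Zhou passage can be sketched as below. This is a minimal sketch under stated assumptions: activations are plain nested lists indexed as `features[k][x][y]`, pooling is the unnormalized sum used in the quoted text, the bias is ignored as in the quotation, and the function names and toy values are hypothetical rather than Zhou's implementation.

```python
def class_activation_map(features, class_weights):
    """Return M_c, where M_c[x][y] = sum over k of w_k^c * f_k(x, y).

    features[k][x][y] holds f_k(x, y); class_weights[k] holds w_k^c.
    """
    height = len(features[0])
    width = len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for k, unit in enumerate(features):
        for x in range(height):
            for y in range(width):
                cam[x][y] += class_weights[k] * unit[x][y]
    return cam

def class_score(features, class_weights):
    """Return S_c = sum over k of w_k^c * F_k, with F_k = sum over (x, y) of f_k(x, y)."""
    pooled = [sum(sum(row) for row in unit) for unit in features]
    return sum(w * f for w, f in zip(class_weights, pooled))

# Toy activations for two units on a 2x2 spatial grid (assumed values).
features = [
    [[1.0, 0.0], [0.0, 2.0]],   # f_0(x, y)
    [[0.0, 3.0], [1.0, 0.0]],   # f_1(x, y)
]
weights_c = [0.5, -1.0]         # hypothetical w_k^c for one class c

cam = class_activation_map(features, weights_c)
score = class_score(features, weights_c)

# The class score equals the sum of the map over all spatial locations.
cam_total = sum(sum(row) for row in cam)
```

The final line checks the identity stated at the end of the quotation: summing the map over all spatial locations recovers the class score.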
Wei further teaches (bold only) “wherein the decoder is to generate the saliency map based on a prediction of one or more portions of the textual data, wherein the prediction uses the one or more images and the textual data”: Wei, section 3.1, paragraph 1, “As shown in Figure 2, our Multi-Modality Cross Attention Network mainly consists of two modules, the self-attention module and the cross-attention module, demonstrated in the green dashed blocks and red dash dashed block in Figure 2, respectively. Given an image and sentence pair, we first feed the image into the bottom-up attention model [32] pre-trained on Visual Genome [20] to extract features for image regions. Meanwhile, we use Word-Piece tokens of each sentence as the fragments in the textual modality. Based on these extracted fine-grained representations for image regions and sentence words, we model the intra-modality relationship with the Self-Attention Module, and adopt the Cross-Attention Module to model the intermodality and intra-modality relationships for image regions and sentence words [a prediction of one or more portions of the textual data, wherein the prediction uses the one or more images and the textual data]. […] As shown in Figure 2, we get two pairs of embeddings for the given image-sentence pair (i0, c0) and (i1, c1), which are used for image and sentence matching.”
Wei and Zhou are combinable for the rationale given under claim 1.

Regarding claim 3:
Zhou as modified by Wei teaches “The one or more processors of claim 1.”
Wei further teaches “wherein a first portion and a second portion of the one or more neural networks are trained in parallel to encode features of one or more images and the textual data to a shared latent space”: Wei, Fig. 2, 

[Figure: Wei, Fig. 2 (image not reproduced)]

[showing the upper portion of the network training on image data and the lower portion training on textual data, combining to create embeddings, hence, wherein a first portion and a second portion of the one or more neural networks are trained in parallel to encode features of one or more images and the textual data to a shared latent space].
Wei and Zhou are combinable are the rationale given under claim 1.

Regarding claim 4 and analogous claim 12:
Zhou as modified by Wei teaches “The one or more processors of claim 1.”
Wei further teaches “wherein the textual data comprises textual descriptions of the one or more images”: Wei, section 1, paragraph 1, “This task has drawn remarkable attention and has been widely adopted to various applications [12, 43, 13, 48], e.g., finding similar sentences given an image query for image annotation and caption, and retrieving matched images with a sentence query for image search [wherein the textual data comprises textual descriptions of the one or more images].”
Wei and Zhou are combinable for the rationale given under claim 1.

Regarding claim 5 and analogous claim 13:
Zhou as modified by Wei teaches “The one or more processors of claim 1.”
Wei further teaches “wherein the one or more neural networks comprise a cross-attention encoder, wherein a query input to the cross-attention encoder comprises output from a second portion of the one or more neural networks, and wherein key and value input to the cross-attention encoder comprises output from a first portion of the one or more neural networks”: Wei, section 3.3, “for image I with the fine grained representation R = {r1, r2, ..., rk} [showing that R is an output representation of the first portion of the network] […] The BERT consists of multiple Transformer units, and its output E = {e1, e2, ..., en} naturally includes the intra-modality information [showing that E is an output representation of the second portion of the network]”; Wei, section 3.4, “In this section, we introduce how to model both the inter-modality and intra-modality relationships in a unified model with our Cross-Attention Module [comprise a cross-attention encoder] […]Here, the query [a query input to the cross-attention encoder comprises output from the second portion of the one or more neural networks], key and value [key and value input to the cross-attention encoder comprises output from the first portion of the one or more neural networks] for the fragments are formed with the following equations:

[Equation image (media_image5.png)]
”
Wei and Zhou are combinable for the rationale given under claim 1.
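For illustration only, the query/key/value arrangement recited in claim 5 can be sketched as single-head scaled dot-product attention. This is a minimal sketch under stated assumptions and is not Wei's implementation: the function names, the toy feature values, and the identity projection matrices are hypothetical, and the text branch (E) is assumed to supply the queries while the image branch (R) supplies the keys and values.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_feats, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention: queries from text, keys/values from image."""
    q = matmul(text_feats, w_q)    # shape (n_words, d)
    k = matmul(image_feats, w_k)   # shape (n_regions, d)
    v = matmul(image_feats, w_v)   # shape (n_regions, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose keys to (d, n_regions)
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # one distribution per word
    return matmul(weights, v), weights

# Hypothetical toy inputs: 3 word features (E) and 2 region features (R).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
R = [[1.0, 0.0], [0.0, 2.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # assumed projections, for readability

attended, weights = cross_attention(E, R, IDENTITY, IDENTITY, IDENTITY)
```

Each row of `weights` is a distribution over image regions for one word, so each row of `attended` is a text-conditioned mixture of image features.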

Regarding claim 6:
Zhou as modified by Wei teaches “The one or more processors of claim 1.”
Zhou further teaches (bold only) “wherein the one or more neural networks comprise a decoder to generate a saliency map based, at least in part, on output of a cross-attention encoder”: Zhou, section 2, paragraph 3, “For a given image, let fk(x, y) represent the activation of unit k in the last convolutional layer at spatial location (x, y) [output of a … encoder]. Then, for unit k, the result of performing global average pooling, Fk is Σx, y fk(x, y). Thus, for a given class c, the input to the softmax, Sc, is Σk wck Fk where wck is the weight corresponding to class c for unit k. Essentially, wck indicates the importance of Fk for class c. Finally the output of the softmax for class c, Pc is given by
[Equation (media_image1.png): Pc = exp(Sc) / Σc exp(Sc)]
. Here we ignore the bias term: we explicitly set the input bias of the softmax to 0 as it has little to no impact on the classification performance. By plugging Fk = Σx, y fk(x, y) into the class score, Sc, we obtain

[Equation (media_image2.png): Sc = Σk wck Σx, y fk(x, y) = Σx, y Σk wck fk(x, y)]

We define Mc as the class activation map for class c, where each spatial element is given by

[Equation (media_image3.png): Mc(x, y) = Σk wck fk(x, y)]

Thus, Sc = Σx, y Mc(x, y), and hence Mc(x, y) directly indicates the importance of the activation at spatial grid (x, y) leading to the classification of an image to class c [comprise a decoder to generate a saliency map].”
Wei further teaches (bold only) “wherein the one or more neural networks comprise a decoder to generate a saliency map based, at least in part, on output of a cross-attention encoder”: Wei, section 3.1, paragraph 1, “As shown in Figure 2, our Multi-Modality Cross Attention Network mainly consists of two modules, the self-attention module and the cross-attention module, demonstrated in the green dashed blocks and red dash dashed block in Figure 2, respectively. Given an image and sentence pair, we first feed the image into the bottom-up attention model [32] pre-trained on Visual Genome [20] to extract features for image regions. Meanwhile, we use Word-Piece tokens of each sentence as the fragments in the textual modality. Based on these extracted fine-grained representations for image regions and sentence words, we model the intra-modality relationship with the Self-Attention Module, and adopt the Cross-Attention Module [cross-attention] to model the intermodality and intra-modality relationships for image regions and sentence words. […] As shown in Figure 2, we get two pairs of embeddings for the given image-sentence pair (i0, c0) and (i1, c1), which are used for image and sentence matching [cross-attention encoder].”
Wei and Zhou are combinable for the rationale given under claim 1.
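For illustration only, the class activation map computation quoted from Zhou, section 2, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the two 2x2 feature maps, and the class weights are hypothetical toy values, and the pooling follows the quoted passage (Fk taken as the sum over spatial locations, bias ignored).

```python
import math

def class_activation_map(feature_maps, class_weights):
    """Mc(x, y) = sum over units k of wck * fk(x, y)."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
         for x in range(width)]
        for y in range(height)
    ]

def class_score(feature_maps, class_weights):
    """Sc = sum over k of wck * Fk, where Fk = sum over (x, y) of fk(x, y)."""
    pooled = [sum(sum(row) for row in fmap) for fmap in feature_maps]
    return sum(w * f for w, f in zip(class_weights, pooled))

def softmax(scores):
    """Pc = exp(Sc) / sum over classes of exp(Sc), computed stably."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical activations of two units in the last convolutional layer.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],  # unit 0 fires at the top-left location
    [[0.0, 0.0], [0.0, 2.0]],  # unit 1 fires at the bottom-right location
]
weights_a = [1.0, 0.0]  # assumed weights for a hypothetical class a
weights_b = [0.0, 1.0]  # assumed weights for a hypothetical class b

cam_a = class_activation_map(feature_maps, weights_a)
score_a = class_score(feature_maps, weights_a)
score_b = class_score(feature_maps, weights_b)
probs = softmax([score_a, score_b])
```

Summing `cam_a` over all locations gives `score_a`, which is the relation Sc = Σx, y Mc(x, y) stated in the quoted passage.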

Regarding claim 10:
Zhou as modified by Wei teaches “The system of claim 9.”
Wei further teaches “wherein a first portion and a second portion of the one or more neural networks are trained in parallel, and wherein the second portion of the one or more neural networks is taught during training to provide information for training the first portion of the one or more neural networks”: Wei, Fig. 2, 

[Figure (media_image4.png): Wei, Fig. 2]

[showing the upper portion of the network training on image data and the lower portion training on textual data, combining to create embeddings, hence, wherein a first portion and a second portion of the one or more neural networks are trained in parallel; the training of the portions is combined via the cross-attention module, hence, wherein the second portion of the one or more neural networks is taught during training to provide information for training the first portion of the one or more neural networks].
Wei and Zhou are combinable for the rationale given under claim 9.

Regarding claim 14:
Zhou as modified by Wei teaches “The system of claim 9.” 
Zhou further teaches “wherein the one or more neural networks comprise a decoder to generate information indicative of a region of an image”: Zhou, section 2, paragraph 1, “In this section, we describe the procedure for generating class activation maps (CAM) using global average pooling (GAP) in CNNs. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category (e.g., Fig. 3) [comprise a decoder to generate information indicative of a region of an image].”

Claims 7-8 and 15-31 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou as modified by Wei in view of PanQiao et al., “Image and Text Fusion for Character-based Breast Cancer Classification,” 2018, 2018 IEEE 20th International Conference on High Performance Computing and Communications (hereafter PanQiao).

Regarding claim 7:
Zhou as modified by Wei teaches “The one or more processors of claim 1.”
Wei further teaches “wherein the textual data represents a textual document”: Wei, section 4.1, paragraph 1, “MS-COCO [23] is one of the most popular dataset for the image and sentence matching task. It contains 123287 images, and each image is annotated with five text descriptions [wherein the textual data represents a textual document].”
Zhou as modified by Wei does not explicitly teach “and wherein an output of the one or more neural networks comprises a classification of a condition depicted in the one or more images and described in the textual document.”
PanQiao teaches “and wherein an output of the one or more neural networks comprises a classification of a condition depicted in the one or more images and described in the textual document”: PanQiao, section I, “Character-based feature representation method is used to fully capture the complicated matching relations between image and sentence and are fully captured in our proposed f-CNN. We validate the effectiveness of f-CNNs on the classification tasks for breast cancer [output of the one or more neural networks comprises a classification of a condition], and demonstrate that f-CNNs can achieve performances superior to the state-of-the-art approaches by letting image and the composed fragments of the sentence meet and interact at different levels”; PanQiao, section III. A., “Our task is to extract and mine the semantic interactions of radiology images and reports [a classification of a condition depicted in the one or more images and described in the textual document], and generate fusion feature vector with better performance. We will train our model on a training set of N images and N corresponding sentences that describe their content (Figure 1).”
PanQiao and Zhou as modified by Wei are analogous arts as both are related to the analysis of image-sentence pairs. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the medical condition classification of PanQiao with the teachings of Zhou as modified by Wei to arrive at the present invention, in order to leverage machine learning to improve medical treatment, as stated in PanQiao, section I, “The efficient analysis of images and electronic health records (EHRs) is significant for improving reliability of laboratory results, which is used to help assessing disease risks and monitoring the treatment.”

Regarding claim 8 and analogous claim 15:
Zhou as modified by Wei teaches “The one or more processors of claim 1.”
Zhou as modified by Wei does not explicitly teach “wherein an output of the one or more neural networks comprises information identifying a condition depicted in the one or more images.”
PanQiao teaches “wherein an output of the one or more neural networks comprises information identifying a condition depicted in the one or more images”: PanQiao, section I, “Character-based feature representation method is used to fully capture the complicated matching relations between image and sentence and are fully captured in our proposed f-CNN. We validate the effectiveness of f-CNNs on the classification tasks for breast cancer [information identifying a condition], and demonstrate that f-CNNs can achieve performances superior to the state-of-the-art approaches by letting image and the composed fragments of the sentence meet and interact at different levels”; PanQiao, section III. A., “Our task is to extract and mine the semantic interactions of radiology images and reports [output of the one or more neural networks comprises information identifying a condition depicted in the one or more images], and generate fusion feature vector with better performance. We will train our model on a training set of N images and N corresponding sentences that describe their content (Figure 1).”
PanQiao and Zhou as modified by Wei are analogous arts as both are related to the analysis of image-sentence pairs. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the medical condition classification of PanQiao with the teachings of Zhou as modified by Wei to arrive at the present invention, in order to leverage machine learning to improve medical treatment, as stated in PanQiao, section I, “The efficient analysis of images and electronic health records (EHRs) is significant for improving reliability of laboratory results, which is used to help assessing disease risks and monitoring the treatment.”

Regarding claim 16:
Zhou as modified by Wei teaches “The one or more processors of claim 1.”
Zhou as modified by Wei does not explicitly teach “wherein the one or more images comprise a diagnostic image and the textual data comprises a diagnostic report corresponding to the diagnostic image.”
PanQiao teaches “wherein the one or more images comprise a diagnostic image and the textual data comprises a diagnostic report corresponding to the diagnostic image”: PanQiao, Section V. A., “The experimental datasets in this paper came from a major hospital in Shanghai, including medical documents of clinical breast cases and related mammography image [the one or more images comprise a diagnostic image]. The document report mainly consists of two parts: one is basic personal information of the patient, the other is the description of x-ray images, the diagnostic opinion of the doctor and the result of pathologic diagnosis [textual data comprises a diagnostic report corresponding to the diagnostic image].”
PanQiao and Zhou as modified by Wei are analogous arts as both are related to the analysis of image-sentence pairs. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the medical condition classification of PanQiao with the teachings of Zhou as modified by Wei to arrive at the present invention, in order to leverage machine learning to improve medical treatment, as stated in PanQiao, section I, “The efficient analysis of images and electronic health records (EHRs) is significant for improving reliability of laboratory results, which is used to help assessing disease risks and monitoring the treatment.”

Regarding claim 17:
Zhou teaches:
(bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects”: Zhou, section 2, paragraph 3, “For a given image, let fk(x, y) represent the activation of unit k in the last convolutional layer at spatial location (x, y). Then, for unit k, the result of performing global average pooling, Fk is Σx, y fk(x, y). Thus, for a given class c, the input to the softmax, Sc, is Σk wck Fk where wck is the weight corresponding to class c for unit k. Essentially, wck indicates the importance of Fk for class c [determine a classification of the one or more detected objects using … one or more neural networks and based on the one or more images … corresponding to the one or more detected objects].”
“generate a saliency map, indicating locations associated with the one or more detected objects, based on the classification”: Zhou, section 2, paragraph 1, “In this section, we describe the procedure for generating class activation maps (CAM) using global average pooling (GAP) in CNNs. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category (e.g., Fig. 3) [generate a saliency map, indicating locations associated with the one or more detected objects, based on the classification].”
Zhou does not explicitly teach:
“One or more processors, comprising:  circuitry to use one or more neural networks to infer a condition of one or more objects in one or more images using a decoder of the one or more neural networks”
(bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects”
Wei teaches:
(bold only) “One or more processors, comprising:  circuitry to use one or more neural networks to”: Wei, section 4.2, paragraph 1, “The proposed Multi-Modality Cross-Attention Network is implemented in PyTorch framework [27] with a NVIDIA GeForce GTX 2080Ti GPU [One or more processors, comprising circuitry].”
(bold only) “determine a classification of the one or more detected objects using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the one or more detected objects”: Wei, section 3.1, paragraph 1, “As shown in Figure 2, our Multi-Modality Cross Attention Network mainly consists of two modules, the self-attention module and the cross-attention module, demonstrated in the green dashed blocks and red dash dashed block in Figure 2, respectively. Given an image and sentence pair, we first feed the image into the bottom-up attention model [32] pre-trained on Visual Genome [20] to extract features for image regions. Meanwhile, we use Word-Piece tokens of each sentence as the fragments in the textual modality. Based on these extracted fine-grained representations for image regions and sentence words [one or more images and textual data], we model the intra-modality relationship with the Self-Attention Module, and adopt the Cross-Attention Module [cross-attention] to model the intermodality and intra-modality relationships for image regions and sentence words. […] As shown in Figure 2, we get two pairs of embeddings for the given image-sentence pair (i0, c0) and (i1, c1), which are used for image and sentence matching [a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data].”
Wei and Zhou are analogous arts as they are both related to image analysis. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the multi-modal cross-attention of Wei with the teachings of Zhou to arrive at the present invention, in order to improve model performance, as stated in Wei, Abstract, “In the proposed MM-CA, we design a novel cross-attention mechanism, which is able to exploit not only the intra-modality relationship within each modality, but also the inter-modality relationship between image regions and sentence words to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks including Flickr30K and MS-COCO demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.”
PanQiao teaches “use one or more neural networks to infer a condition of one or more objects in one or more images using a decoder of the one or more neural networks”: PanQiao, section I, “Character-based feature representation method is used to fully capture the complicated matching relations between image and sentence and are fully captured in our proposed f-CNN [using a decoder of the one or more neural networks]. We validate the effectiveness of f-CNNs on the classification tasks for breast cancer [a condition], and demonstrate that f-CNNs can achieve performances superior to the state-of-the-art approaches by letting image and the composed fragments of the sentence meet and interact at different levels”; PanQiao, section III. A., “Our task is to extract and mine the semantic interactions of radiology images and reports [use one or more neural networks to infer a condition of one or more objects in one or more images], and generate fusion feature vector with better performance. We will train our model on a training set of N images and N corresponding sentences that describe their content (Figure 1).”
PanQiao and Zhou as modified by Wei are analogous arts as both are related to the analysis of image-sentence pairs. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the medical condition classification of PanQiao with the teachings of Zhou as modified by Wei to arrive at the present invention, in order to leverage machine learning to improve medical treatment, as stated in PanQiao, section I, “The efficient analysis of images and electronic health records (EHRs) is significant for improving reliability of laboratory results, which is used to help assessing disease risks and monitoring the treatment.”

Regarding claim 18:
Zhou as modified by Wei and PanQiao teaches “The one or more processors of claim 17.”
Wei further teaches “wherein a first portion of the one or more neural networks is trained to encode features of the one or more images and a second portion of the one or more neural networks is trained to encode features of the textual data”:  Wei, Fig. 2, 

[Figure (media_image4.png): Wei, Fig. 2]

[showing the upper portion of the network training on image data and the lower portion training on textual data, combining to create embeddings, hence, wherein a first portion of the one or more neural networks is trained to encode features of the one or more images and a second portion of the one or more neural networks is trained to encode features of the textual data].
Wei and Zhou are combinable for the rationale given under claim 17.

Regarding claim 19:
Zhou as modified by Wei and PanQiao teaches “The one or more processors of claim 18.”
Wei further teaches “wherein the first portion of the one or more neural networks, and the second portion of the one or more neural networks, encode their respective inputs to a common latent space”: Wei, Fig. 2, 

[Figure (media_image4.png): Wei, Fig. 2]

[showing the upper portion of the network training on image data and the lower portion training on textual data, combining to create embeddings, hence, wherein the first portion of the one or more neural networks, and the second portion of the one or more neural networks, encode their respective inputs to a common latent space].
Wei and Zhou are combinable for the rationale given under claim 17.
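For illustration only, encoding two modalities to a common latent space and matching them there can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce Wei's network: the projection matrices, the feature vectors, and the use of cosine similarity as the matching score are all hypothetical choices made for the example.

```python
import math

def project(vec, weight):
    """Apply a linear map (given as a list of output rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def cosine(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical encoders: 3-dim image features and 2-dim text features
# are both mapped into the same 2-dim latent space.
W_IMAGE = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
W_TEXT = [[1.0, 0.0], [0.0, 2.0]]

image_embedding = project([2.0, 0.0, 0.0], W_IMAGE)
matching_text = project([1.0, 0.0], W_TEXT)
other_text = project([0.0, 1.0], W_TEXT)

match_score = cosine(image_embedding, matching_text)
mismatch_score = cosine(image_embedding, other_text)
```

Because both branches emit vectors of the same dimension, an image embedding and a sentence embedding can be compared directly, which is what makes image and sentence matching possible in a shared space.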

Regarding claim 20:
Zhou as modified by Wei and PanQiao teaches “The one or more processors of claim 17.”
Wei further teaches “wherein the one or more neural networks are is trained based, at least in part, on output of a cross-attention encoder using, as input to the cross-attention encoder, output of an image encoder and output of a language encoder”: Wei, section 3.1, “As shown in Figure 2, our Multi-Modality Cross Attention Network mainly consists of two modules, the self attention module and the cross-attention module, demonstrated in the green dashed blocks and red dash dashed block in Figure 2, respectively. Given an image and sentence pair, we first feed the image into the bottom-up attention model [32] [output of an image encoder]  pre-trained on Visual Genome [20] to extract features for image regions. Meanwhile, we use Word-Piece tokens of each sentence as the fragments in the textual modality [output of a language encoder]. Based on these extracted fine-grained representations for image regions and sentence words, we model the intra-modality relationship with the Self-Attention Module, and adopt the Cross-Attention Module [a cross-attention encoder] to model the inter-modality and intra-modality relationships for image regions and sentence words”; Wei, section 4.2, “In the self-attention module, for the image branch, the image region feature vector extracted by a bottom-up attention [1] is 2048-dimensional, and we add a fully-connect layer to transform it to a d-dimensional vector before feeding them into a Transformer unit with 16 heads. As for the textual data  in the self-attention module, we use the pretrained BERT model [4] including 12 self-attention layers, 12 heads, 768 hidden units for each token. For simplicity, the weights of BERT model is fixed during the training stage. In the 1-dim convolution neural networks, we use 256 filters for each filter size. In the cross-attention module [a cross-attention encoder], we apply a Transformer unit with 16 heads for implementation. 
The model is trained [the neural network is trained] for 20 epochs with the Adam optimizer [17].”
Wei and Zhou are combinable for the rationale given under claim 17. 

Regarding claim 21:
Zhou as modified by Wei and PanQiao teaches “The one or more processors of claim 17.”
PanQiao further teaches “wherein the one or more images comprise diagnostic images and the textual data comprises diagnostic reports corresponding to the diagnostic images”: PanQiao, Section V. A., “The experimental datasets in this paper came from a major hospital in Shanghai, including medical documents of clinical breast cases and related mammography image [the one or more images comprise diagnostic images]. The document report mainly consists of two parts: one is basic personal information of the patient, the other is the description of x-ray images, the diagnostic opinion of the doctor and the result of pathologic diagnosis [textual data comprises diagnostic reports corresponding to the diagnostic images].”
PanQiao and Zhou as modified by Wei are combinable for the rationale given under claim 17. 

Regarding claim 22:
Zhou as modified by Wei and PanQiao teaches “The one or more processors of claim 17.”
Zhou further teaches (bold only) “wherein the inferred condition comprises information indicative of an area of interest in the one or more images”: Zhou, section 2, paragraph 1, “In this section, we describe the procedure for generating class activation maps (CAM) using global average pooling (GAP) in CNNs. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category (e.g., Fig. 3) [wherein the inferred condition comprises information indicative of an area of interest in the one or more images].”
PanQiao further teaches (bold only) “wherein the inferred condition comprises information indicative of an area of interest in the one or more images”: PanQiao, section III. A., “Our task is to extract and mine the semantic interactions of radiology images and reports [inferred condition], and generate fusion feature vector with better performance. We will train our model on a training set of N images and N corresponding sentences that describe their content (Figure 1)”;  PanQiao, section III. C. 1, “Moreover, as most convolutional models [25], we consider the convolution unit with a local ‘receptive field’ [an area of interest in the one or more images] and shared weights to adequately model the rich structures for word composition and intermodal interaction.”
PanQiao and Zhou as modified by Wei are combinable for the rationale given under claim 17. 

Regarding claim 23:
Zhou as modified by Wei and PanQiao teaches “The one or more processors of claim 17.”
Wei further teaches:
“wherein a first portion of the one or more neural networks is trained to encode features of the one or more images and a second portion of the one or more neural networks is trained to encode features of the textual data”: Wei, Fig. 2, 

[Figure (media_image4.png): Wei, Fig. 2]

[showing the upper portion of the network training on image data and the lower portion training on textual data, combining to create embeddings, hence, wherein a first portion of the one or more neural networks is trained to encode features of the one or more images and a second portion of the one or more neural networks is trained to encode features of the textual data].
“and wherein the first portion of the one or more neural networks, after training, is capable of inferring the condition independently of the second portion.”: Wei, Fig. 2, 
[Figure (media_image6.png): Wei, Fig. 2]

[showing that the first portion of the neural network, top, through its self-attention module, can infer the information independently from the combination of first and second portions used in the cross-attention module]; Wei, section 3.1, “As shown in Figure 2, we get two pairs of embeddings for the given image-sentence pair (i0, c0) and (i1, c1), which are used for image and sentence matching.”
Wei and Zhou are combinable for the rationale given under claim 17.

Regarding claim 24:
Zhou teaches:
“A method, comprising: using one or more neural networks to”: Zhou, section 2, paragraph 1, “In this section, we describe the procedure for generating class activation maps (CAM) using global average pooling (GAP) in CNNs [A method, comprising: using one or more neural networks to]. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category (e.g., Fig. 3).”
(bold only) “determine a classification of the condition using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the condition”: Zhou, section 2, paragraph 3, “For a given image, let fk(x, y) represent the activation of unit k in the last convolutional layer at spatial location (x, y). Then, for unit k, the result of performing global average pooling, Fk is Σx, y fk(x, y). Thus, for a given class c, the input to the softmax, Sc, is Σk wck Fk where wck is the weight corresponding to class c for unit k. Essentially, wck indicates the importance of Fk for class c [determine a classification … using … the one or more neural networks and based on the one or more images].”
(bold only) “and generate a saliency map, indicating locations associated with the condition, based on the classification and a prediction of one or more portions of a set of diagnostic reports, wherein the prediction uses the one or more images and the textual data”: Zhou, section 2, paragraph 1, “In this section, we describe the procedure for generating class activation maps (CAM) using global average pooling (GAP) in CNNs. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category (e.g., Fig. 3) [and generate a saliency map, indicating locations …, based on the classification and a prediction of one or more portions of a set of … reports, wherein the prediction uses the one or more images].”
Zhou does not explicitly teach:
“diagnose a condition depicted in a diagnostic image using a decoder of the one or more neural networks”
(bold only) “determine a classification of the condition using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the condition”
(bold only) “and generate a saliency map, indicating locations associated with the condition, based on the classification and a prediction of one or more portions of a set of diagnostic reports, wherein the prediction uses the one or more images and the textual data”
Wei teaches (bold only) “determine a classification of the condition using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the condition”: Wei, section 3.1, paragraph 1, “As shown in Figure 2, our Multi-Modality Cross Attention Network mainly consists of two modules, the self-attention module and the cross-attention module, demonstrated in the green dashed blocks and red dash dashed block in Figure 2, respectively. Given an image and sentence pair, we first feed the image into the bottom-up attention model [32] pre-trained on Visual Genome [20] to extract features for image regions. Meanwhile, we use Word-Piece tokens of each sentence as the fragments in the textual modality. Based on these extracted fine-grained representations for image regions and sentence words, we model the intra-modality relationship with the Self-Attention Module, and adopt the Cross-Attention Module [cross-attention] to model the intermodality and intra-modality relationships for image regions and sentence words. […] As shown in Figure 2, we get two pairs of embeddings for the given image-sentence pair (i0, c0) and (i1, c1), which are used for image and sentence matching [a cross-attention encoder].”
Wei and Zhou are analogous arts as they are both related to image analysis. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the multi-modal cross-attention of Wei with the teachings of Zhou to arrive at the present invention, in order to improve model performance, as stated in Wei, Abstract, “In the proposed MM-CA, we design a novel cross-attention mechanism, which is able to exploit not only the intra-modality relationship within each modality, but also the inter-modality relationship between image regions and sentence words to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks including Flickr30K and MS-COCO demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.”
PanQiao teaches “diagnose a condition depicted in a diagnostic image using a decoder of the one or more neural networks,” (bold only) “determine a classification of the condition using a cross-attention encoder of the one or more neural networks and based on the one or more images and textual data corresponding to the condition,” and (bold only) “and generate a saliency map, indicating locations associated with the condition, based on the classification and a prediction of one or more portions of a set of diagnostic reports, wherein the prediction uses the one or more images and the textual data”: PanQiao, section 1, paragraph 6, “In this paper, we propose a novel fusion convolutional neural network (f-CNN) framework [using a decoder of the one or more neural networks] to make full use of the image and text resources. Trained on a set of image and sentence pairs, the proposed f-CNNs are able to mine the semantic interactions of radiology images and reports [textual data corresponding to the condition] and mutually enhance feature learning process [diagnose a condition depicted in a diagnostic image]”; PanQiao, Section V. A., “The experimental datasets in this paper came from a major hospital in Shanghai, including medical documents of clinical breast cases and related mammography image [associated with the condition] [of the condition]. The document report mainly consists of two parts: one is basic personal information of the patient, the other is the description of x-ray images, the diagnostic opinion of the doctor and the result of pathologic diagnosis [portions of a set of diagnostic reports].”
PanQiao and Zhou as modified by Wei are analogous arts as both are related to the analysis of image-sentence pairs. It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to have combined the medical condition classification of PanQiao with the teachings of Zhou as modified by Wei to arrive at the present invention, in order to leverage machine learning to improve medical treatment, as stated in PanQiao, section I, “The efficient analysis of images and electronic health records (EHRs) is significant for improving reliability of laboratory results, which is used to help assessing disease risks and monitoring the treatment.”

Regarding claim 25:
Zhou as modified by Wei and PanQiao teaches “The method of claim 24.”
Wei further teaches:
“wherein a first portion of the one or more neural networks is trained in parallel with a second portion of the one or more neural networks”: Wei, Fig. 2,

[Image: media_image4.png, Wei, Fig. 2]

[showing the upper portion of the network training on image data and the lower portion training on textual data, combining to create embeddings, hence, wherein a first portion of the one or more neural networks is trained in parallel with a second portion of the one or more neural networks].
(bold only) “and wherein the second portion of the one or more neural networks is trained to encode features of the set of diagnostic reports”: Wei, Fig. 2,

[Image: media_image4.png, Wei, Fig. 2]

[showing the lower portion training on textual data, used to create embeddings, hence, and wherein the second portion of the one or more neural networks is trained to encode features of the set of … reports].
Wei and Zhou are combinable for the rationale given under claim 24.
PanQiao teaches (bold only) “and wherein the second portion of the one or more neural networks is trained to encode features of the set of diagnostic reports”: PanQiao, Section V. A., “The experimental datasets in this paper came from a major hospital in Shanghai, including medical documents of clinical breast cases and related mammography image. The document report mainly consists of two parts: one is basic personal information of the patient, the other is the description of x-ray images, the diagnostic opinion of the doctor and the result of pathologic diagnosis [the set of diagnostic reports].”
PanQiao and Zhou as modified by Wei are combinable for the rationale given under claim 24.

Regarding claim 26:
Zhou as modified by Wei and PanQiao teaches “The method of claim 25.”
Wei further teaches “wherein the first and second portions of the one or more neural networks are trained to encode features of the one or more images and the textual data to a shared latent space”: Wei, Fig. 2,

[Image: media_image4.png, Wei, Fig. 2]

[showing the upper portion of the network training on image data and the lower portion training on textual data, combining to create embeddings, hence, wherein a first portion and a second portion of the one or more neural networks are trained in parallel to encode features of the one or more images and the textual data to a shared latent space].
Wei and Zhou are combinable for the rationale given under claim 24.

Regarding claim 27:
Zhou as modified by Wei and PanQiao teaches “The method of claim 24.”
Wei further teaches “providing, as input to a cross-attention encoder, a query input comprising output from a language encoder, and key and value input comprising output from an image encoder”: Wei, section 3.3, “for image I with the fine grained representation R = {r1, r2, ..., rk} [showing that R is an output representation from an image encoder] […] The BERT consists of multiple Transformer units, and its output E = {e1, e2, ..., en} naturally includes the intra-modality information [showing that E includes an output representation from a language encoder]”; Wei, section 3.4, “In this section, we introduce how to model both the inter-modality and intra-modality relationships in a unified model with our Cross-Attention Module [a cross-attention encoder] […] Here, the query [a query input comprising output from a language encoder], key and value [key and value input comprising output from an image encoder] for the fragments are formed with the following equations:

[Image: media_image5.png, the query, key, and value equations of Wei, section 3.4]
”
Wei and Zhou are combinable for the rationale given under claim 24.
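The arrangement recited in claim 27 (query input taken from a language encoder, key and value input taken from an image encoder) can be illustrated with a minimal sketch. This is not a reproduction of Wei's equations, which appear only in the image cited above; it assumes the generic scaled dot-product form of attention. The names R and E follow the notation quoted from Wei, section 3.3, while `cross_attention`, `W_q`, `W_k`, `W_v`, and the toy values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(text_feats, image_feats, W_q, W_k, W_v):
    """Assumed scaled dot-product cross-attention.

    Queries are projected from language-encoder output; keys and values
    are projected from image-encoder output.
    """
    queries = [matvec(W_q, e) for e in text_feats]
    keys = [matvec(W_k, r) for r in image_feats]
    values = [matvec(W_v, r) for r in image_feats]
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        attn = softmax([dot(q, k) / scale for k in keys])
        weights.append(attn)
        # Each output is the attention-weighted sum of the image values.
        outputs.append([
            sum(a * v[j] for a, v in zip(attn, values))
            for j in range(len(values[0]))
        ])
    return outputs, weights


# Hypothetical toy data: two word features (E) and three region features (R).
identity = [[1.0, 0.0], [0.0, 1.0]]
E = [[1.0, 0.0], [0.0, 1.0]]
R = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = cross_attention(E, R, identity, identity, identity)
```

Each row of `attn` holds one word's weights over the image regions, so the output for a word is a mixture of region features rather than of other words.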

Regarding claim 28:
Zhou as modified by Wei and PanQiao teaches “The method of claim 24.”
Wei further teaches (bold only) “training a language encoder of the one or more neural networks to encode features of the diagnostic reports to a latent space shared with output of an image encoder”: Wei, Fig. 2,

[Image: media_image4.png, Wei, Fig. 2]

[showing the upper portion of the network training on image data and the lower portion training on textual data, combining to create embeddings, hence, training a language encoder of the one or more neural networks to encode features of the … reports to a latent space shared with output of an image encoder].
Wei and Zhou are combinable for the rationale given under claim 24.
PanQiao teaches (bold only) “training a language encoder of the one or more neural networks to encode features of the diagnostic reports to a latent space shared with output of an image encoder”: PanQiao, Section V. A., “The experimental datasets in this paper came from a major hospital in Shanghai, including medical documents of clinical breast cases and related mammography image. The document report mainly consists of two parts: one is basic personal information of the patient, the other is the description of x-ray images, the diagnostic opinion of the doctor and the result of pathologic diagnosis [the diagnostic reports].”
PanQiao and Zhou as modified by Wei are combinable for the rationale given under claim 24.

Regarding claim 29:
Zhou as modified by Wei and PanQiao teaches “The method of claim 24.”
PanQiao further teaches “decoding output of an encoder to generate information summarizing the condition”: PanQiao, section III. A., “Our task is to extract and mine the semantic interactions of radiology images and reports, and generate fusion feature vector with better performance. We will train our model on a training set of N images and N corresponding sentences that describe their content (Figure 1). Given this set of correspondences, we train the weights of a neural network to output a high score when a compatible image-sentence pair is fed through the network, and low score otherwise [decoding output of an encoder to generate information summarizing the condition].”
PanQiao and Zhou as modified by Wei are combinable for the rationale given under claim 24.

Regarding claim 30:
Zhou as modified by Wei and PanQiao teaches “The method of claim 24.”
Zhou further teaches (bold only) “wherein the one or more neural networks comprises a decoder to generate information indicative of a region in the diagnostic image that depicts the condition”: Zhou, section 2, paragraph 1, “In this section, we describe the procedure for generating class activation maps (CAM) using global average pooling (GAP) in CNNs. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category (e.g., Fig. 3) [wherein the one or more neural networks comprises a decoder to generate information indicative of a region in the … image].”
PanQiao further teaches (bold only) “wherein the one or more neural networks comprises a decoder to generate information indicative of a region in the diagnostic image that depicts the condition”: PanQiao, Section V. A., “The experimental datasets in this paper came from a major hospital in Shanghai, including medical documents of clinical breast cases and related mammography image [the diagnostic image that depicts the condition]. The document report mainly consists of two parts: one is basic personal information of the patient, the other is the description of x-ray images, the diagnostic opinion of the doctor and the result of pathologic diagnosis.”
PanQiao and Zhou as modified by Wei are combinable for the rationale given under claim 24.
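The class activation map quoted from Zhou, section 2, can be sketched as a weighted sum of the final convolutional feature maps, using the weights that connect the global-average-pooled features to a given class. The sketch below is illustrative only: the function names and the toy 2 x 2 feature maps are hypothetical, the bias term is omitted, and the pooling is written as a spatial mean.

```python
def class_activation_map(feature_maps, class_weights):
    """Compute M_c(x, y) = sum over k of w_k^c * f_k(x, y)."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(class_weights, feature_maps):
        for y in range(height):
            for x in range(width):
                cam[y][x] += w * fmap[y][x]
    return cam


def gap_class_score(feature_maps, class_weights):
    """Class score from global average pooling followed by a linear layer."""
    score = 0.0
    for w, fmap in zip(class_weights, feature_maps):
        cells = len(fmap) * len(fmap[0])
        score += w * sum(sum(row) for row in fmap) / cells
    return score


# Hypothetical data: two 2 x 2 feature maps and the weights for one class.
feature_maps = [
    [[0.0, 1.0], [0.0, 3.0]],
    [[2.0, 0.0], [2.0, 0.0]],
]
weights = [1.0, -0.5]

cam = class_activation_map(feature_maps, weights)
score = gap_class_score(feature_maps, weights)
cam_mean = sum(sum(row) for row in cam) / 4
```

Under these assumptions the class score equals the spatial mean of the map, which is why large values in the map mark the regions that contribute most to that class.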

Regarding claim 31:
Zhou as modified by Wei and PanQiao teaches “The method of claim 24.”
Zhou further teaches (bold only) “wherein diagnoses of the condition comprises identifying one or more categories of conditions determined, by the one or more neural networks, to be associated with a region of the diagnostic image”: Zhou, section 2, paragraph 1, “In this section, we describe the procedure for generating class activation maps (CAM) using global average pooling (GAP) in CNNs. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category (e.g., Fig. 3) [determined, by the one or more neural networks, to be associated with a region of the … image].”
PanQiao further teaches (bold only) “wherein diagnoses of the condition comprises identifying one or more categories of conditions determined, by the one or more neural networks, to be associated with a region of the diagnostic image”: PanQiao, section III. A., “Our task is to extract and mine the semantic interactions of radiology images and reports, and generate fusion feature vector with better performance. We will train our model on a training set of N images and N corresponding sentences that describe their content (Figure 1). Given this set of correspondences, we train the weights of a neural network to output a high score when a compatible image-sentence pair is fed through the network, and low score otherwise. Once the training is complete, the evaluation will score all image-sentence pairs, sort images/sentences in order of decreasing score and record the location of a ground truth result in the list [identifying one or more categories of conditions determined, by the one or more neural networks, to be associated with a region of the diagnostic image].”
PanQiao and Zhou as modified by Wei are combinable for the rationale given under claim 24.
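The evaluation procedure quoted from PanQiao, section III. A. (score all image-sentence pairs, sort in order of decreasing score, record the location of the ground truth result) can be sketched as follows. The score matrix is invented for illustration, and the `recall_at_k` summary is an assumed, commonly used retrieval metric rather than something stated in the quoted passage.

```python
def rank_of_ground_truth(scores, truth_index):
    """Return the 1-based position of the ground-truth candidate
    after sorting candidates by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(truth_index) + 1


def recall_at_k(ranks, k):
    """Fraction of queries whose ground truth appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical scores: row i is an image, column j is a sentence,
# and sentence i is the ground-truth match for image i.
score_matrix = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

ranks = [rank_of_ground_truth(row, i) for i, row in enumerate(score_matrix)]
```

Here the second image ranks its true sentence second, behind a higher-scoring mismatch, which is the kind of error the recorded rank is meant to expose.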

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
	Yu et al., “MAttNet: Modular Attention Network for Referring Expression Comprehension,” 2018, arXiv:1801.08186v3, describes a machine-learning method for identifying a region in an image matching a text description, using an attention-based model.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
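The timing rule in the preceding paragraph can be sketched as date arithmetic. This is an illustration under stated assumptions, not a docketing tool: the dates are hypothetical, "within TWO MONTHS" is treated as inclusive, a month-end date with no counterpart in the target month is moved to that month's last day, and adjustments for weekends and federal holidays are not modeled. The function names are invented for this example.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add calendar months, clamping to the last day of the target month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def shortened_period_end(mailing, first_reply=None, advisory_mailed=None):
    """End of the shortened statutory period for a final action.

    Defaults to three months from mailing. If a first reply is filed within
    two months and the advisory action is mailed after the three-month date,
    the period ends on the advisory mailing date, capped at six months.
    """
    three_months = add_months(mailing, 3)
    six_months = add_months(mailing, 6)
    end = three_months
    if (
        first_reply is not None
        and advisory_mailed is not None
        and first_reply <= add_months(mailing, 2)
        and advisory_mailed > three_months
    ):
        end = advisory_mailed
    return min(end, six_months)


# Hypothetical mailing date, used only to exercise the rule.
mailing = date(2026, 1, 15)
default_end = shortened_period_end(mailing)
extended_end = shortened_period_end(mailing, date(2026, 3, 10), date(2026, 5, 1))
```

Extension fees under 37 CFR 1.136(a) run from the date this function returns, so an early first reply can reduce the fee when the advisory action arrives late.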

Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT SPRAUL whose telephone number is (703)756-1511. The examiner can normally be reached M-F 9:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MICHAEL HUNTLEY, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/VAS/ Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/ Supervisory Patent Examiner, Art Unit 2129
Prosecution Timeline

May 01, 2025: Response after Non-Final Action
Jul 30, 2025: Non-Final Rejection mailed (§101, §103, §112)
Oct 23, 2025: Applicant Interview (Telephonic)
Oct 23, 2025: Examiner Interview Summary
Oct 30, 2025: Response Filed
Nov 24, 2025: Final Rejection mailed (§101, §103, §112)
Apr 07, 2026: Request for Continued Examination
Apr 11, 2026: Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

17/163,396 (Patent 12619905): FLEXIBLE EMBEDDING SYSTEMS AND METHODS FOR REAL-TIME COMPARISONS. 5y 3m to grant; granted May 05, 2026.
17/557,599 (Patent 12608446): DETERMINING PERFORMANCE CHANGE WITHIN A DATASET WITH AN APPLIED CONDITION USING MACHINE LEARNING MODELS. 4y 4m to grant; granted Apr 21, 2026.
17/163,383 (Patent 12591634): COMPOSITE EMBEDDING SYSTEMS AND METHODS FOR MULTI-LEVEL GRANULARITY SIMILARITY RELEVANCE SCORING. 5y 2m to grant; granted Mar 31, 2026.
17/249,028 (Patent 12591796): INTELLIGENT DISTANCE PROMPTING. 5y 1m to grant; granted Mar 31, 2026.
17/353,931 (Patent 12572620): RELIABLE INFERENCE OF A MACHINE LEARNING MODEL. 4y 8m to grant; granted Mar 10, 2026.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 60%
With Interview (+26.7%): 86%
Median Time to Grant: 4y 4m (~0m remaining)
PTA Risk: High
Based on 37 resolved cases by this examiner. Grant probability derived from career allowance rate.