Last updated: May 29, 2026

Application No. 18/438,551

ANATOMICALLY AWARE VISION-LANGUAGE MODELS FOR MEDICAL IMAGING ANALYSIS

Non-Final OA §102, §103

Filed

Feb 12, 2024

Examiner

ZHANG, WAYNE

Art Unit

2672

Tech Center

2600 — Communications

Assignee

Siemens Healthineers AG

OA Round

1 (Non-Final)

This examiner grants 59% of resolved cases, with a +31.6% interview lift. A telephonic interview to clarify the technical implementation could significantly improve the outcome.

Based on 22 resolved cases, 2023–2026

Examiner Intelligence

ZHANG, WAYNE

Grants 59% of resolved cases

Career Allowance Rate

13 granted / 22 resolved

-2.9% vs TC avg

Strong +32% interview lift

Typical timeline

2y 11m

Avg Prosecution

12 currently pending

Statute-Specific Performance

§103

91.4%

+51.4% vs TC avg

§112

8.6%

-31.4% vs TC avg

Based on career data from 22 resolved cases

Office Action

§102 §103

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The IDS dated 2/12/2025 has been considered and placed in the application file.  

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f):
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f). The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f), except as otherwise indicated. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f), except as otherwise indicated in this Office action.
Such claim limitation(s) is/are:
“means for receiving one or more input medical images; 
means for extracting image embeddings from the one or more input medical images; 
means for performing one or more medical imaging analysis tasks based on the image embeddings extracted from the one or more input medical images using a trained vision-language model; 
and means for outputting results of the one or more medical imaging analysis tasks” in claim 8.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f), they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f), applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f).


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-3, 5-10, 12-19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Oktay (EP4266195A1).
Regarding claim 1, Oktay discloses a computer-implemented method comprising: receiving one or more input medical images (Oktay, paragraph [0082], "At step R10, an input image is received into the input of the image encoder 402, and at step R20 an input text query is input into the input of the text encoder 404"),
extracting image embeddings from the one or more input medical images (Oktay, paragraph [0085], "At step R30 the image encoder 402 generates the local image embeddings Ṽ, for the input image 202"),
performing one or more medical imaging analysis tasks based on the image embeddings extracted from the one or more input medical images using a trained vision-language model (Oktay, paragraph [0090], "As separate text embedding has been generated from each of three sentences in the draft report, each making a separate proposition about a possible condition visible in the image"),
and outputting results of the one or more medical imaging analysis tasks (Oktay, paragraph [0089], "For example the UI may display the image on screen, and highlight the region or regions in question, or drawing an outline around the region or regions"),
wherein the trained vision-language model is trained by: receiving one or more training medical images and a text-based report associated with the one or more training medical images (Oktay, paragraph [0068], "In the joint training, the ML engine 118 inputs each of a plurality of image-text combinations into the ML model 116"),
extracting image embeddings from the one or more training medical images (Oktay, paragraph [0039], "The image encoder 402 is arranged to encode each input image 202 into at least one respective embedding"),
generating one or more instructions based on the text-based report using a language model (Oktay, paragraph [0040], "The text encoder 404 is arranged to encode each of one or more text portions in the input text 204, e.g. each clause, sentence, paragraph or whole document, into a respective text embedding t̃"),
and training the vision-language model to perform the one or more medical imaging analysis tasks based on the image embeddings extracted from the one or more training medical images and the one or more generated instructions (Oktay, paragraph [0098], "At step S40) the image data of the one or more current scans 202 are input into the machine learning model 116, together with at least the text of the corresponding report 204. Based on the report and image(s) together, the machine learning model 116 generates one or more suggestions for amendments 206 to the text of the report 204").

Regarding claim 2, Oktay discloses the computer-implemented method of claim 1, wherein generating one or more instructions based on the text-based report using a language model comprises: generating the one or more instructions further based on a plurality of predefined templates, the plurality of predefined templates comprising different initial instructions for extracting information from the text-based report and generating the one or more instructions (Oktay, paragraph [0059], "The image embedding captures a kind of "summary" of the visually relevant information content in the image data, and the text embedding captures the semantically relevant information in the text. The concept of an embedding or latent vector will, in itself, be familiar to a person skilled in the art"; an embedding is a form of information extraction that captures data according to patterns, as additionally evidenced by the Wikipedia excerpt below).

    [Image: media_image1.png, Wikipedia excerpt referenced above]


Regarding claim 3, Oktay discloses the computer-implemented method of claim 1, wherein generating one or more instructions based on the text-based report using a language model comprises: generating the one or more instructions for associating anatomical features depicted in the one or more training medical images with textual anatomical descriptors (Oktay, paragraph [0090], "Each of the regions in the image found to align with one of the sentences in the text query is outlined (or instead could be highlighted or such like)").

Regarding claim 5, Oktay discloses the computer-implemented method of claim 1, wherein generating one or more instructions based on the text-based report using a language model comprises: generating the one or more instructions for associating textual anatomical descriptors with image findings of the one or more training medical images (Oktay, paragraph [0090], Fig. 3a below, "As separate text embedding has been generated from each of three sentences in the draft report, each making a separate proposition about a possible condition visible in the image").

    [Image: media_image2.png, Oktay Fig. 3a]


Regarding claim 6, Oktay discloses the computer-implemented method of claim 5, wherein the image findings comprise quantitative image findings of the one or more input medical images (Oktay, paragraph [0090], Fig. 3a above, "As separate text embedding has been generated from each of three sentences in the draft report, each making a separate proposition about a possible condition visible in the image").

Regarding claim 7, Oktay discloses the computer-implemented method of claim 1, wherein: generating one or more instructions based on the text-based report using a language model comprises: generating instruction embeddings representing the one or more instructions (Oktay, paragraph [0086], "At step R40 the text encoder 404 generates at least one text embedding t̃ from the input text query 204. In embodiments the input query may be broken down into more than one portion, such as separate sentences, clauses, paragraphs or sections, and the text encoder 404 may generate an individual text embedding t̃ for each such"),
and training the vision-language model to perform the one or more medical imaging analysis tasks based on the image embeddings extracted from the one or more training medical images and the one or more generated instructions comprises: combining the image embeddings extracted from the one or more training medical images and the instruction embeddings (Oktay, paragraph [0088], Fig. 2b below R50, "At step R50, the ML engine 118 performs a comparison between the local image embeddings generated at step R30 and the one or more text embeddings generated at step D40"), 

    [Image: media_image3.png, Oktay Fig. 2b]

and generating results of the one or more medical imaging analysis tasks based on the combined image embeddings and instruction embeddings (Oktay, paragraph [0089], "At step R60, the ML engine 118 outputs data which can be used to control the UI on the UI console to render an indication of the connection between the text query 204 and the linked region of the image 202, or between each portion of the text query (e.g. each sentence) and the linked regions of the image").
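For readers mapping the cited steps R30 to R60 onto something concrete, the following is a minimal sketch of comparing local image embeddings against a text embedding and reporting the best-matching region. It is an illustration only: the quoted Oktay passages say that a "comparison" is performed but do not state the metric, so cosine similarity is an assumption here, and the names `cosine`, `best_matching_region`, `patches`, and `sentence` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matching_region(patch_embeddings, text_embedding):
    """Score every image region against one text embedding.

    patch_embeddings maps a region key, e.g. (row, col), to its local embedding.
    Returns the highest-scoring region and the full score map.
    """
    scores = {
        region: cosine(vec, text_embedding)
        for region, vec in patch_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
patches = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 0): [0.7, 0.7]}
sentence = [0.0, 2.0]

region, scores = best_matching_region(patches, sentence)
print(region)  # the region whose embedding points in the same direction as the sentence
```

In a response, a sketch like this can help frame the distinction between comparing precomputed embeddings and training a model on generated instructions, which is the point a telephonic interview would most likely need to address.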

Claims 8-10 correspond to claims 1-3, respectively, additionally reciting an apparatus (Oktay, paragraph [0031], “The computer equipment 104 comprises processing apparatus 108 comprising one or more processors, and memory 110 comprising one or more memory units”). Thus, they are rejected for the same reasons of obviousness.

Claims 12-16 correspond to claims 1-2 and 5-7, respectively, additionally reciting a non-transitory computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out operations (Oktay, paragraph [0032], “Computer storage media include volatile and non-volatile, removable, and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or the like”). Thus, they are rejected for the same reasons of obviousness.

Regarding claim 17, Oktay discloses a computer-implemented method comprising: receiving one or more training medical images and a text-based report associated with the one or more training medical images (Oktay, paragraph [0082], "At step R10, an input image is received into the input of the image encoder 402, and at step R20 an input text query is input into the input of the text encoder 404"),
extracting image embeddings from the one or more training medical images (Oktay, paragraph [0085], “At step R30 the image encoder 402 generates the local image embeddings Ṽ, for the input image 202.”),
generating one or more instructions based on the text-based report using a language model (Oktay, paragraph [0086], “At step R40 the text encoder 404 generates at least one text embedding t̃ from the input text query 204”),
and training a vision-language model to perform one or more medical imaging analysis tasks based on the image embeddings extracted from the one or more training medical images and the one or more generated instructions (Oktay, paragraph [0098], "At step S40) the image data of the one or more current scans 202 are input into the machine learning model 116, together with at least the text of the corresponding report 204. Based on the report and image(s) together, the machine learning model 116 generates one or more suggestions for amendments 206 to the text of the report 204").

Claims 18-19 correspond to claims 2-3, respectively. Thus, they are rejected for the same reasons of obviousness.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4, 11, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Oktay (EP4266195A1) in view of Abdishektaei (US 20250111640 A1).
Regarding claim 4, Oktay discloses the computer-implemented method of claim 1.
Oktay does not teach “wherein generating one or more instructions based on the text-based report using a language model comprises: generating the one or more instructions for associating anatomical features depicted in the one or more training medical images with each other”.
However, Abdishektaei teaches generating the one or more instructions (Abdishektaei, paragraph [0072], "Based on the input image embedding vector, the text decoder 220 generates trial natural language text describing a finding 222 relating to the set of input medical image data 202.") for associating anatomical features depicted in the one or more training medical images with each other (Abdishektaei, paragraph [0027], "As mentioned, at step 108, the method 100 comprises performing a comparison of the feature vector 212 with the plurality of image embedding vectors 206 and identifying a first image embedding vector from among the plurality of image embedding vectors 206 on the basis of the comparison", using the concept of finding a common feature).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to generate instructions by finding common features in Oktay’s images, as taught by Abdishektaei.
The suggestion/motivation for doing so would have been to find a shared area of interest, resulting in more consistent diagnosis.
Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. 
Therefore, it would have been obvious to combine Oktay in view of Abdishektaei to obtain the invention as specified in claim 4.

Claim 11 corresponds to claim 4, additionally reciting the apparatus (Oktay, paragraph [0031], “The computer equipment 104 comprises processing apparatus 108 comprising one or more processors, and memory 110 comprising one or more memory units”). Thus, it is rejected for the same reasons of obviousness.

Claim 20 corresponds to claim 4. Thus, it is rejected for the same reasons of obviousness.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WAYNE ZHANG whose telephone number is (571) 272-0245. The examiner can normally be reached Monday-Friday 10:00-6:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ms. Sumati Lefkowitz, can be reached at (571) 272-3638. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WAYNE ZHANG/Examiner, Art Unit 2672

/SUMATI LEFKOWITZ/Supervisory Patent Examiner, Art Unit 2672

Prosecution Timeline

Feb 12, 2024

Application Filed

Apr 20, 2026

Non-Final Rejection mailed — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/995,033

Patent 12591990

METHOD AND APPARATUS FOR GENERATING SPATIAL GEOMETRIC INFORMATION ESTIMATION MODEL

3y 6m to grant Granted Mar 31, 2026

18/185,102

Patent 12591958

INFRA-RED CONTRAST ENHANCEMENT FILTER

3y 0m to grant Granted Mar 31, 2026

17/919,905

Patent 12561843

METHOD FOR MANAGING IMAGE DATA, AND VEHICLE LIGHTING SYSTEM

3y 4m to grant Granted Feb 24, 2026

17/923,329

Patent 12536629

Image Processing Method and Electronic Device

3y 2m to grant Granted Jan 27, 2026

17/945,100

Patent 12536667

METHOD AND FACILITY FOR SEGMENTATION OF HIGH-CONTRAST OBJECTS IN X-RAY IMAGES

3y 4m to grant Granted Jan 27, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2

Expected OA Rounds

59%

Grant Probability

91%

With Interview (+31.6%)

2y 11m (~8m remaining)

Median Time to Grant

Low

PTA Risk

Based on 22 resolved cases by this examiner. Grant probability derived from career allowance rate.
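The projection figures above can be reproduced with simple arithmetic. The sketch below is one plausible reading, not the page's documented method: it assumes the interview lift is added in percentage points to the career allowance rate, and that "remaining" time is the median prosecution length minus whole months elapsed between the filing date and the page's last-updated date. The function names are hypothetical.

```python
def allowance_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage of resolved cases."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return 100.0 * granted / resolved


def with_interview(base_pct: float, lift_points: float) -> float:
    """Assumed model: the lift is additive in percentage points, capped at 100."""
    return min(100.0, base_pct + lift_points)


def months_between(start_year: int, start_month: int,
                   end_year: int, end_month: int) -> int:
    """Whole calendar months from start to end (day of month ignored)."""
    return (end_year - start_year) * 12 + (end_month - start_month)


base = allowance_rate(13, 22)            # 13 granted / 22 resolved
boosted = with_interview(base, 31.6)     # +31.6 point interview lift

median_months = 2 * 12 + 11              # 2y 11m median time to grant
elapsed = months_between(2024, 2, 2026, 5)  # filed Feb 2024, page updated May 2026
remaining = median_months - elapsed

print(round(base), round(boosted), remaining)  # 59 91 8
```

With only 22 resolved cases behind these numbers, the rounded outputs are best treated as rough indicators rather than precise probabilities.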