Prosecution Insights
Last updated: April 19, 2026
Application No. 18/711,017

METHOD AND SYSTEM FOR ANALYSING MEDICAL IMAGES TO GENERATE A MEDICAL REPORT

Non-Final OA: §103, §112
Filed
May 16, 2024
Examiner
XIAO, DI
Art Unit
2178
Tech Center
2100 — Computer Architecture & Software
Assignee
Eyetelligence Limited
OA Round
1 (Non-Final)
77%
Grant Probability
Favorable
1-2
OA Rounds
3y 4m
To Grant
99%
With Interview

Examiner Intelligence

Grants 77% — above average
77%
Career Allow Rate
463 granted / 600 resolved
+22.2% vs TC avg
Strong +22% interview lift
+21.7%
Interview Lift
based on resolved cases with vs. without an interview
Typical timeline
3y 4m
Avg Prosecution
24 currently pending
Career history
624
Total Applications
across all art units

Statute-Specific Performance

§101: 8.2% (-31.8% vs TC avg)
§103: 57.6% (+17.6% vs TC avg)
§102: 17.1% (-22.9% vs TC avg)
§112: 14.2% (-25.8% vs TC avg)
Tech Center averages are estimates • Based on career data from 600 resolved cases

Office Action

§103 §112
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

1. This action is responsive to communications: Application filed on May 16, 2024, and Drawings filed on May 16, 2024.

2. Claims 1-20 are pending in this case. Claims 1 and 12 are independent claims.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Interpretation - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. - An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

With regard to claims 1 to 11, this application includes one or more claim limitations that do not use the word "means," but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitations use a generic placeholder coupled with functional language without reciting sufficient structure to perform the recited function, and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: "an extractor module for...," "a positional encoder configured to...," and "a text-generation module configured to..." in claim 1.

Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant wishes to provide further explanation or dispute the examiner's interpretation of the corresponding structure, applicant must identify the corresponding structure with reference to the specification by page and line number, and to the drawing, if any, by reference characters in response to this Office action.

If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitations to avoid interpretation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid interpretation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION. - The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for pre-AIA, the applicant) regards as the invention.

Claims 1-11 are interpreted under 35 U.S.C. 112(f). It is unclear whether the recited structure, material, or acts in these claims are sufficient for performing the claimed functions because the specification is unclear about the corresponding structure(s) of the extractor module, the positional encoder, and the text-generation module. A block diagram such as FIG. 1 does not identify corresponding structure(s), and the remainder of the specification likewise does not provide them.

If applicant wishes to have the claim limitations treated under 35 U.S.C. 112(f), applicant may amend the claims so that the phrase "means for" or "step for" or the non-structural term is clearly not modified by sufficient structure, material, or acts for performing the claimed function, or present a sufficient showing that the claim limitation is written as a function to be performed and the claim does not recite sufficient structure, material, or acts for performing the claimed function. If applicant does not wish to have the claim limitations treated under 35 U.S.C. 112(f), applicant may amend the claims so that they clearly do not invoke 35 U.S.C. 112(f), or present a sufficient showing that the claims recite sufficient structure, material, or acts for performing the claimed function to preclude application of 35 U.S.C. 112(f).

With regard to claims 1 and 12, it is unclear what constitutes "a positional encoder configured to provide contextual order to decoder bi-linear multi-head attention layer output." In particular, it is unclear whether the "decoder bi-linear multi-head attention layer output" is the output of the attention layer or the output of the decoder. For the purpose of compact prosecution, it is interpreted as the output of the attention layer.

Claims 2-5 and 13-16 would be allowable if rewritten or amended to overcome the rejection(s) under 35 U.S.C. 112(b).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 10, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Yu, "Multimodal Transformer with Multi-View Visual Representation for Image Captioning" (2015), in view of Huang (CN 113343085 A), and further in view of Jarrel (WO 2022006621 A1).

With regard to claim 1: Yu discloses a system for analysing an image, the system comprising:

an extractor module for extracting image features from the image (the image encoder takes an image as its input and uses a pre-trained Faster-RCNN model to extract region-based visual features; Section B, Multimodal Transformer for Image Captioning: "Based on the preliminary information about the Transformer above, we describe the Multimodal Transformer (MT) architecture for image captioning, which consists of an image encoder and a textual decoder. The image encoder takes an image as its input and uses a pre-trained Faster-RCNN model [35] to extract region-based visual features. The visual features are then fed into the encoder to obtain the attended visual representation with self-attention learning. The decoder takes the attended visual features and the previous word to predict the next word recursively. The flowchart of the MT architecture is shown in Fig. 2.");

a transformer including: an encoder including a plurality of encoder layers, and a decoder including a plurality of decoder layers (see Fig. 1 for the plurality of encoder layers and decoder layers: "Fig. 1: Transformer architecture in an encoder-decoder manner. MHA and FFN denote the multi-head attention module and the feed-forward networks module, respectively. L is the number of stacked attention blocks for the encoder and decoder, and is set to the same number for simplicity");

wherein each layer of the encoder and decoder includes a bi-linear multi-head attention layer configured to compute second-order interactions between vectors associated with the extracted image features (the multi-head attention (MHA) layers are bi-linear multi-head attention layers: "where each A^l_dec consists of two MHA modules and one FFN module (see Fig. 1). The first MHA module models the self-attentions on the caption words and the second MHA module learns the image-guided attention on the caption words. Note that the self-attention (i.e., the first MHA module) is only allowed to attend to earlier positions in the output sequence and is implemented by masking subsequent positions (setting them to -∞) before the softmax step in the self-attention calculation, thereby resulting in a triangular mask matrix M ∈ R^(n×n). The output features Y^L = [y^L_1, y^L_2, ..., y^L_n] are fed into a linear word embedding layer to transform the features to a d_v-dimensional space, where d_v is the vocabulary size. Subsequently, softmax cross-entropy loss is performed on each word to predict the probability of its next word." See also Fig. 1, wherein each layer of the encoder and decoder contains a bi-linear multi-head attention layer: "The Transformer is a deep end-to-end architecture that stacks attention blocks to form an encoder-decoder strategy (see Fig. 1). Both the encoder and the decoder consist of N attention blocks, and each attention block contains the MHA and FFN modules. The MHA module learns the attended features that consider the pairwise interactions between two input features, and the FFN module further nonlinearly transforms the attended features. In the encoder, each attention block is self-attentional such that the queries, keys and values in Eq. (1) refer to the same input features. In contrast, the attention block in the decoder contains a self-attention layer and a guided attention layer. It first models the self-attention of given input features and then takes the output features of the last encoder attention block to guide the attention learning. To simplify the optimization, shortcut connection [33] and layer normalization [34] are applied after all the MHA and FFN modules");

and a text-generation module configured to generate a text-based report of the image based on an output from the transformer (see Fig. 2, wherein the system generates a text-based report of the image; Section B, quoted above, describes the image encoder extracting region-based visual features and the caption decoder recursively predicting the next word from the attended visual features).

Yu does not disclose a positional encoder configured to provide contextual order to the decoder bi-linear multi-head attention layer output. However, Huang discloses a positional encoder configured to provide contextual order to decoder attention layer output (contextual order is input, together with history information, to the attention layer: "Optionally, the second determining module 503 is specifically used for, for each display position, the user in the display position of the history click behavior information and the context information input the third embedded layer to obtain history click behavior feature and context feature; inputting the history click behavior feature and the context feature into the attention layer, through the attention layer, obtaining the behavior aggregation feature of the user to the display position."). It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Huang to Yu so the decoder could receive contextual order, in order to better understand the context of the input and provide output that better matches the extracted features of the image.
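To make the claim language concrete: the "second-order interactions" the examiner maps onto Yu's MHA module are the pairwise query-key products of scaled dot-product attention. Below is a minimal PyTorch sketch for orientation only; class and variable names are ours, not taken from the application or the cited references, and the vocabulary head illustrates the linear-plus-Softmax module addressed under claim 10 below.

```python
# Illustrative sketch only; assumes PyTorch. All names are hypothetical.
import math
import torch
import torch.nn as nn

class BilinearMultiHeadAttention(nn.Module):
    """Multi-head attention: the Q @ K^T product computes the pairwise
    second-order (bi-linear) interactions between feature vectors."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.wo = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B, Tq, _ = q.shape
        Tk = k.shape[1]

        def heads(x, T):  # [B, T, d_model] -> [B, n_heads, T, d_k]
            return x.view(B, T, self.n_heads, self.d_k).transpose(1, 2)

        Q = heads(self.wq(q), Tq)
        K = heads(self.wk(k), Tk)
        V = heads(self.wv(v), Tk)
        # bi-linear interaction between every query/key vector pair
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:  # e.g. Yu's triangular mask in decoder self-attention
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = scores.softmax(dim=-1) @ V
        out = out.transpose(1, 2).reshape(B, Tq, self.n_heads * self.d_k)
        return self.wo(out)

class VocabHead(nn.Module):
    """Claim-10-style text-generation head: a linear layer followed by a
    Softmax, yielding a probability for each word in the vocabulary."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, decoder_out):
        return self.proj(decoder_out).softmax(dim=-1)
```

A full decoder layer in Yu's Fig. 1 would stack a masked self-attention, an image-guided attention over the encoder output, and a feed-forward block, each followed by the residual add-and-normalize pattern quoted above.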
Yu and Huang do not disclose that the image is an image of a body part and that the report is a text-based medical report. However, Jarrel discloses that the image is an image of a body part, and a text-generation module configured to generate a text-based medical report of the image based on an output from the transformer (the system uses medical images to generate a clinical report for a patient; paragraph 38: "According to a further aspect, there is provided a computer implemented method for generating a clinical report for a patient, the method comprising: receiving one or more medical images associated with the patient; using an image processing component to process the one or more images, wherein the image processing component comprises a deep learning model that takes as input the one or more medical images and produces as an output an image feature tensor; and using a natural language processing component to generate a clinical report associated with the one or more medical images, wherein the natural language processing component comprises a transformer-based model that takes as input the image feature tensor from the image processing component and produces as output a probability for each word in a vocabulary."). It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Jarrel to Yu and Huang so the system can be used for generating text for medical reports from image features, wherein the bi-linear multi-head attention layers of the encoder and decoder help accurately determine text from images.

With regard to claim 10: Yu, Huang, and Jarrel disclose the aspect wherein the text-generation module further comprises a linear layer and a Softmax function layer (Yu, see Fig. 2 for the Softmax and Linear layers: "Multimodal Transformer (MT) model for image captioning. It consists of an image encoder to learn self-attended visual features, and a caption decoder to generate the caption from the attended visual features. [s] is a delimiter that indicates the start or the end of the caption.").

Claim 12 is rejected for the same reasons as claim 1.

Claims 6 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Yu, in view of Huang and Jarrel, and further in view of He (CN 114266354 A). With regard to claims 6 and 17: Yu, Huang, and Jarrel do not disclose the aspect wherein the positional encoder comprises a tensor having the same shape as an input sequence. However, He discloses this aspect ("wherein u_t represents a token at the t-th position in the output of the GPT-2, u_<t represents all tokens in the output of the GPT-2 from the first to the (t-1)-th positions; represents the tensor of the t-th position in the output of the last layer (i.e., the L-th layer), which is a one-dimensional tensor of shape [d_k]; H^1 represents the output of the first layer (of L layers in total), which is a tensor of shape [n_s, d_k], wherein n_s is the length of the input sequence and d_k is the length of the feature hidden vector. E is the semantic splicing tensor corresponding to all tokens in the input text, likewise a [n_s, d_k] tensor and a training parameter; P is the tensor corresponding to all tokens in the input sequence, the shape thereof is the same as E, and likewise a training parameter. LayerNorm represents batch standardization, FeedForward represents a two-layer fully connected feed-forward network. wherein O^l = EnhancedMaskedSelfd(N^l) calculation method is specifically as follows: [...]"). It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply He to Yu, Huang, and Jarrel in order to preserve ordering information between data points.
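He's machine-translated passage is dense, but the feature relied on is simply that the positional tensor P has the same shape [n_s, d_k] as the embedded input sequence E, so the two can be added elementwise. A sketch of one standard construction follows; note the hedge that He describes P as a learned training parameter, so the sinusoidal form below illustrates only the shape property, not He's method.

```python
# Illustrative sketch only; assumes PyTorch. Shows the "same shape as the
# input sequence" property of claims 6/17, using sinusoidal encodings as
# the simplest concrete example (He's P is a learned parameter instead).
import torch

def positional_encoding(n_seq: int, d_model: int) -> torch.Tensor:
    """Return a [n_seq, d_model] tensor matching the shape of the embedded
    input sequence E, so that H0 = E + P can be formed elementwise."""
    assert d_model % 2 == 0
    pos = torch.arange(n_seq, dtype=torch.float32).unsqueeze(1)  # [n_seq, 1]
    idx = torch.arange(0, d_model, 2, dtype=torch.float32)       # [d_model/2]
    angles = pos / torch.pow(10000.0, idx / d_model)             # [n_seq, d_model/2]
    pe = torch.zeros(n_seq, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

E = torch.randn(16, 512)           # embedded input sequence, shape [n_s, d_k]
P = positional_encoding(*E.shape)  # same shape as E, per He's description
assert P.shape == E.shape
```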
Claims 7 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yu, in view of Huang and Jarrel, and further in view of Litwiller (Pub. No.: US 2022/0198725 A1). With regard to claims 7 and 18: Yu, Huang, and Jarrel do not disclose the aspect wherein the encoder further comprises one or more add and learnable normalisation layers to produce combinations of possibilities of resulting features of each bi-linear multi-head attention layer included in the encoder. However, Litwiller discloses this aspect (paragraph 67: "In the exemplary embodiment, the convolutional layer block 602 includes a convolutional layer 608 and a pooling layer 610. Each convolutional layer 608 is flexible in terms of its depth such as the number of convolutional filters and sizes of convolutional filters. The pooling layer 610 is used to streamline the underlying computation and reduce the dimensions of the data by combining outputs of neuron clusters at the prior layer into a single neuron in the pooling layer 610. The convolutional layer block 602 may further include a normalization layer 612 between the convolutional layer 608 and the pooling layer 610. The normalization layer 612 is used to normalize the distribution within a batch of training images and update the weights in the layer after the normalization. The number of convolutional layer blocks 602 in the neural network 600 may depend on the image quality of training images, and levels of details in extracted features."). It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Litwiller to Yu, Huang, and Jarrel in order to accelerate training and produce combinations of resulting features.

Claims 8 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yu, in view of Huang and Jarrel, and further in view of Hedlund (Pub. No.: US 2017/0143312 A1). With regard to claims 8 and 19: Yu, Huang, and Jarrel do not disclose the aspect wherein the encoder receives two or more inputs to contain feature representations from a plurality of image modalities. However, Hedlund discloses this aspect (paragraph 10: "A neural network may perform certain functions separately or in conjunction with the fuzzy logic. The neural network is configured to adapt functions of ultrasound image generating systems based on patient type, user preference and system operating conditions. The neural network is used in applications including detection of anatomical features, e.g. a main vessel, disease classification, and selection of features from different image modalities to obtain a composite image."). It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Hedlund to Yu, Huang, and Jarrel so the system can receive images of different modalities, extract features from them, and generate text based on that extraction, which allows more flexibility and allows the method to be applied to different types of images.
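For claims 7/18 and 8/19 together, here is a hedged sketch of how an "add and learnable normalisation" sublayer and a two-modality encoder input are conventionally wired. It assumes PyTorch; the wiring is generic transformer practice, not taken from Litwiller or Hedlund, whose quoted passages describe CNN batch normalization and ultrasound feature fusion respectively.

```python
# Illustrative sketch only; names are hypothetical. Assumes PyTorch >= 1.9.
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual 'add' followed by LayerNorm, whose scale/shift parameters
    are learnable (the 'learnable normalisation' of claims 7 and 18)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_out):
        # combines the layer input with the sublayer's resulting features
        return self.norm(x + sublayer_out)

class MultiModalEncoderLayer(nn.Module):
    """Encoder layer accepting two inputs, one per image modality
    (claims 8 and 19), concatenated along the sequence axis."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model))
        self.addnorm1 = AddAndNorm(d_model)
        self.addnorm2 = AddAndNorm(d_model)

    def forward(self, feats_modality_a, feats_modality_b):
        # two feature sequences (e.g. from two imaging modalities)
        x = torch.cat([feats_modality_a, feats_modality_b], dim=1)
        attn_out, _ = self.mha(x, x, x)      # self-attention over both
        x = self.addnorm1(x, attn_out)       # add & learnable norm
        return self.addnorm2(x, self.ffn(x))
```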
Claims 9 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yu, in view of Huang and Jarrel, and further in view of Taherzadeh (Pub. No.: US 2021/0360460 A1). With regard to claims 9 and 20: Yu, Huang, and Jarrel do not disclose a search module configured to perform beam searching to further boost standardization and quality of the generated medical reports. However, Taherzadeh discloses a search module configured to perform beam searching (paragraph 241: "The processor 2004 may further include beam search and measurement circuitry 2044, configured to control the antenna array 2020 and transceiver 2010 to search for and identify a plurality of beams during a downlink beam sweep. The beam search and measurement circuitry 2044 may further be configured to receive a respective reference signal (e.g., SSB or CSI-RS) and measure a respective RSRP, SINR, or other suitable beam measurement of the respective reference signal on each of a set of the plurality of beams identified in a report setting 2016 and associated resource setting 2015. For example, the report setting 2016 may be associated with a resource setting 2015 including a configuration of one or more resource sets, each including a plurality of beam IDs indicating the set of beams and associated reference signal resources on which to obtain the beam measurements. The obtained beam measurements may be stored as the BMI 2018 within, for example, memory 2005 for use in generating an L1 measurement report including the BMI 2018. The beam search and measurement circuitry 2044 may further be configured to execute beam search and measurement software 2054 stored in the computer-readable medium 2006 to implement one or more of the functions described herein."). It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Taherzadeh to Yu, Huang, and Jarrel to use beam search to provide an optimal balance between computational efficiency and output quality in sequence generation tasks, in order to provide standardized and higher-quality reports.
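The Taherzadeh passage quoted above concerns beam sweeps over antenna arrays; in sequence generation, beam searching means keeping the k highest-scoring partial outputs at each decoding step. A minimal, model-agnostic sketch follows; step_fn is a hypothetical stand-in for whatever per-token scorer the decoder exposes.

```python
# Illustrative sketch of beam-search decoding; all names are hypothetical.
def beam_search(step_fn, bos, eos, beam_width=3, max_len=30):
    """step_fn(prefix) -> iterable of (token, log_prob) for the next position."""
    beams = [([bos], 0.0)]          # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:      # completed hypotheses are set aside
                finished.append((seq, score))
                continue
            for tok, logp in step_fn(seq):
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        # keep only the beam_width best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```

Length-normalising the cumulative scores is a common refinement, since raw log-probability sums otherwise favour shorter outputs.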
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Yu, in view of Huang and Jarrel, and further in view of Wada (Pub. No.: US 2023/0165456 A1). With regard to claim 11: Yu, Huang, and Jarrel do not disclose the system according to claim 1 wherein the image of the body part is an ophthalmic image. However, Wada discloses the aspect wherein the image of the body part is an ophthalmic image used for performing inference on a target (paragraphs 65 and 66: "Through machine learning (supervised learning) executed on the basis of such training data, a trained model (inference model) is created that is configured to receive as an input either one of or both ophthalmic characteristic data acquired from a patient by the ophthalmic measurement apparatus 12-m of the ophthalmic data acquiring unit 10 and ophthalmic image data acquired from this patient by the ophthalmic imaging apparatus 13-n of the ophthalmic data acquiring unit 10 and to output an inferred diagnostic result (inferred diagnostic data). The training data used to construct the first trained model 32-q.sub.i may include either one of or both ocular characteristic data and ocular image data, internal medicine data, and diagnostic result data. In other words, the training data used to construct the first trained model 32-q.sub.i may further include internal medicine data. If this is the case, a trained model (inference model) is created that is configured to receive as an input either one of or both ophthalmic characteristic data acquired from a patient by the ophthalmic measurement apparatus 12-m of the ophthalmic data acquiring unit 10 and ophthalmic image data acquired from this patient by the ophthalmic imaging apparatus 13-n, and internal medicine data acquired by the internal medicine data acquiring unit 20, and to output an inferred diagnostic result (inferred diagnostic data)."). It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Wada to Yu, Huang, and Jarrel so the system could apply the same method to perform inference on targets in ophthalmic images.

Pertinent Arts

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Liu, Pub. No.: US 2021/0011941: In one embodiment, a multimedia file categorizing method is provided. The method includes determining a plurality of feature combinations according to respective feature sets corresponding to at least two types of modality information in a multimedia file, where a plurality of features constituting the feature combinations are selected from feature sets corresponding to different modality information; determining a semantically relevant feature combination by using a first computational model according to the plurality of feature combinations; and categorizing the multimedia file by using the first computational model with reference to the semantically relevant feature combination, where the multimedia file includes a plurality of types of modality information, and the plurality of types of modality information include at least two modalities selected from the group consisting of a text modality, an image modality, and a voice modality.

Singaraju, Pub. No.: US 2022/0058347: Accordingly, a different approach is needed to address these problems. In various embodiments, a computer-implemented method is provided that includes receiving a request to explain an inference result for an utterance by an intent classifier of a chatbot system; obtaining the inference result for the utterance from the intent classifier; selecting one or more anchors, where each anchor of the one or more anchors may include one or more anchor words in the utterance; for each anchor of the one or more anchors, varying one or more words of the utterance that are not anchor words for the anchor to generate variations of the utterance; obtaining inference results for the variations of the utterance from the intent classifier; determining, based on the inference results for the utterance and the variations of the utterance, a confidence level associated with each anchor of the one or more anchors; and generating a report that includes the anchor with the highest confidence level among the one or more anchors. In some embodiments, selecting the one or more anchors may include selecting the one or more anchors in one or more rounds using a beam search technique. The variations of the utterance, the inference results for the variations of the utterance, and the report may be stored in a database.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DI XIAO, whose telephone number is (571) 270-1758.
The examiner can normally be reached 9 AM to 5 PM EST, Monday through Friday. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Stephen Hong, can be reached at (571) 272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DI XIAO/ Primary Examiner, Art Unit 2178

Prosecution Timeline

May 16, 2024
Application Filed
Mar 20, 2026
Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12599341
AUTONOMOUS, CONSENT DRIVEN AND GENERATIVE DEVICE, SYSTEM AND METHOD THAT PROMOTES USER PRIVACY, SELF-KNOWLEDGE AND WELL-BEING
2y 5m to grant Granted Apr 14, 2026
Patent 12597519
METHODS FOR CHARACTERIZING AND TREATING A CANCER TYPE USING CANCER IMAGES
2y 5m to grant Granted Apr 07, 2026
Patent 12588967
PRESENTATION OF PATIENT INFORMATION FOR CARDIAC SHUNTING PROCEDURES
2y 5m to grant Granted Mar 31, 2026
Patent 12586456
SYSTEMS AND METHODS FOR PROVIDING SECURITY SYSTEM INFORMATION USING AUGMENTED REALITY EFFECTS
2y 5m to grant Granted Mar 24, 2026
Patent 12579773
DISPLAY APPARATUS AND DISPLAY METHOD
2y 5m to grant Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

1-2
Expected OA Rounds
77%
Grant Probability
99%
With Interview (+21.7%)
3y 4m
Median Time to Grant
Low
PTA Risk
Based on 600 resolved cases by this examiner. Grant probability derived from career allow rate.
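A short sketch of how the headline figures appear to be derived, assuming the interview lift is simply added to the career allow rate (an assumption; the page does not state its formula):

```python
# Hypothetical reconstruction of the dashboard's headline numbers.
granted, resolved = 463, 600        # examiner's career record (from the page)
interview_lift = 0.217              # +21.7-point lift reported for interviews

career_allow = granted / resolved                 # 0.7717 -> shown as 77%
with_interview = career_allow + interview_lift    # 0.9887 -> shown as 99%

print(f"career allow rate: {career_allow:.0%}")   # 77%
print(f"with interview:    {with_interview:.0%}") # 99%
```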
