Prosecution Insights
Last updated: April 19, 2026
Application No. 18/015,036

METHOD AND SYSTEM FOR AUTOMATED GENERATION OF TEXT CAPTIONS FROM MEDICAL IMAGES

Final Rejection §103
Filed: Jan 06, 2023
Examiner: GEBRESLASSIE, WINTA
Art Unit: 2677
Tech Center: 2600 — Communications
Assignee: Harrison-Ai Pty Ltd.
OA Round: 2 (Final)

Predictions
Grant Probability: 76% (Favorable)
Expected OA Rounds: 3-4
Expected Time to Grant: 2y 5m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% (above average; 101 granted / 133 resolved; +13.9% vs TC avg)
Interview Lift: +24.7% across resolved cases with interview
Typical Timeline: 2y 5m avg prosecution; 53 currently pending
Career History: 186 total applications across all art units
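The headline figures above follow from simple arithmetic over the case counts. A minimal Python sketch of that derivation; note the Tech Center average here is merely back-computed from the stated "+13.9% vs TC avg" delta, not taken from any USPTO dataset:

```python
def allow_rate_pct(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage of resolved cases."""
    return 100.0 * granted / resolved

career = allow_rate_pct(101, 133)   # 101 granted of 133 resolved, per the panel above
tc_avg = career - 13.9              # implied by the "+13.9% vs TC avg" delta

print(f"Career allow rate: {career:.1f}%")   # ~75.9%, displayed rounded as 76%
print(f"Implied TC average: {tc_avg:.1f}%")
```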

Statute-Specific Performance

§101: 3.3% (-36.7% vs TC avg)
§103: 66.4% (+26.4% vs TC avg)
§102: 16.8% (-23.2% vs TC avg)
§112: 5.0% (-35.0% vs TC avg)

Deltas are measured against a Tech Center average estimate • Based on career data from 133 resolved cases

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

Claims 1, 7, 22, 24, 26-27, and 63 have been amended. Claim 4 has been cancelled. Claims 1-2, 7, 11, 13-24, 26-27, and 63 remain pending for consideration.

Response to Arguments

Applicant on page 2 of the “Remarks” asserts: “Neither a "deep-learning diagnosis report unit" nor an "end-to-end learning network," as disclosed by Song, teaches or suggests a transformer-based model. Thus, Song is silent with respect to at least this claim element”.

Response: The rejection of claims 1 and 63 is not premised on Song alone, but rather on Song in view of Wang. Song is relied upon for teaching medical image analysis and generation of natural language diagnostic text based on extracted image features, including the use of deep-learning models to generate reports from medical images. Song does not need to disclose a transformer architecture to satisfy its role in the combination. Wang is relied upon for teaching the use of a transformer/self-attention-based model for sequence modeling and text generation. Wang discloses sequence-based language generation that produces word predictions based on probability distributions derived from image feature representations. Wang discloses extracting a visual feature vector using a CNN and inputting that vector into a language processing model that probabilistically generates a descriptive caption by iteratively predicting words based on probability distributions. While Wang describes one implementation using a recurrent neural network, Wang is cited to establish the general principle of probabilistic word prediction from image features, which is common to both recurrent and transformer-based sequence models (see para [0004], [0021], [0022], [0041], etc.).

Applicant’s argument that Song “teaches away” by describing RNNs is not persuasive. Song merely discloses one possible implementation and does not discourage the use of attention-based or transformer-based models. Moreover, the present application’s discussion of the disadvantages of RNNs further supports the motivation to substitute a transformer-based model for known sequence modeling tasks. The applied secondary reference Wang has been replaced by newly found reference Song2 (US 20220351487 A1) in response to Applicant’s claim amendments.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

Claims 1-2, 4, 7, 13, 15-23, 27 and 63 are rejected under 35 U.S.C. 103 as being unpatentable over Song et al. (US 20190139218 A1) in view of Song et al. (US 20220351487 A1), hereinafter Song2.

Regarding claim 1, Song et al.
teach a computer implemented method for generating captions for medical images (see para [0006]; “a system for generating a report based on medical images of a patient”, Note: generating a report implies generating captions), the method comprising:

obtaining one or more medical images (see para [0006]; “The system includes a communication interface configured to receive the medical images acquired by an image acquisition device”);

obtaining one or more words comprising seed text (see para [0049]; “user 105 may type text in a message box 222. The text entered can be keywords, phrases, or sentences. For example, user 105 may enter “brain” in message box 222, as shown in FIG. 2C. Based on the entered text and the image viewed, processor 120 may automatically adjust and generate new descriptions and the corresponding keywords. The order of the keywords may also be adjusted accordingly. For example, because “brain” is entered as shown in FIG. 2C, the keywords associated with brain, such as “left frontal lobe,” “gray matter,” “white matter,” etc. are ranked higher as compared to those in FIG. 2A”, Note: the "brain" example given ("keywords, phrases, or sentences") serves as the seed text);

using an image processing component to process the one or more medical images, wherein the image processing component comprises a deep learning model that takes as input the one or more medical images and produces as an output an image feature tensor (see para [0038]; “deep-learning diagnosis report unit 124 may apply an end-to-end learning network to infer the text information from medical images 102. The end-to-end learning network may include two parts: a first part that extracts image features from medical images 102”, Note: extracted features imply a feature tensor);

using a natural language processing component to generate a caption for the one or more medical images (see para [0038]; “the second part of the end-to-end learning network may include a recursive neural network (RNN). The RNN may generate a natural language description of at least one medical image based on the image features. In some embodiments, the RNN may further determine keywords from the natural language description and provide the keywords to a user for selection. The text included in the report may be generated based on the user selected keywords”).

However, Song et al. does not specifically teach wherein the natural language processing component comprises a transformer-based model that takes as input the image feature tensor from the image processing component and produces as output a probability for each word in a vocabulary, wherein the transformer-based model takes as input a tensor that comprises the image feature tensor and an input tensor derived from the seed text.

In the same field of endeavor, Song2 teach wherein the natural language processing component comprises a transformer-based model that takes as input the image feature tensor from the image processing component and produces as output a probability for each word in a vocabulary (see para [0177]; “the target detection features and the global image features are inputted into an encoder of the Transformer translation model…. reference decoding vectors, the encoding vectors, and the global image features are inputted into a decoder to generate decoding vectors outputted by the decoder”, see also para [0111]; “the softmax function is used to convert the attention score numerically through a formula (2). On one hand, normalization may be performed to obtain a probability distribution with the sum of all weight coefficients being 1”, Note: the global image features correspond to the image feature tensor), wherein the transformer-based model takes as input a tensor that comprises the image feature tensor and an input tensor derived from the seed text (see para [0149]; “the reference decoding vectors are initial decoding vectors;… reference decoding vectors are the decoding vectors corresponding to the previous translation phrase”, see also para [0161]; “for the each one of the other translation phrases …. the reference decoding vectors thereof are decoding vectors corresponding to the previous translation phrase”, and para [0182]; “based on initial reference decoding vectors, the encoding vectors, and the global image features inputted, the decoder outputs decoding vectors and the first phrase “a” is obtained. Vectors corresponding to the first phrase “a” are taken as a reference for decoding the second phrase “boy”, Note: the reference decoding vectors are derived from previously generated words (or initialized text vectors) and are input to the transformer decoder).

Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method for generating a report based on medical images of a patient based on a learning network of Song et al. in view of the image description method of Song2, which performs feature extraction on a target image to generate a translation sentence, in order to generate complete global information corresponding to the image as a reference in the subsequent process (see para [0177]).

Regarding claim 2, the rejection of claim 1 is incorporated herein.
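The limitations mapped above (an image feature tensor pre-pended to an input tensor derived from seed text, a decoder step producing a probability for each word in a vocabulary, and sampling from that distribution) can be illustrated with a toy numpy sketch. Every name, dimension, and the miniature vocabulary here is invented for illustration; this is not code from the application or from either reference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes -- all hypothetical stand-ins for learned parameters.
VOCAB = ["<s>", "brain", "lesion", "left", "frontal", "lobe", "</s>"]
D = 8                                        # embedding size

embed = rng.normal(size=(len(VOCAB), D))     # token embedding table
W_out = rng.normal(size=(D, len(VOCAB)))     # projection to vocabulary logits

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def next_word_probs(image_features, seed_ids):
    """One decoding step: pre-pend the image feature tensor to the embedded
    seed text, apply one self-attention layer, and project the last position
    to a probability for each word in the vocabulary."""
    text = embed[seed_ids]                   # input tensor derived from the seed text
    x = np.vstack([image_features, text])    # image feature tensor pre-pended
    attn = softmax(x @ x.T / np.sqrt(D), axis=1)   # single-head self-attention
    h = attn @ x                             # attention-weighted mixture
    return softmax(h[-1] @ W_out)            # probability for each vocab word

img = rng.normal(size=(2, D))                # pretend CNN output: 2 feature rows
seed = [VOCAB.index("<s>"), VOCAB.index("brain")]
probs = next_word_probs(img, seed)

# Second step of the claim-2 mapping: sample a word from the distribution.
next_word = VOCAB[rng.choice(len(VOCAB), p=probs / probs.sum())]
```

In a real captioner the sampled word would be appended to the seed and the step repeated until a stop token appears, mirroring the iterative decoding described in the cited paragraphs.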
Song2 in the combination further teach wherein using the natural language processing component to generate the caption for the one or more medical images comprises a first step which comprises using the transformer-based model to predict a probability for each word in the vocabulary (see para [0111]; “the softmax function is used to convert the attention score numerically through a formula (2). On one hand, normalization may be performed to obtain a probability distribution with the sum of all weight coefficients being 1”) and a second step which comprises sampling one or more words using the probabilities from the first step (see para [0182]; “Vectors corresponding to the first phrase “a” are taken as a reference for decoding the second phrase “boy”. Vectors corresponding to the second phrase “boy” are taken as reference decoding vectors, so that the decoder can obtain the next phrase “play” based on the reference decoding vectors”).

Regarding claim 4, the rejection of claim 1 is incorporated herein. Song2 in the combination further teach wherein the transformer-based model further takes as input an input tensor derived from a set of one or more words (see para [0149]; “the reference decoding vectors are initial decoding vectors;… reference decoding vectors are the decoding vectors corresponding to the previous translation phrase”, see also para [0161]; “for the each one of the other translation phrases …. the reference decoding vectors thereof are decoding vectors corresponding to the previous translation phrase”, and para [0182]; “based on initial reference decoding vectors, the encoding vectors, and the global image features inputted, the decoder outputs decoding vectors and the first phrase “a” is obtained. Vectors corresponding to the first phrase “a” are taken as a reference for decoding the second phrase “boy”. Vectors corresponding to the second phrase “boy” are taken as reference decoding vectors, so that the decoder can obtain the next phrase “play” based on the reference decoding vectors, the encoding vectors”, Note: the reference decoding vectors are derived from previously generated words (or initialized text vectors) and are input to the transformer decoder).

Regarding claim 7, the rejection of claim 1 is incorporated herein. Song et al. in the combination further teach obtaining the input tensor by tokenising and embedding the seed text (see para [0067]; “output layer 416 may select a word from the vocabulary at each time point, based on hidden state vector 414. In some embodiments, output layer 416 can be constructed as a fully-connected layer. Words may be continuously generated/sampled from the vocabulary until a stop token is sampled, which encodes the end of a report. In some embodiments, generated word 420 by output layer 416 may be used to create word embedding 418 by embedding layer 422”).

Regarding claim 13, the rejection of claim 1 is incorporated herein. Song et al. in the combination further teach wherein the image processing component and the natural language processing component have been trained jointly, to minimise at least one of the cross entropy loss and the perplexity of the predictions of the transformer-based model over a set of data (see para [0039]; “the end-to-end learning network may include an attention layer in between the CNN and RNN that assigns weights to the image features in different regions of the images. The assigned weights may be different depending on various factors. The CNN, the RNN, and the attention layer may be trained jointly to enhance the performance of the end-to-end learning network. For example, a joint loss function may be used to account for the combined performance of the CNN, the RNN, and the attention layer”, see also para [0069]; “the loss function can be defined by Equation (1):….where custom-character.sub.CNN is a suitable loss for medical image-related task in the CNN part (for example, cross-entropy loss for classification task and root mean squared error for regression task)”).

Regarding claim 15, the rejection of claim 1 is incorporated herein. Song et al. in the combination further teach wherein the one or more medical images comprise multiple medical images and the method comprises generating a caption for the multiple medical images jointly (see para [0062]; “end-to-end diagnosis report generation model 400 may take one or more pre-processed images, e.g., a medical image 402, as input and output the description of the medical image (e.g., a text-based description) together with attention weights for the input image(s)….when the input includes multiple images, all the images may be input into model 400 as a whole (concatenated) and processed at the same time”), and wherein the multiple medical images are related to each other by sharing one or more features selected from: being associated with the same subject, being acquired using the same modality, showing the same pathology, showing the same organ or body part (see para [0054]; “In step S302, diagnostic report generating system 100 may receive one or more medical images 102 associated with a patient, e.g., from image acquisition device 101 or a medical image database. Medical images 102 may be 2D or 3D images. Medical images 102 can be generated from any imaging modality”, see also para [0062]; “For example, the end-to-end learning network may be trained to interpret medical images 102 in light of the patient information. For instance, different image features may be extracted for an image of a pediatric patient as opposed to an image of a senior patient. In another example, diagnosis of lung cancer may change based on a patient's smoking history”).

Regarding claim 16, the rejection of claim 1 is incorporated herein. Song et al. in the combination further teach wherein the image processing component and the natural language processing component have been trained using training data comprising images that share one or more features with the one or more medical images, the one or more features being selected from: being associated with the same subject, being acquired using the same modality, showing the same pathology, showing the same organ or body part (see para [0062]; “For example, the end-to-end learning network may be trained to interpret medical images 102 in light of the patient information. For instance, different image features may be extracted for an image of a pediatric patient as opposed to an image of a senior patient. In another example, diagnosis of lung cancer may change based on a patient's smoking history”, see also Song et al. para [0038]; “The end-to-end learning network may include two parts: a first part that extracts image features from medical images 102… The RNN may generate a natural language description of at least one medical image based on the image features”).

Regarding claim 17, the rejection of claim 1 is incorporated herein. Song et al.
in the combination further teach further comprising pre-processing the one or more medical images by performing one or more steps selected from: randomly re-ordering the one or more medical images, normalising pixel values across the one or more medical images, changing the aspect ratio of one or more of the one or more medical images, scaling one or more of the one or more medical images, re-sizing one or more of the one or more medical images (see para [0035]; “image processing unit 122 may perform pre-processing on medical images 102, such as filtering to reduce image artifacts or noises, and leveling image quality, e.g., by adjusting the images' exposure parameters to increase contrast. In some embodiments, pre-processing may also include resizing or normalization of medical images 102. Such pre-processing may condition medical images 102 before they are displayed on a user interface (e.g., on display 130)”, see also para [0054]; “the preprocessing may include resizing, normalization, filtering, contrast balancing, etc”).

Regarding claim 18, the rejection of claim 1 is incorporated herein. Song2 in the combination further teach wherein the caption comprises free text (see para [0182]; “Vectors corresponding to the first phrase “a” are taken as a reference for decoding the second phrase “boy”. Vectors corresponding to the second phrase “boy” are taken as reference decoding vectors, so that the decoder can obtain the next phrase “play” based on the reference decoding vectors, the encoding vectors, and the global image features . . . and so on, a description sentence “A boy play football on football field” is obtained”).

Regarding claim 19, the rejection of claim 1 is incorporated herein. Song et al. in the combination further teach wherein the one or more medical images are associated with a patient and the caption is a clinical report for the patient (see para [0016]; “the present disclosure may support automatic or semi-automatic generation of medical reports for both whole image(s) (or multiple images of the same patient), and/or specific region(s) of interest. The reports may include descriptions of clinical observations. The reports may also include images related to the observations”).

Regarding claim 20, the rejection of claim 1 is incorporated herein. Song et al. in the combination further teach wherein the one or more medical images are selected from: histopathology images, radiography images, magnetic resonance images, ultrasound images, endoscopy images, positron emission tomography (PET) images, single-photon emission computed tomography (SPECT) images, and gross pathology images (see para [0027]; “image acquisition device 101 may acquire medical images 102 using any suitable imaging modalities, including, e.g., functional MRI (e.g., fMRI, DCE-MRI and diffusion MRI), Cone Beam CT (CBCT), Spiral CT, Positron Emission Tomography (PET), Single-Photon Emission Computed Tomography (SPECT), X-ray, optical tomography, fluorescence imaging, ultrasound imaging, and radiotherapy portal imaging, etc”).

Regarding claim 21, the rejection of claim 1 is incorporated herein. Song2 in the combination further teach wherein the natural language processing component comprises a transformer-based model with a single stack architecture (see Fig. 6, which discloses a model implemented as a single stack of serially arranged encoding layers, wherein each layer has the same self-attention and feedforward structure and processes the output of the preceding layer).

Regarding claim 22, the rejection of claim 1 is incorporated herein. Song2 in the combination further teach wherein the transformer-based model uses an attention mask that is configured to forbid elements in the input tensor from attending to one another (see para [0182]; “based on initial reference decoding vectors, the encoding vectors, and the global image features inputted, the decoder outputs decoding vectors and the first phrase “a” is obtained. Vectors corresponding to the first phrase “a” are taken as a reference for decoding the second phrase “boy”. Vectors corresponding to the second phrase “boy” are taken as reference decoding vectors, so that the decoder can obtain the next phrase “play” based on the reference decoding vectors, the encoding vectors, and the global image features . . . and so on, a description sentence “A boy play football on football field” is obtained”, Note: the Transformer is forced to rely solely on "initial reference decoding vectors" (previous tokens) and "encoding vectors" (context from the input) to generate the next token, producing "A boy play football on football field" one word at a time and preventing it from seeing "field" before it has finished generating "play.").

Regarding claim 23, the rejection of claim 1 is incorporated herein. Song2 in the combination further teach wherein the transformer-based model comprises one or more encoder and decoder blocks (see para [0043]; “Transformer: a translation model comprising an encoder and a decoder”), each comprising a multi-head attention layer (see para [0093]; “Wherein the first self-attention layer includes a multi-head self-attention layer”).

Regarding claim 27, the rejection of claim 1 is incorporated herein.
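The claim-22 mapping turns on an attention mask that stops input elements from attending to positions not yet generated. A standard way to realize this is an additive causal mask; the numpy sketch below is purely illustrative and is not the formulation of either cited reference:

```python
import numpy as np

def causal_mask(n):
    """Additive mask: position i may not attend to positions j > i,
    so the decoder sees only previously generated tokens."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention_weights(scores):
    """Apply the causal mask, then row-wise softmax; exp(-inf) = 0 zeroes
    out every forbidden (future) position."""
    m = scores + causal_mask(scores.shape[0])
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

scores = np.zeros((4, 4))            # uniform raw scores for 4 positions
w = masked_attention_weights(scores)
# Row 0 attends only to itself; row 3 attends equally to all four positions.
```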
Song2 in the combination further teach wherein the tensor taken as input by the transformer-based model comprises the image feature tensor pre-pended to the input tensor derived from the seed text (see para [0187]; “a translation module 704 configured for inputting the global image features corresponding to the target image and the target detection features corresponding to the target image into a translation model to generate a translation sentence, and taking the translation sentence as a description sentence of the target image”).

Regarding claim 63, Song et al. teaches a system comprising: at least one processor; and at least one non-transitory computer readable medium containing instructions that, when executed by the at least one processor, cause the at least one processor to (see para [0008]; “a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more processors, cause the one or more processors”):

obtain one or more medical images (see para [0006]; “The at least one processor is configured to receive a user selection of at least one medical image in at least one view”);

obtain one or more words comprising seed text (see para [0049]; “user 105 may type text in a message box 222. The text entered can be keywords, phrases, or sentences. For example, user 105 may enter “brain” in message box 222, as shown in FIG. 2C. Based on the entered text and the image viewed, processor 120 may automatically adjust and generate new descriptions and the corresponding keywords. The order of the keywords may also be adjusted accordingly. For example, because “brain” is entered as shown in FIG. 2C, the keywords associated with brain, such as “left frontal lobe,” “gray matter,” “white matter,” etc. are ranked higher as compared to those in FIG. 2A”, Note: user input implies seed text);

use an image processing component to process the one or more medical images (see para [0006]; “The at least one processor is configured to receive a user selection of at least one medical image in at least one view”), wherein the image processing component comprises a deep learning model that takes as input the one or more medical images and produces as an output an image feature tensor (see para [0062]; “end-to-end diagnosis report generation model 400 may take one or more pre-processed images, e.g., a medical image 402, as input and output the description of the medical image”, see also para [0038]; “deep-learning diagnosis report unit 124 may apply an end-to-end learning network to infer the text information from medical images 102. The end-to-end learning network may include two parts: a first part that extracts image features from medical images 102”, Note: extracted features imply a feature tensor);

use a natural language processing component to generate a caption for the one or more medical images (see para [0059]; “In step S318, diagnostic report generating system 100 may adjust the natural language description”).

However, Song et al. does not teach wherein the natural language processing component comprises a transformer-based model that takes as input the image feature tensor from the image processing component and produces as output a probability for each word in a vocabulary, wherein the transformer-based model takes as input a tensor that comprises the image feature tensor and an input tensor derived from the seed text.

In the same field of endeavor, Song2 teach wherein the natural language processing component comprises a transformer-based model that takes as input the image feature tensor from the image processing component and produces as output a probability for each word in a vocabulary (see para [0177]; “the target detection features and the global image features are inputted into an encoder of the Transformer translation model…. reference decoding vectors, the encoding vectors, and the global image features are inputted into a decoder to generate decoding vectors outputted by the decoder”, see also para [0111]; “the softmax function is used to convert the attention score numerically through a formula (2). On one hand, normalization may be performed to obtain a probability distribution with the sum of all weight coefficients being 1”, Note: the global image features correspond to the image feature tensor), wherein the transformer-based model takes as input a tensor that comprises the image feature tensor and an input tensor derived from the seed text (see para [0149]; “the reference decoding vectors are initial decoding vectors;… reference decoding vectors are the decoding vectors corresponding to the previous translation phrase”, see also para [0161]; “for the each one of the other translation phrases …. the reference decoding vectors thereof are decoding vectors corresponding to the previous translation phrase”, and para [0182]; “based on initial reference decoding vectors, the encoding vectors, and the global image features inputted, the decoder outputs decoding vectors and the first phrase “a” is obtained. Vectors corresponding to the first phrase “a” are taken as a reference for decoding the second phrase “boy”, Note: the reference decoding vectors are derived from previously generated words (or initialized text vectors) and are input to the transformer decoder).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method for generating a report based on medical images of a patient based on a learning network of Song et al. in view of the image description method of Song2, which performs feature extraction on a target image to generate a translation sentence, in order to generate complete global information corresponding to the image as a reference in the subsequent process (see para [0177]).

Claims 24 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Song et al. in view of Song2 as applied in claims 1 and 7 above, and further in view of Wang et al. (US 20170200065 A1).

Regarding claim 24, the rejection of claim 1 is incorporated herein. Song2 in the combination further teach wherein the transformer-based model further takes as input a vector comprising information about a relative position of elements in an input tensor, wherein the relative position of the elements corresponds to an order of the one or more words in the seed text from which the input tensor was derived (see para [0032]; “the Transformer model does not require a loop, instead, processes the input global image features corresponding to the target image and the target detection features corresponding to the target image in parallel, while uses the self-attention mechanism to combine features”, Note: transformer-based models require positional encoding (vectors added to word embeddings) to understand sequence order, as self-attention is inherently permutation-invariant). However, Song2 does not explicitly disclose a relative position of elements.

In the same field of endeavor, Wang et al. teach wherein the transformer-based model further takes as input a vector comprising information about a relative position of elements in an input tensor, wherein the relative position of the elements corresponds to an order of the one or more words in the seed text from which the input tensor was derived (see para [0068]; “for each node, N words different from the target word are randomly selected and a loss factor for the objective function is defined as log(1+exp(−w.sub.iVh.sub.i−1)+Σ.sub.n log(1+exp(w.sub.nVh.sub.i−1). In this expression, w.sub.i represents the embedding for each target word at i-th position. w.sub.n represents the n-th randomly chosen negative sample for the i-th target word and h.sub.i−1 is the hidden response at position i−1”, Note: w.sub.i represents the word embedding and the i-th position indicates the position of the target word within a sequence).

Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method for generating a report based on medical images of a patient based on a learning network of Song et al. in view of the image description method of Song2, which performs feature extraction on a target image to generate a translation sentence, and the technique of Wang et al. for image captioning with weak supervision during image captioning analysis, in order to generate image captions with greater complexity and precision (see para [0068]).

Regarding claim 26, the rejection of claim 7 is incorporated herein. Wang et al. in the combination further teach wherein the input tensor has a size KxM, wherein M is the size of the embedding used by the transformer-based model and K is a number of tokens derived from the set of one or more words by tokenization (see Wang et al. para [0050]; “Then, each input node in the RNN is appended with additional embedding information for the keywords according to the equation K.sub.e=max (W.sub.kK+b). Here, K.sub.e is the keyword list for the node, W.sub.k is the embedding matrix for the keywords that controls the keyword weights 504”).

Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method for generating a report based on medical images of a patient based on a learning network of Song et al. in view of the image description method of Song2, which performs feature extraction on a target image to generate a translation sentence, and the technique of Wang et al. for image captioning with weak supervision during image captioning analysis, in order to generate image captions with greater complexity and precision (see para [0050]).

Claims 11 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Song et al. in view of Wang et al. as applied in claim 1 above, and further in view of Radford et al., “Language Models are Unsupervised Multitask Learners”.

Regarding claim 11, the rejection of claim 1 is incorporated herein. The combination of Song et al. and Song2 does not teach wherein the transformer-based model is obtained by training a pre-trained GPT-2 model, a pre-trained BERT model or a pre-trained T5 model. In the same field of endeavor, Radford et al. teaches wherein the transformer-based model is obtained by training a pre-trained GPT-2 model, a pre-trained BERT model or a pre-trained T5 model (see section 3, Experiments; “We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summarized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2018). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT”).
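Claims 24 and 26 concern positional information and a K x M input tensor (K tokens from the tokenised seed text, embedding size M). One common concrete realization, shown here only as an illustrative assumption and not as anything disclosed by Song2 or Wang, is a sinusoidal positional encoding added to the K x M token embeddings:

```python
import numpy as np

def sinusoidal_positions(K, M):
    """One M-dimensional vector per token position: even dimensions carry
    sin terms, odd dimensions cos terms, so each row encodes its order."""
    pos = np.arange(K)[:, None]                      # (K, 1) positions
    i = np.arange(M)[None, :]                        # (1, M) dimension index
    angles = pos / np.power(10000.0, (2 * (i // 2)) / M)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

K, M = 5, 16                     # K tokens, embedding size M (toy values)
tokens = np.zeros((K, M))        # placeholder K x M input tensor
pe = sinusoidal_positions(K, M)
x = tokens + pe                  # order information injected before attention
```

Because self-attention is permutation-invariant, omitting this additive term would leave the model unable to distinguish word order in the seed text.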
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Song et al. for generating a report based on medical images of a patient using a learning network, in view of the image description method of Song2 and the pre-trained transformer language models of Radford et al., in order to create state-of-the-art models for a wide range of tasks without substantial task-specific architecture modifications (see section 3, Experiments).

Regarding claim 14, the rejection of claim 1 is incorporated herein. Song et al. in the combination further teach receiving training data from a user (see para [0068]; "end-to-end diagnosis report generation model 400 may be trained using sample medical images and their corresponding diagnosis reports (e.g., text-based descriptions) provided by radiologists/clinicians (serving as ground truths)"). Radford et al. in the combination further teach at least partially re-training the deep learning models in the image processing component and the transformer-based model in the natural language processing component using the training data (see section 6, Discussion; "An influential early work on deep representation learning for text was Skip-thought Vectors…explored the use of representations derived from machine translation models and Howard & Ruder (2018) improved the RNN based fine-tuning approaches of (Dai & Le, 2015). (Conneau et al., 2017a) studied the transfer performance of representations learned by natural language inference models and (Subramanian et al., 2018) explored large-scale multitask training").
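The "at least partially re-training" of claim 14 can be pictured as updating only a subset of parameters while the rest stay frozen, which is the usual shape of fine-tuning a pre-trained model. A toy numpy sketch (a linear head on a frozen backbone; the names, shapes, and squared-error objective are all assumptions made for illustration, not the applicant's or Radford's method):

```python
import numpy as np

rng = np.random.default_rng(2)
backbone = rng.normal(size=(16, 16))   # pre-trained weights, kept frozen
head = rng.normal(size=(16, 4))        # task-specific head, the only part re-trained

def partial_retrain_step(x, y, head, lr=0.01):
    """One gradient step on the head only; the frozen backbone is never updated."""
    feats = np.tanh(x @ backbone)            # frozen feature extractor
    pred = feats @ head                      # trainable linear head
    grad = feats.T @ (pred - y) / len(x)     # dL/dhead for 0.5*||pred - y||^2
    return head - lr * grad

x = rng.normal(size=(32, 16))          # a small batch of "training data from a user"
y = rng.normal(size=(32, 4))
backbone_before = backbone.copy()
head_before = head.copy()
head = partial_retrain_step(x, y, head)
```

After the step, only the head has moved; the backbone is bit-for-bit unchanged, which is what distinguishes partial re-training from training the whole network from scratch.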
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Song et al. for generating a report based on medical images of a patient using a learning network, in view of the image description method of Song2 and the approach of Radford et al., which performs natural language processing tasks without explicit supervision when trained on a new dataset of millions of webpages, in order to perform a wide range of tasks in a zero-shot setting (see section 6, Discussion).

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to WINTA GEBRESLASSIE, whose telephone number is (571) 272-3475. The examiner can normally be reached Monday-Friday, 9:00-5:00. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool.
To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Bee, can be reached at 571-270-5180. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WINTA GEBRESLASSIE/
Examiner, Art Unit 2677

/ANDREW W BEE/
Supervisory Patent Examiner, Art Unit 2677

Prosecution Timeline

Jan 06, 2023 - Application Filed
Aug 08, 2025 - Non-Final Rejection (§103)
Nov 20, 2025 - Response Filed
Feb 26, 2026 - Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12579683 - IMAGE VIEW ADJUSTMENT (2y 5m to grant; granted Mar 17, 2026)
Patent 12573238 - BIOMETRIC FACIAL RECOGNITION AND LIVENESS DETECTOR USING AI COMPUTER VISION (2y 5m to grant; granted Mar 10, 2026)
Patent 12530768 - SYSTEMS AND METHODS FOR IMAGE STORAGE (2y 5m to grant; granted Jan 20, 2026)
Patent 12524932 - MACHINE LEARNING IMAGE RECONSTRUCTION (2y 5m to grant; granted Jan 13, 2026)
Patent 12511861 - DETECTION OF ANNOTATED REGIONS OF INTEREST IN IMAGES (2y 5m to grant; granted Dec 30, 2025)
Based on this examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 76%
With Interview: 99% (+24.7% lift)
Median Time to Grant: 2y 5m
PTA Risk: Moderate
Based on 133 resolved cases by this examiner. Grant probability derived from career allow rate.
