Prosecution Insights
Last updated: April 19, 2026
Application No. 18/304,330

MEDICAL IMAGING ANALYSIS USING SELF-SUPERVISED LEARNING

Final Rejection (§103)
Filed: Apr 20, 2023
Examiner: BURLESON, MICHAEL L
Art Unit: 2681
Tech Center: 2600 — Communications
Assignee: Bristol-Myers Squibb Company
OA Round: 2 (Final)
Grant Probability: 75% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 10m
Grant Probability With Interview: 68%

Examiner Intelligence

Career Allow Rate: 75%, above average (365 granted / 489 resolved; +12.6% vs TC avg)
Interview Lift: -6.1% (minimal, based on resolved cases with interview)
Avg Prosecution: 2y 10m typical timeline; 36 applications currently pending
Total Applications: 525 across all art units (career history)

Statute-Specific Performance

§101: 12.1% (-27.9% vs TC avg)
§103: 55.2% (+15.2% vs TC avg)
§102: 21.8% (-18.2% vs TC avg)
§112: 8.3% (-31.7% vs TC avg)
TC averages are estimates. Based on career data from 489 resolved cases.
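
For readers who want to check the arithmetic, the sketch below reproduces these figures in Python. It assumes the dashboard computes the allow rate as granted divided by resolved and applies the interview lift additively; both are assumptions, not documented behavior. Notably, every statute's rate and delta imply the same 40% Tech Center baseline, consistent with a single average line.

```python
# Reproducing the dashboard's headline numbers from its own figures.
# Assumptions (not stated by the dashboard): allow rate = granted / resolved,
# and the interview-adjusted probability is a simple additive shift.
granted, resolved = 365, 489
allow_rate = granted / resolved            # 0.746... shown as 75%
with_interview = allow_rate + (-0.061)     # 0.685... shown as 68%

# Each statute's rate minus its reported delta backs out the implied
# Tech Center average; all four give the same 40%, i.e., one baseline.
statutes = {"§101": (0.121, -0.279), "§103": (0.552, +0.152),
            "§102": (0.218, -0.182), "§112": (0.083, -0.317)}
for name, (rate, delta) in statutes.items():
    print(f"{name}: examiner {rate:.1%}, implied TC avg {rate - delta:.1%}")

print(f"allow rate {allow_rate:.1%}, with interview {with_interview:.1%}")
```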

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments (see Applicant's Remarks, pages 9-12, filed 12/01/25) with respect to the rejection of claims 1-20 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made in view of Haghighi et al., US 20220309811.

Regarding claim 1, Applicant states that the prior art of record fails to teach wherein each unannotated multi-dimensional medical image among the plurality of unannotated multi-dimensional medical images comprises a corresponding three-dimensional medical image or a corresponding four-dimensional medical image (Applicant's Remarks, pages 10-11). Haghighi et al. teaches that a visual word is defined as a segment of a consistent and recurrent anatomical pattern, with instances of a visual word as image cubes/patches (samples) extracted across different 3D/2D images for the same visual word (paragraph 0171), and transferable visual words, where the recurrent anatomical structures in medical images are anatomical visual words, which can be automatically discovered from unlabeled medical images, serving as strong yet free supervision signals for training deep models (paragraph 0172). Note: the visual word is a 3D image corresponding to an unlabeled medical image. Claims 1-3, 5, 9-13, 15, 19 and 20 are rejected.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 10/29/25 was filed. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Ma et al., US 20230306723, in view of Haghighi et al., US 20220309811, further in view of Poole, US 20200311911.
Regarding claim 1, Ma et al. teaches a computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations (paragraph 0084) comprising: obtaining a first training data set comprising a plurality of unannotated multi-dimensional medical images (receives a first set of training data which includes photographic images unrelated to a targeted medical diagnosis task; paragraph 0085 and Fig. 7A); executing a self-supervised masked image modeling (MIM) training process to pre-train an image encoder on the first training data set (processing logic pre-training an AI model on the first set of training data, which includes the photographic images, by learning image classification from the photographic images within the first set of training data; paragraph 0087); and obtaining a second training data set comprising a plurality of annotated multi-dimensional medical images (processing logic receives a second set of training data which includes a plurality of medical images derived from multiple distinct sources, in which the plurality of medical images are configured with multiple inconsistent annotation and classification data; paragraph 0086).

Ma et al. fails to teach wherein each unannotated multi-dimensional medical image among the plurality of unannotated multi-dimensional medical images comprises a corresponding three-dimensional medical image or a corresponding four-dimensional medical image. Haghighi et al. teaches this limitation (a visual word is defined as a segment of a consistent and recurrent anatomical pattern, with instances of a visual word as image cubes/patches (samples) extracted across different 3D/2D images for the same visual word, paragraph 0171; transferable visual words, where the recurrent anatomical structures in medical images are anatomical visual words, which can be automatically discovered from unlabeled medical images, serving as strong yet free supervision signals for training deep models, paragraph 0172; note: the visual word is a 3D image corresponding to an unlabeled medical image). Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Ma et al. to include this limitation. The reason for doing so would be to accurately train a model to detect a desired result.
Ma et al. in view of Haghighi et al. fails to teach each annotated multi-dimensional medical image comprising a plurality of image voxels each paired with a corresponding ground-truth label indicating a class the corresponding image voxel belongs to, and executing a supervised training process to train an image analysis model on the second training data set to teach the image analysis model to learn how to predict the corresponding ground-truth labels for the plurality of image voxels of each annotated multi-dimensional medical image, wherein the image analysis model incorporates the pre-trained image encoder.

Poole teaches these limitations (a plurality of sets of training data 60, where each set of training data 60 comprises a respective set of medical imaging data, for example a set of pixel or voxel intensities for an array of pixel or voxel positions, and each of the sets of training data 60 has been manually classified to obtain a ground truth label for each of the sets of training data, paragraph 0062; training circuitry 44 also provides the neural network 62 with primary ground truth labels 69 for at least some of the sets of training data 60, paragraph 0092; using the primary GT labels 69 and primary training data, the neural network is trained to perform a classification process to predict labels from primary input data, where the primary input data comprises image data, and positive supervision is used to train the neural network 62 to be sensitive to the selected features 64, for example to assign positive weights to the selected features, paragraph 0093; note: the selected features are a voxel feature 82, image feature 84, and image feature 86, paragraph 0105). Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Ma et al. in view of Haghighi et al. to include these limitations. The reason for doing so would be to accurately train a model to detect a desired result.
Regarding claim 11, Ma et al. teaches a system comprising data processing hardware, and memory hardware in communication with the data processing hardware storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations (paragraph 0084). The recited operations are the same as those of claim 1, and the rejection maps them the same way: Ma et al. teaches obtaining the first and second training data sets and the self-supervised MIM pre-training of the image encoder (paragraphs 0085-0087 and Fig. 7A); Haghighi et al. teaches that each unannotated multi-dimensional medical image comprises a corresponding three-dimensional or four-dimensional medical image (paragraphs 0171-0172); and Poole teaches the voxel-wise ground-truth labels and the supervised training of an image analysis model incorporating the pre-trained image encoder (paragraphs 0062, 0092-0093 and 0105). Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Ma et al. in view of Haghighi et al. and Poole to include these limitations. The reason for doing so would be to accurately train a model to detect a desired result.
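
For context, the claimed method is a standard two-stage recipe: self-supervised MIM pre-training of an encoder on unannotated 3D volumes, then supervised fine-tuning of a voxel-wise classifier that incorporates that encoder. The sketch below illustrates the shape of such a pipeline in PyTorch; every module, dimension, and hyperparameter is an illustrative assumption, not taken from the application or the cited references.

```python
# Illustrative two-stage pipeline: MIM pre-training on unannotated 3D volumes,
# then supervised voxel-wise fine-tuning that reuses the pre-trained encoder.
# Everything here (architecture, sizes, hyperparameters) is an assumption.
import torch
import torch.nn as nn

class Encoder3D(nn.Module):
    """Toy 3D encoder standing in for the claimed image encoder."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

def pretrain_mim(encoder, volumes, steps=10, mask_ratio=0.4):
    """Stage 1: reconstruct randomly masked voxels of unannotated volumes."""
    head = nn.Conv3d(16, 1, 1)  # reconstruction head (matches encoder channels)
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
    for _ in range(steps):
        x = volumes[torch.randint(len(volumes), (2,))]    # sample a mini-batch
        mask = (torch.rand_like(x) < mask_ratio).float()  # random voxel mask
        recon = head(encoder(x * (1 - mask)))             # encode masked input
        loss = ((recon - x) ** 2 * mask).sum() / mask.sum()  # masked-voxel loss
        opt.zero_grad()
        loss.backward()
        opt.step()

def finetune(encoder, volumes, labels, num_classes=3, steps=10):
    """Stage 2: supervised voxel-wise classification using the encoder."""
    model = nn.Sequential(encoder, nn.Conv3d(16, num_classes, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        idx = torch.randint(len(volumes), (2,))
        loss = ce(model(volumes[idx]), labels[idx])  # per-voxel ground truth
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Synthetic stand-ins for the two training data sets recited in the claims.
unannotated = torch.randn(8, 1, 16, 16, 16)          # 3D volumes, no labels
annotated = torch.randn(8, 1, 16, 16, 16)
voxel_labels = torch.randint(0, 3, (8, 16, 16, 16))  # one class per voxel

enc = Encoder3D()
pretrain_mim(enc, unannotated)
seg_model = finetune(enc, annotated, voxel_labels)
```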
Claims 2, 5, 12 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Ma et al., US 20230306723, in view of Haghighi et al., US 20220309811, further in view of Poole, US 20200311911, further in view of Liu et al., US 20240177838.

Regarding claim 2, Ma et al. in view of Haghighi et al. further in view of Poole teaches all of the limitations of claim 1 but fails to teach wherein executing the self-supervised MIM training process to pre-train the image encoder comprises, for each corresponding unannotated multi-dimensional medical image in the first training data set: generating, using an image tokenizer configured to receive the corresponding unannotated multi-dimensional medical image as input, a sequence of discrete visual tokens characterizing the corresponding unannotated multi-dimensional medical image; dividing the corresponding unannotated multi-dimensional medical image into a plurality of image patches; randomly masking a portion of the image patches divided from the corresponding unannotated multi-dimensional medical image; for each masked image patch, generating, using the image encoder, an encoded hidden representation for the masked image patch and, based on the encoded hidden representation, generating, using a decoder, a corresponding predicted token; determining a training loss based on the predicted tokens generated for the masked image patches and corresponding visual tokens from the sequence of discrete visual tokens that are aligned with the masked image patches; and updating parameters of the image encoder based on the training loss.

Liu et al. teaches these limitations: generating a sequence of discrete visual tokens (training the MAE model may include receiving a plurality of inputs 205, such as medical images, one or more prompts, etc.; as depicted in Fig. 2, the plurality of medical images may be divided into a plurality of fixed-size patches (image tokens 210), paragraph 0059 and Fig. 2); dividing the image into a plurality of image patches (paragraph 0059); randomly masking a portion of the image patches (a subset of the image tokens 215 may be intentionally removed from the plurality of medical images, leaving a remaining plurality of image tokens, paragraph 0059; an encoder, e.g., ViT Encoder 220, may output one or more encoded image tokens 225 based on the remaining plurality of image tokens, and masked tokens 235 may be appended with position encoding applied to each respective encoded image token, paragraph 0060); generating an encoded hidden representation for each masked image patch (paragraph 0060); generating a corresponding predicted token using a decoder (masked tokens 235, and optional classification token 230, may be fed into a ViT Decoder 240, paragraph 0061); determining a training loss (ViT Decoder 240 may be used to reconstruct the original image tokens, e.g., image tokens 210, to match or substantially match the original image pixel values, and the network may be optimized using L2 image reconstruction loss applied on the removed vision tokens, e.g., masked tokens 235, only, paragraph 0061); and updating parameters of the image encoder based on the training loss (a Student ViT Encoder 320 with masked tokens 337 may be trained with a distilled loss to predict the classification tokens 330 predicted by a Teacher ViT Encoder 335 fed with all the image tokens 310 without masking, and the Teacher ViT Encoder 335 may be updated using the moving average of the Student ViT Encoder 320 at the end of each training batch, paragraph 0064).

Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Ma et al. in view of Haghighi et al. further in view of Poole to include these limitations. The reason for doing so would be to accurately train a model to detect a desired result.
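
Claim 2 recites a BEiT-style variant of MIM: a tokenizer assigns each patch a discrete visual token, some patches are masked, and the model is trained to predict the tokens of the masked patches (note that Liu's cited MAE passages describe pixel reconstruction rather than discrete token prediction). A minimal sketch of the claimed token-prediction recipe follows; all names and dimensions are assumed.

```python
# Illustrative BEiT-style MIM step matching claim 2's recipe (names assumed):
# a frozen tokenizer assigns each patch a discrete visual token; the encoder
# sees the image with some patches replaced by a [MASK] embedding; a decoder
# head predicts the tokens of the masked patches; cross-entropy is computed
# only at masked positions.
import torch
import torch.nn as nn

num_patches, patch_dim, vocab, hidden = 64, 128, 512, 256

class Tokenizer(nn.Module):
    """Stand-in for a frozen discrete tokenizer (e.g., a learned codebook)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(patch_dim, vocab)

    @torch.no_grad()
    def forward(self, patches):                     # (B, N, patch_dim)
        return self.proj(patches).argmax(dim=-1)    # (B, N) discrete token ids

tokenizer = Tokenizer()
embed = nn.Linear(patch_dim, hidden)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), num_layers=2)
mask_embed = nn.Parameter(torch.zeros(hidden))      # learned [MASK] embedding
decoder_head = nn.Linear(hidden, vocab)             # predicts one token per patch

patches = torch.randn(2, num_patches, patch_dim)    # image divided into patches
tokens = tokenizer(patches)                         # claim 5: one token per patch

masked = torch.rand(2, num_patches) < 0.4           # randomly mask ~40% of patches
x = torch.where(masked.unsqueeze(-1),
                mask_embed.expand(2, num_patches, hidden),
                embed(patches))                     # masked patches -> [MASK]
hiddens = encoder(x)                                # encoded hidden representations
logits = decoder_head(hiddens)                      # predicted tokens per position

# Training loss only at masked positions, against the aligned visual tokens;
# an optimizer step on this loss would update the encoder's parameters.
loss = nn.functional.cross_entropy(logits[masked], tokens[masked])
loss.backward()
```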
Regarding claim 5, Ma et al. in view of Haghighi et al. further in view of Poole teaches all of the limitations of claim 1 but fails to teach wherein a number of visual tokens in the sequence of discrete visual tokens is equal to a number of image patches in the plurality of image patches. Liu et al. teaches this limitation (the plurality of medical images may be divided into a plurality of fixed-size patches (image tokens 210), paragraph 0059). Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Ma et al. in view of Haghighi et al. further in view of Poole to include this limitation. The reason for doing so would be to accurately train a model to detect a desired result.

Regarding claim 12, Ma et al. in view of Haghighi et al. further in view of Poole teaches all of the limitations of claim 11 but fails to teach the MIM pre-training limitations recited above for claim 2. Liu et al. teaches those limitations for the reasons given in the rejection of claim 2 (paragraphs 0059-0061 and 0064), and it would have been obvious to include them for the same reason.

Regarding claim 15, Ma et al. in view of Haghighi et al. further in view of Poole teaches all of the limitations of claim 11 but fails to teach wherein a number of visual tokens in the sequence of discrete visual tokens is equal to a number of image patches in the plurality of image patches. Liu et al. teaches this limitation (paragraph 0059), and it would have been obvious to include it for the same reason as claim 5.

Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Ma et al., US 20230306723, in view of Haghighi et al., US 20220309811, further in view of Poole, US 20200311911, further in view of Liu et al., US 20240177838, further in view of Dalli et al., US 20220198254.

Regarding claim 3, Ma et al. in view of Haghighi et al. further in view of Poole further in view of Liu et al. teaches all of the limitations of claims 1 and 2 but fails to teach wherein the image encoder comprises a plurality of multi-head attention layers and the decoder comprises a plurality of multi-head attention layers.
Dalli et al. teaches these limitations (a parallel explainable encoder layer 1630 which takes two inputs: the output of the Multi-Head Attention component 215 or the output of the Add and Normalize component 217, paragraph 0124; the parallel explainable encoder layer is used as input to the Multi-Head Attention layer 1631 in the decoder layer, paragraph 0125). Therefore, it would have been obvious to a person with ordinary skill in the art to have modified Ma et al. in view of Haghighi et al. further in view of Poole further in view of Liu et al. to include these limitations. The reason for doing so would be to accurately train a model to detect a desired result.

Regarding claim 13, Ma et al. in view of Haghighi et al. further in view of Poole further in view of Liu et al. teaches all of the limitations of claims 11 and 12 but fails to teach the multi-head attention limitations recited above for claim 3. Dalli et al. teaches them for the same reasons (paragraphs 0124-0125), and it would have been obvious to include them for the same reason.
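
Claims 3 and 13 only require that both halves of the model stack multi-head attention layers. A minimal sketch of such an encoder and decoder pair, with all dimensions assumed:

```python
# Minimal encoder/decoder pair where both sides comprise stacked multi-head
# attention layers, as recited in claims 3 and 13 (all dimensions assumed).
import torch
import torch.nn as nn

d_model, nhead, layers = 256, 8, 4

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
    num_layers=layers)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
    num_layers=layers)

patch_seq = torch.randn(2, 64, d_model)   # embedded image patches
memory = encoder(patch_seq)               # self-attention over patches
queries = torch.randn(2, 64, d_model)     # e.g., mask-token queries
out = decoder(queries, memory)            # self- plus cross-attention
print(out.shape)                          # torch.Size([2, 64, 256])
```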
Claims 9, 10, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ma et al., US 20230306723, in view of Haghighi et al., US 20220309811, further in view of Poole, US 20200311911, further in view of Bengtsson et al., US 20210401392.

Regarding claim 9, Ma et al. in view of Haghighi et al. further in view of Poole teaches all of the limitations of claim 1 but fails to teach wherein the image analysis model comprises a tumor segmentation model. Bengtsson et al. teaches this limitation (radiologist tumor segmentation, paragraph 0149). Therefore, it would have been obvious to a person with ordinary skill in the art to have modified the combination to include a tumor segmentation model. The reason for doing so would be to accurately train a specific model to detect a desired result.

Regarding claim 10, Ma et al. in view of Haghighi et al. further in view of Poole teaches all of the limitations of claim 1 but fails to teach wherein the image analysis model comprises a multi-organ segmentation model. Bengtsson et al. teaches this limitation (a three-dimensional organ segmentation model, paragraph 0018), and it would have been obvious to include it for the same reason.

Regarding claims 19 and 20, Ma et al. in view of Haghighi et al. further in view of Poole teaches all of the limitations of claim 11 but fails to teach the tumor segmentation model (claim 19) and multi-organ segmentation model (claim 20) limitations. Bengtsson et al. teaches them (paragraphs 0149 and 0018), and it would have been obvious to include them for the same reasons as claims 9 and 10.

Allowable Subject Matter

Claims 4, 6-8, 14 and 16-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication should be directed to Michael Burleson, whose telephone number is (571) 272-7460 and fax number is (571) 273-7460. The examiner can normally be reached Monday through Friday from 8:00 a.m. to 4:30 p.m.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Akwasi Sarpong, can be reached at (571) 270-3438. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Michael Burleson
Patent Examiner, Art Unit 2681
/MICHAEL BURLESON/ March 1, 2026
/AKWASI M SARPONG/ SPE, Art Unit 2681, 3/9/2026

Prosecution Timeline

Apr 20, 2023
Application Filed
Sep 06, 2025
Non-Final Rejection — §103
Dec 01, 2025
Response Filed
Mar 01, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603965
PRINTING DEVICE SETTING EXPANDED REGION AND GENERATING PATCH CHART PRINT DATA BASED ON PIXELS IN EXPANDED REGION
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12585826
DOCUMENT AUTHENTICATION USING ELECTROMAGNETIC SOURCES AND SENSORS
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12566125
SEQUENCER FOCUS QUALITY METRICS AND FOCUS TRACKING FOR PERIODICALLY PATTERNED SURFACES
Granted Mar 03, 2026 (2y 5m to grant)
Patent 12561548
SYSTEM SIMULATING A DECISIONAL PROCESS IN A MAMMAL BRAIN ABOUT MOTIONS OF A VISUALLY OBSERVED BODY
Granted Feb 24, 2026 (2y 5m to grant)
Patent 12562549
LIGHT EMITTING ELEMENT, LIGHT SOURCE DEVICE, DISPLAY DEVICE, HEAD-MOUNTED DISPLAY, AND BIOLOGICAL INFORMATION ACQUISITION APPARATUS
Granted Feb 24, 2026 (2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 75%
With Interview: 68% (-6.1%)
Median Time to Grant: 2y 10m
PTA Risk: Moderate
Based on 489 resolved cases by this examiner. Grant probability derived from career allow rate.
