Prosecution Insights
Last updated: April 19, 2026
Application No. 18/475,406

SYSTEM AND METHOD FOR ONE-SHOT ANATOMY LOCALIZATION WITH UNSUPERVISED VISION TRANSFORMERS FOR THREE-DIMENSIONAL (3D) MEDICAL IMAGES

Status: Final Rejection — §103
Filed: Sep 27, 2023
Examiner: CROCKETT, JOSHUA BRIGHAM
Art Unit: 2661
Tech Center: 2600 — Communications
Assignee: GE Precision Healthcare LLC
OA Round: 2 (Final)
Grant Probability: 72% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 0m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 72%, above average (13 granted / 18 resolved; +10.2% vs Tech Center average)
Interview Lift: +27.5%, a strong lift (based on resolved cases with interview)
Typical Timeline: 3y 0m average prosecution; 26 applications currently pending
Career History: 44 total applications across all art units

Statute-Specific Performance

§101: 6.0% (-34.0% vs TC avg)
§103: 47.5% (+7.5% vs TC avg)
§102: 10.1% (-29.9% vs TC avg)
§112: 35.1% (-4.9% vs TC avg)

Tech Center averages are estimates. Based on career data from 18 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 20 January 2026 was received and has been considered by the examiner.

Response to Arguments

Claims 1, 7, 9, 14, and 16 are amended. Claim 4 is canceled. Claims 1-3 and 5-20 are pending in this action.

Applicant's arguments, see pg. 8, filed 16 January 2026, with respect to the objections to claims 7, 9, and 16 have been fully considered and are persuasive. Specifically, the applicant amended the claims to correct minor informalities. The objections to claims 7, 9, and 16 have been withdrawn.

Applicant's arguments, see pg. 8-10, filed 16 January 2026, with respect to the rejections of claims 1-6 under 35 U.S.C. 103 have been fully considered and are not persuasive. The applicant argues that Yan et al. ("SAM: Self-supervised Learning of Pixel-wise Anatomical Embeddings in Radiological Images"; full reference on PTO-892 submitted on 6 November 2025; hereafter, Yan) in view of Liu et al. (US 20180089530 A1; hereafter, Liu) does not disclose "assigning, via the processor, the corresponding pixel level features in the medical image an anatomical label corresponding to a respective anatomical label of the region of interest in the template image", as recited in independent claim 1. The examiner disagrees.

The applicant points out that Yan finds matching "arbitrary" landmarks of interest and argues that Yan therefore does not disclose "assigning a corresponding pixel level feature in a medical image an anatomical label corresponding to a respective anatomical label of a ROI in the template image" (applicant's remarks filed 16 January 2026, pg. 10). However, the broadest reasonable interpretation of an anatomical label is a marking or text, i.e. a label, pointing out or identifying a piece of anatomy in an image. Further, the use of "arbitrary" in Yan is not to be equated with unimportant or non-anatomical. Rather, "arbitrary" refers to the ability of the user to define the point of interest in the anatomy regardless of the location of the point of interest; see Yan, pg. 11 col. 1 para. last: "To show that SAM can be used to detect arbitrary anatomical locations, we randomly select a point in a template CT image, and then use SAM to find its matched point in a query image from another patient." Such an arbitrary point marks a region of anatomy as a region of interest, which is understood as an anatomical label. The anatomical label is marked in a template image, and corresponding pixel level features in the target medical image are identified and marked or labeled (Yan, Fig. 3, the matched anatomical point is marked on the output image, which is understood as a labeling). Therefore, the applicant's argument is not persuasive. The rejection of claims 1-3 and 5-6 (claim 4 being canceled) under 35 U.S.C. 103 is maintained.

Applicant's arguments, see pg. 10-11, filed 16 January 2026, with respect to the rejections of claims 7-20 under 35 U.S.C. 103 have been fully considered and are persuasive.
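The matching step the examiner leans on here (mark a point on a template image, then find its best match in a query image by embedding similarity) is easy to illustrate. The following is a minimal sketch, assuming L2-normalized per-pixel embedding maps; the function and variable names are illustrative and this is not Yan's published code:

```python
# Hedged sketch of one-shot template-point matching via embedding similarity.
import numpy as np

def match_template_point(template_emb, query_emb, point):
    """template_emb, query_emb: (C, H, W) L2-normalized embedding maps.
    point: (row, col) marked on the template image.
    Returns the (row, col) in the query image with highest cosine similarity."""
    r, c = point
    key = template_emb[:, r, c]               # (C,) embedding of the marked point
    C, H, W = query_emb.shape
    sims = key @ query_emb.reshape(C, H * W)  # cosine similarity at every location
    best = int(np.argmax(sims))
    return divmod(best, W)                    # (row, col) of the matched point

# Toy usage: random unit-norm embeddings stand in for real model outputs.
rng = np.random.default_rng(0)
t = rng.standard_normal((16, 32, 32)); t /= np.linalg.norm(t, axis=0, keepdims=True)
q = rng.standard_normal((16, 32, 32)); q /= np.linalg.norm(q, axis=0, keepdims=True)
print(match_template_point(t, q, (10, 12)))
```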
Specifically, the scope of the claim has changed due to the applicant's amendment, "clustering, via the processor, pixel level features from both the medical image and the template image together via paired clustering into anatomically similar regions", and the applicant argues that Yan in view of Liu does not disclose the amended language. The examiner agrees. Therefore, the rejection has been withdrawn.

However, upon further consideration, a new ground of rejection is made in view of Salehi et al. (US 20260017921 A1; hereafter, Salehi). Salehi discloses: clustering, via the processor, pixel level features from both the medical image and the template image together via paired clustering ([0099]: "In some examples, the propagated cluster map can be determined using a propagator. For example, patch mapping between the first set of features and the second set of features can be performed using a Temporal Patch Propagator (TPP)". Therefore, the first set of features from the source image, i.e. the template image, is clustered with the second set of features from the target image, i.e. the medical image, to generate a propagated cluster map. This clustering is a clustering of features from both images and is therefore understood as a paired clustering. See also [0082] for how features are clustered to form the propagated cluster map: "The propagator 470 can be a TPP which utilizes the modified cluster map 457, the source F1 feature map 456, and the target F2 feature map 452 to determine and generate as output a propagated cluster map 472") into anatomically similar regions ([0099]: "In some cases, the propagated cluster map can be indicative of a correspondence of patches between the source image and the target image." Corresponding patches are understood as similar regions. A person of ordinary skill in the art would understand that, whether the subject of the image is anatomical, as taught by Yan, or another subject, the invention of Salehi would still operate to find similar regions. Therefore, when considered in combination with Yan, the regions are understood as anatomical regions as shown by Yan).

The full rejection, including motivations to combine, is included below in the section "Claim Rejections - 35 USC § 103". Therefore, new grounds of rejection necessitated by the applicant's amendment are made for claims 7-20 under 35 U.S.C. 103.
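For reference, "paired clustering" in the sense argued here (features from both images pooled into a single clustering so that shared cluster ids mark corresponding regions) can be sketched as follows. This uses plain k-means as a stand-in for illustration only; Salehi's Temporal Patch Propagator is a different, learned mechanism:

```python
# Hedged sketch: joint clustering of features from a template and a medical
# image, so that matching cluster ids mark anatomically similar regions.
import numpy as np
from sklearn.cluster import KMeans

def paired_cluster(template_feats, medical_feats, k=8, seed=0):
    """Each input: (N, C) per-pixel/per-patch feature rows for one image.
    Returns per-image label arrays drawn from a shared set of k cluster ids."""
    both = np.vstack([template_feats, medical_feats])   # cluster jointly
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(both)
    n = len(template_feats)
    return labels[:n], labels[n:]                       # same id => similar region

tmpl = np.random.default_rng(1).standard_normal((1024, 16))
med = np.random.default_rng(2).standard_normal((1024, 16))
tmpl_labels, med_labels = paired_cluster(tmpl, med)
```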
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3 and 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al. ("SAM: Self-supervised Learning of Pixel-wise Anatomical Embeddings in Radiological Images"; full reference on PTO-892; hereafter, Yan) in view of Liu et al. (US 20180089530 A1; hereafter, Liu).

Regarding claim 1, Yan discloses: A computer-implemented method for performing one-shot anatomy localization, comprising:

obtaining, at a processor (pg. 5 col. 1 para. 1, the operation is performed on a GPU; further, a person of ordinary skill in the art would understand that the described process of digital image processing would be performed on a processor), a medical image of a subject (pg. 4 col. 2 para. last, the process may be applied to anatomical images, in which the query image comprising a medical image is received);

receiving, at the processor, a selection of both a template image (pg. 4 col. 2 para. last, a template image is received) and a region of interest within the template image (pg. 4 col. 2 para. last, a point of interest, i.e. a reference point, is marked on the reference image; as the point of interest highlights an area on the reference image, it is understood to mark a region of interest), wherein the template image includes one or more anatomical landmarks (pg. 5 col. 2 para. 2, template images include anatomical landmarks) assigned a respective anatomical label (pg. 5 col. 2 para. 2, the landmarks are chosen based on anatomy, such as lung or trachea bifurcation, which is understood as labels);

inputting, via the processor, both the medical image and the template image into a trained model (Fig. 3, both the template image and the query image, i.e. medical image, are input to determine the 4D embedding tensors);

outputting, via the processor, from the trained vision transformer model both patch level features (Fig. 2 and pg. 3 col. 2 para. 2, the model outputs a local embedding tensor. Fig. 5, the local embedding tensor has a lower definition than the input image and is therefore understood as patch level features. See also pg. 3 col. 2 para. 2, "[the local embedding] is from the finest FPN level with a smaller stride and detailed features". As it has gone through at least one layer of FPN, which includes convolution, a person of ordinary skill in the art would understand this to have a lower resolution than the input, such that it may be understood as patch level features) and image level features (Fig. 2 and pg. 3 col. 2 para. 2, the model outputs a global embedding tensor; as it shows global features, it is understood as image level features) for both the medical image and the template image (pg. 4 col. 2 para. last and Fig. 3, local and global embeddings, i.e. patch and image level features, are determined for both the template image and the medical image); and

utilizing, via the processor, the patch level features and the image level features within the region of interest of the template image (pg. 5 col. 1 para. 1, similarity maps, based on the global and local embeddings, are generated and used in localization; as the similarity maps are based on global and local embeddings, they are understood to utilize both patch level and image level features) to locate and label corresponding pixel level features in the medical image (pg. 5 col. 1 para. 1, the matched anatomical point is found. Fig. 3, the matched anatomical point is marked on the output image. This is understood as locating and labeling pixel level features); and

assigning, via the processor, the corresponding pixel level features in the medical image an anatomical label corresponding to a respective anatomical label of the region of interest in the template image (pg. 5 col. 1 para. 1, the matched anatomical point is found. Fig. 3, the matched anatomical point is marked on the output image. This is understood as labeling the medical image. See also Fig. 9, the anatomically similar point to the point in the template image is found in the query image, such as right edge sternum, left spinal extensors, etc. A person of ordinary skill in the art would understand that the query image is marked and labeled with the same term as the template image. See also pg. 11 col. 1 para. last: "To show that SAM can be used to detect arbitrary anatomical locations, we randomly select a point in a template CT image, and then use SAM to find its matched point in a query image from another patient." Therefore, a point is labeled in the template image and a "matched point" is found in a query image from another patient. As shown by Fig. 3, detecting the matched point involves labeling the respective anatomy as a region of interest).

Yan does not disclose expressly that the trained model is a vision transformer model. Liu discloses: a vision transformer model (the examiner is interpreting a vision transformer model as a machine learning model which considers patches of an image; [0016], patches of an image are input into a machine learning model; therefore, the model is understood as a vision transformer model).

Yan and Liu are combinable because they are from the same field of endeavor of landmark detection (Yan, pg. 2 col. 1 para. 4; Liu, [0012]). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to use the vision transformer model of Liu with the invention of Yan. The motivation for doing so would have been "In a landmark detection task, a sliding window approach can be used, in which a large number of image patches are examined by sliding a window of a certain size over the whole image or volume. . . Embodiments of the present invention provide methods for lowering a dimensionality of the input vector for a given image patch and thereby achieving speedup of landmark detection tasks using deep neural networks" (Liu, [0014]). Therefore, it would have been obvious to combine Liu with Yan to obtain the invention as specified in claim 1.
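The claim 1 mapping turns on fusing a coarse image-level similarity map with a fine patch-level one. A hedged sketch of one way such global and local maps could be combined for localization follows; the elementwise-product fusion rule and all names are illustrative assumptions, not quotes from Yan:

```python
# Sketch: fuse coarse (image-level) and fine (patch-level) similarity maps.
import numpy as np

def fused_similarity(key_global, key_local, glob_map, loc_map):
    """key_*: (C,) embeddings of the template point.
    glob_map: (C, Hg, Wg) coarse map; loc_map: (C, H, W) fine map."""
    g = np.einsum("c,chw->hw", key_global, glob_map)
    # Nearest-neighbor upsample the coarse similarity to the fine resolution.
    H, W = loc_map.shape[1:]
    rows = np.linspace(0, g.shape[0] - 1, H).astype(int)
    cols = np.linspace(0, g.shape[1] - 1, W).astype(int)
    g = g[rows][:, cols]
    l = np.einsum("c,chw->hw", key_local, loc_map)
    return g * l   # high only where coarse and fine similarity agree

kg, kl = np.ones(8), np.ones(8)
gm = np.random.rand(8, 8, 8); lm = np.random.rand(8, 32, 32)
print(fused_similarity(kg, kl, gm, lm).shape)   # (32, 32)
```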
Regarding claim 2, Yan in view of Liu discloses the subject matter of claim 1. Yan further discloses: The computer-implemented method of claim 1, wherein the trained vision transformer model was trained on a plurality of unlabeled medical images utilizing self-supervised learning (pg. 3 col. 1 para. 1, the training method is unsupervised, which is understood as self-supervised; pg. 3 col. 2 para. 2, the training sample is unlabeled).

Regarding claim 3, Yan in view of Liu discloses the subject matter of claim 2. Yan further discloses: The computer-implemented method of claim 1, further comprising:

obtaining, at the processor, an orthogonal set of medical images of the subject (pg. 5 col. 2 para. 1-2, the system may be used for 3D landmark detection and receives a dataset of 3D medical images. A 3D medical image is understood as an orthogonal set of medical images. As Fig. 4 shows both a model for 2D images and a model for 3D images, and as pg. 5 col. 2 para. 2 describes data for a 3D task and pg. 5 col. 2 para. 3 describes data for a 2D task, it is understood that the disclosed method may be operated on either 2D images or 3D images. Therefore, the following mapping is understood to apply to 3D images just as to 2D images), wherein the orthogonal set of medical images describe a three-dimensional volume of a region of interest of the subject (pg. 5 col. 1 para. 1, the 3D medical image set is of the chest-abdomen-pelvis region, i.e. a 3D volume of a region of interest);

receiving, at the processor, a selection of both a corresponding template image (pg. 4 col. 2 para. last, a template image is received; pg. 5 col. 2 para. 2, a template image is selected) and respective region of interest within the corresponding template image (pg. 4 col. 2 para. last, a point of interest, i.e. a reference point, is marked on the reference image, which is understood to indicate a respective region of interest) to utilize with each respective medical image of the orthogonal set of images (pg. 5 col. 2 para. 1, the ChestCT dataset is described as including at least 94 3D images, at least one for each patient; pg. 5 col. 2 para. 2, a single template image is selected; therefore, it is understood that the template image and the region of interest in the template image are used for each of the medical images), wherein each corresponding template image includes one or more anatomical landmarks assigned respective anatomical labels (pg. 5 col. 2 para. 2, template images include anatomical landmarks; the landmarks are chosen based on anatomy, such as lung or trachea bifurcation, which is understood as labels);

inputting, via the processor, both the orthogonal set of medical images and the corresponding template images (Fig. 3, both the template image and the query image, i.e. medical image, are input to determine the 4D embedding tensors) into the trained model (Fig. 2 and pg. 3 col. 2 para. 2, the 4D embedding tensors, as obtained above, are obtained by a machine learning model, e.g. 3D ResNet + FPN);

outputting, via the processor, from the trained vision transformer model both respective patch level features (Fig. 2 and pg. 3 col. 2 para. 2, the model outputs a local embedding tensor. Fig. 5, the local embedding tensor has a lower definition than the input image and is therefore understood as patch level features. See also pg. 3 col. 2 para. 2, "[the local embedding] is from the finest FPN level with a smaller stride and detailed features". As it has gone through at least one layer of FPN, which includes convolution, a person of ordinary skill in the art would understand this to have a lower resolution than the input, such that it may be understood as patch level features) and respective image level features (Fig. 2 and pg. 3 col. 2 para. 2, the model outputs a global embedding tensor; as it shows global features, it is understood as image level features) for both the orthogonal set of medical images and the corresponding template images (pg. 4 col. 2 para. last and Fig. 3, local and global embeddings, i.e. patch and image level features, are determined for both the template image and the medical image);

interpolating, via the processor, respective pixel level features from the respective patch level features (pg. 5 col. 1 para. 1, the global similarity map is upsampled to the size of the original images, which is understood as pixel level features. The global similarity map is a comparison of the patch features for the medical images and the template images. Upsampling is commonly understood to incorporate interpolation. Therefore, patch features are interpolated to pixel level features) for both the orthogonal set of medical images and the corresponding template images (pg. 5 col. 1 para. 1, as the global similarity map is based off of the patch level features of the medical images, i.e. query images, and of the template images, the pixel level features are understood as being for both the medical images and the template images); and

utilizing, via the processor, the respective pixel level features within the respective region of interest of each corresponding template image (pg. 5 col. 1 para. 1, similarity maps, based on the global and local embeddings, are generated and upsampled to the size of the original image; as they are the size of the original image, they are understood as pixel level features; they are based on the template and medical image) to locate and label corresponding pixel level features in each corresponding respective medical image of the orthogonal set of images (pg. 5 col. 1 para. 1, the matched anatomical point is found at the pixel level. Fig. 3, the matched anatomical point is marked on the output image. This is understood as locating and labeling pixel level features).

Yan does not disclose expressly that the model is a vision transformer model. Liu discloses: a vision transformer model (the examiner is interpreting a vision transformer model as a machine learning model which considers patches of an image; [0016], patches of an image are input into a machine learning model; therefore, the model is understood as a vision transformer model).

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to use the vision transformer model of Liu with the invention of Yan. The motivation for doing so would have been "In a landmark detection task, a sliding window approach can be used, in which a large number of image patches are examined by sliding a window of a certain size over the whole image or volume. . . Embodiments of the present invention provide methods for lowering a dimensionality of the input vector for a given image patch and thereby achieving speedup of landmark detection tasks using deep neural networks" (Liu, [0014]). Therefore, it would have been obvious to combine Liu with Yan to obtain the invention as specified in claim 3.
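The "interpolating pixel level features from patch level features" step, as mapped above to upsampling, is essentially a resize of a low-resolution feature map back to image resolution. A minimal sketch using bilinear interpolation (PyTorch is an arbitrary choice here, and the sizes are illustrative):

```python
# Sketch: upsample patch-level features to pixel resolution by interpolation.
import torch
import torch.nn.functional as F

patch_feats = torch.randn(1, 128, 14, 14)   # (batch, C, H/stride, W/stride)
pixel_feats = F.interpolate(patch_feats, size=(224, 224),
                            mode="bilinear", align_corners=False)
print(pixel_feats.shape)                     # torch.Size([1, 128, 224, 224])
```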
Regarding claim 5, Yan in view of Liu discloses the subject matter of claim 1. Yan further discloses: The computer-implemented method of claim 1, further comprising marking, via the processor, the region of interest in the template image with a first reference point (pg. 4 col. 2 para. last, a point of interest, i.e. a reference point, is marked on the reference image; as the point of interest highlights an area on the reference image, it is understood to mark a region of interest).

Regarding claim 6, Yan in view of Liu discloses the subject matter of claim 5. Yan further discloses: The computer-implemented method of claim 5, further comprising marking, via the processor, a corresponding region of interest in the medical image with a second reference point (pg. 5 col. 1 para. 1, the matched anatomical point is found. Fig. 3, the matched anatomical point is marked on the output image) that corresponds to the region of interest in the template image with the first reference point (pg. 5 col. 1 para. 1, the marked point on the output image is the one with the highest similarity to the reference point on the template image, which is understood as the region of interest marked by the first reference point).

Claims 7-12 and 14-19 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al. ("SAM: Self-supervised Learning of Pixel-wise Anatomical Embeddings in Radiological Images"; full reference on PTO-892; hereafter, Yan) in view of Salehi et al. (US 20260017921 A1; hereafter, Salehi).

Regarding claim 7, Yan discloses: A computer-implemented method for performing one-shot anatomy localization, comprising:

obtaining, at a processor (pg. 5 col. 1 para. 1, the operation is performed on a GPU; further, a person of ordinary skill in the art would understand that the described process of digital image processing would be performed on a processor), a medical image of a subject (pg. 4 col. 2 para. last, the process may be applied to anatomical images, in which the query image comprising a medical image is received);

receiving, at the processor, a selection of a template image (pg. 4 col. 2 para. last, a template image is received), wherein the template image includes one or more anatomical landmarks (pg. 5 col. 2 para. 2, template images include anatomical landmarks) assigned a respective anatomical label (pg. 5 col. 2 para. 2, the landmarks are chosen based on anatomy, such as lung or trachea bifurcation, which is understood as labels), and a first reference point is marked on the template image (pg. 4 col. 2 para. last, a point of interest, i.e. a reference point, is marked on the reference image);

inputting, via the processor, both the medical image and the template image (Fig. 3, both the template image and the query image, i.e. medical image, are input to determine the 4D embedding tensors) into a trained model (Fig. 2 and pg. 3 col. 2 para. 2, the 4D embedding tensors, as obtained above, are obtained by a machine learning model, e.g. 3D ResNet + FPN);

outputting, via the processor, from the trained vision transformer model both patch level features (Fig. 2 and pg. 3 col. 2 para. 2, the model outputs a local embedding tensor. Fig. 5, the local embedding tensor has a lower definition than the input image and is therefore understood as patch level features. See also pg. 3 col. 2 para. 2, "[the local embedding] is from the finest FPN level with a smaller stride and detailed features". As it has gone through at least one layer of FPN, which includes convolution, a person of ordinary skill in the art would understand this to have a lower resolution than the input, such that it may be understood as patch level features) and image level features (Fig. 2 and pg. 3 col. 2 para. 2, the model outputs a global embedding tensor; as it shows global features, it is understood as image level features) for both the medical image and the template image (pg. 4 col. 2 para. last and Fig. 3, local and global embeddings, i.e. patch and image level features, are determined for both the template image and the medical image);

wherein the pixel level features are derived from the patch level features and the image level features (pg. 5 col. 1 para. 1, similarity maps, based on the global and local embeddings, are generated and upsampled to the size of the original image; as they are the size of the original image, they are understood as pixel level features); and

assigning, via the processor, cluster labels to pixels of both the medical image (pg. 5 col. 1 para. 1, the matched anatomical point is found. Fig. 3, the matched anatomical point is marked on the output image. This is understood as labeling the medical image) and the template image for corresponding anatomically similar regions (pg. 4 col. 2 para. last, a point of interest is marked on the reference image, which is understood as labeling the pixels in the template image).

Yan does not disclose expressly that the model is a vision transformer model, nor clustering pixel level features via paired clustering into anatomically similar regions. Salehi discloses: a vision transformer model ([0090], a vision transformer model generates features from a source, or template, image, and [0093], from a target image; see [0053] for how the vision transformer model operates on patches); and clustering, via the processor, pixel level features from both the medical image and the template image together via paired clustering ([0099]: "In some examples, the propagated cluster map can be determined using a propagator. For example, patch mapping between the first set of features and the second set of features can be performed using a Temporal Patch Propagator (TPP)". Therefore, the first set of features from the source image, i.e. the template image, is clustered with the second set of features from the target image, i.e. the medical image, to generate a propagated cluster map. This clustering is a clustering of features from both images and is therefore understood as a paired clustering. See also [0082] for how features are clustered to form the propagated cluster map: "The propagator 470 can be a TPP which utilizes the modified cluster map 457, the source F1 feature map 456, and the target F2 feature map 452 to determine and generate as output a propagated cluster map 472") into anatomically similar regions ([0099]: "In some cases, the propagated cluster map can be indicative of a correspondence of patches between the source image and the target image." Corresponding patches are understood as similar regions. A person of ordinary skill in the art would understand that, whether the subject of the image is anatomical, as taught by Yan, or another subject, the invention of Salehi would still operate to find similar regions. Therefore, when considered in combination with Yan, the regions are understood as anatomical regions as shown by Yan).

Salehi is combinable with Yan because it is from the related field of endeavor of detecting an object between different image frames (Salehi, [0031]). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to use the vision transformer model and clustering of Salehi with the invention of Yan. The motivation for doing so would have been "For example, the systems and techniques can be used to perform unsupervised semantic segmentation based on using temporally-propagated cluster maps. In some examples, the temporally-propagated cluster maps can be utilized as a time-based supervision signal for the unsupervised semantic segmentation. The systems and techniques can also be used to perform other operations or tasks, such as object detection, depth estimation, or other operation or task" (Salehi, [0029]). In other words, the vision transformer model and clustering of Salehi allow for unsupervised segmentation of objects across frames, which is beneficial for tracking regions of interest and objects. Therefore, it would have been obvious to combine Salehi with Yan to obtain the invention as specified in claim 7.
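Under the interpretation applied above, a model qualifies as a vision transformer if it operates on image patches. A minimal sketch of the standard patch-embedding front end (dimensions illustrative; a full model would feed these tokens to a transformer encoder):

```python
# Sketch: patch tokenization, the front end of a vision transformer.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch=16, chans=1, dim=256):
        super().__init__()
        # A strided conv is the standard trick for "flatten + linearly project"
        # each non-overlapping patch into one token.
        self.proj = nn.Conv2d(chans, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, chans, H, W)
        x = self.proj(x)                       # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim) tokens

tokens = PatchEmbed()(torch.randn(1, 1, 224, 224))
print(tokens.shape)                            # torch.Size([1, 196, 256])
```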
Regarding claim 8, Yan in view of Salehi discloses the subject matter of claim 7. Yan further discloses: The computer-implemented method of claim 7, wherein the trained vision transformer model was trained on a plurality of unlabeled medical images utilizing self-supervised learning (pg. 3 col. 1 para. 1, the training method is unsupervised, which is understood as self-supervised; pg. 3 col. 2 para. 2, the training sample is unlabeled).

Regarding claim 9, Yan in view of Salehi discloses the subject matter of claim 7. Yan further discloses: The computer-implemented method of claim 7, further comprising:

obtaining, at the processor, an orthogonal set of medical images of the subject (pg. 5 col. 2 para. 1-2, the system may be used for 3D landmark detection and receives a dataset of 3D medical images. A 3D medical image is understood as an orthogonal set of medical images. As Fig. 4 shows both a model for 2D images and a model for 3D images, and as pg. 5 col. 2 para. 2 describes data for a 3D task and pg. 5 col. 2 para. 3 describes data for a 2D task, it is understood that the disclosed method may be operated on either 2D images or 3D images. Therefore, the following mapping is understood to apply to 3D images just as to 2D images), wherein the orthogonal set of medical images describe a three-dimensional volume of a region of interest of the subject (pg. 5 col. 1 para. 1, the 3D medical image set is of the chest-abdomen-pelvis region, i.e. a 3D volume of a region of interest);

receiving, at the processor, a selection of a set of template images (pg. 4 col. 2 para. last, a template image is received), wherein each template image of the set of template images includes one or more anatomical landmarks assigned respective anatomical labels (pg. 5 col. 2 para. 2, template images include anatomical landmarks; the landmarks are chosen based on anatomy, such as lung or trachea bifurcation, which is understood as labels), and a respective reference point is marked on each template image of the set of template images (pg. 4 col. 2 para. last, a point of interest, i.e. a reference point, is marked on the reference image), wherein each template image of the set of template images corresponds to a respective medical image of the set of medical images (pg. 5 col. 2 para. 2, a template image is chosen from the set of CT images);

inputting, via the processor, both the orthogonal set of medical images and the set of template images (Fig. 3, both the template image and the query image, i.e. medical image, are input to determine the 4D embedding tensors) into the trained model (Fig. 2 and pg. 3 col. 2 para. 2, the 4D embedding tensors, as obtained above, are obtained by a machine learning model, e.g. 3D ResNet + FPN);

outputting, via the processor, from the trained vision transformer model both respective patch level features (Fig. 2 and pg. 3 col. 2 para. 2, the model outputs a local embedding tensor. Fig. 5, the local embedding tensor has a lower definition than the input image and is therefore understood as patch level features. See also pg. 3 col. 2 para. 2, "[the local embedding] is from the finest FPN level with a smaller stride and detailed features". As it has gone through at least one layer of FPN, which includes convolution, a person of ordinary skill in the art would understand this to have a lower resolution than the input, such that it may be understood as patch level features) and respective image level features (Fig. 2 and pg. 3 col. 2 para. 2, the model outputs a global embedding tensor; as it shows global features, it is understood as image level features) for both the orthogonal set of medical images and the set of template images (pg. 4 col. 2 para. last and Fig. 3, local and global embeddings, i.e. patch and image level features, are determined for both the template image and the medical image);

interpolating, via the processor, respective pixel level features from the respective patch level features (pg. 5 col. 1 para. 1, the global similarity map is upsampled to the size of the original images, which is understood as pixel level features. The global similarity map is a comparison of the patch features for the medical images and the template images. Upsampling is commonly understood to incorporate interpolation. Therefore, patch features are upsampled to pixel level features) for both the orthogonal set of medical images and the set of template images (pg. 5 col. 1 para. 1, as the global similarity map is based off of the patch level features of the medical images, i.e. query images, and of the template images, the pixel level features are understood as being for both the medical images and the template images); and

assigning, via the processor, cluster labels to the pixels of both the orthogonal set of medical images (pg. 5 col. 1 para. 1, the matched anatomical point is found. Fig. 3, the matched anatomical point is marked on the output image. This is understood as labeling the medical image) and the set of template images for corresponding anatomically similar regions (pg. 4 col. 2 para. last, a point of interest is marked on the reference image, which is understood as labeling the pixels in the template image).

Yan does not disclose expressly that the model is a vision transformer model, nor clustering the pixel level features into anatomically similar regions. Salehi discloses: a vision transformer model ([0090], a vision transformer model generates features from a source, or template, image, and [0093], from a target image; see [0053] for how the vision transformer model operates on patches); and clustering, via the processor, the respective pixel level features for both the orthogonal set of medical images and the set of template images ([0099]: "In some examples, the propagated cluster map can be determined using a propagator. For example, patch mapping between the first set of features and the second set of features can be performed using a Temporal Patch Propagator (TPP)". Therefore, the features from the source image, i.e. the template image, are clustered with the features from the target image, i.e. the medical image, to generate a propagated cluster map. This clustering is a clustering of features from both images and is therefore understood as a paired clustering. See also [0082] for how features are clustered to form the propagated cluster map: "The propagator 470 can be a TPP which utilizes the modified cluster map 457, the source F1 feature map 456, and the target F2 feature map 452 to determine and generate as output a propagated cluster map 472". Salehi does not disclose expressly an orthogonal set of medical images. However, as an orthogonal set of images is often portrayed as a plurality of image slices, it would have been obvious to a person of ordinary skill in the art to perform the clustering of Salehi on the orthogonal medical images taught by Yan) into anatomically similar regions ([0099]: "In some cases, the propagated cluster map can be indicative of a correspondence of patches between the source image and the target image." Corresponding patches are understood as similar regions. A person of ordinary skill in the art would understand that, whether the subject of the image is anatomical, as taught by Yan, or another subject, the invention of Salehi would still operate to find similar regions. Therefore, when considered in combination with Yan, the regions are understood as anatomical regions as shown by Yan).

It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to use the vision transformer model and clustering of Salehi with the invention of Yan. The motivation for doing so would have been "For example, the systems and techniques can be used to perform unsupervised semantic segmentation based on using temporally-propagated cluster maps. In some examples, the temporally-propagated cluster maps can be utilized as a time-based supervision signal for the unsupervised semantic segmentation. The systems and techniques can also be used to perform other operations or tasks, such as object detection, depth estimation, or other operation or task" (Salehi, [0029]). In other words, the vision transformer model and clustering of Salehi allow for unsupervised segmentation of objects across frames, which is beneficial for tracking regions of interest and objects. Therefore, it would have been obvious to combine Salehi with Yan to obtain the invention as specified in claim 9.

Regarding claim 10, Yan in view of Salehi discloses the subject matter of claim 7. Yan further discloses: The computer-implemented method of claim 7, further comprising assigning, via the processor, one or more of the corresponding anatomically similar regions in the medical image with the respective anatomical label associated with the corresponding anatomically similar regions in the template image (pg. 5 col. 1 para. 1, the matched anatomical point is found. Fig. 3, the matched anatomical point is marked on the output image. This is understood as labeling the medical image. See also Fig. 9, the anatomically similar point to the point in the template image is found in the query image, such as right edge sternum, left spinal extensors, etc. A person of ordinary skill in the art would understand that the query image is marked and labeled with the same term as the template image).

Regarding claim 11, Yan in view of Salehi discloses the subject matter of claim 7. Yan further discloses: The computer-implemented method of claim 7, further comprising marking, via the processor, a region of interest in the template image with a first reference point (pg. 4 col. 2 para. last, a point of interest, i.e. a reference point, is marked on the reference image; as the point of interest highlights an area on the reference image, it is understood to mark a region of interest).

Regarding claim 12, Yan in view of Salehi discloses the subject matter of claim 11. Yan further discloses: The computer-implemented method of claim 11, further comprising marking, via the processor, a corresponding region in the medical image with a second reference point (pg. 5 col. 1 para. 1, the matched anatomical point is found. Fig. 3, the matched anatomical point is marked on the output image) that corresponds to the region of interest in the template image marked with the first reference point (pg. 5 col. 1 para. 1, the marked point on the output image is the one with the highest similarity to the reference point on the template image, which is understood as the region of interest marked by the first reference point).

Regarding claim 14, claim 14 recites a system with elements corresponding to the steps recited in claim 7. Therefore, the recited elements of this claim are mapped to Yan in view of Salehi in the same manner as the corresponding steps in its corresponding method claim, claim 7. Additionally, the rationale and motivation to combine Yan in view of Salehi presented in the rejection of claim 7 apply to this claim. Finally, Yan discloses: a processor (pg. 5 col. 1 para. 1, the operation is performed on a GPU; further, a person of ordinary skill in the art would understand that the described process of digital image processing would be performed on a processor). Yan does not disclose expressly a memory encoding processor-executable routines and a processor configured to execute the routines. Salehi discloses: A system for performing one-shot anatomy localization, comprising: a memory encoding processor-executable routines ([0035], the system includes a memory which stores instructions); and a processor configured to access the memory and to execute the processor-executable routines ([0035], instructions are loaded from the memory and executed by a processor). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the memory and processor of Salehi with the invention of Yan. The motivation for doing so would have been the combination of known elements (the method of Yan and the processor and memory of Salehi) in a known fashion (it is known in the art to use a processor and memory for image processing, as shown by Salehi) to obtain a predictable result (computer-enabled image processing). Further, while Yan does not disclose expressly a memory storing instructions and a processor to execute instructions, a person of ordinary skill in the art would understand that Yan would use a processor with a memory. Therefore, it would have been obvious to combine Salehi with Yan to obtain the invention as specified in claim 14.

Regarding claim 15, claim 15 recites a system with elements corresponding to the steps recited in claim 8. Therefore, the recited elements of this claim are mapped to Yan in view of Salehi in the same manner as the corresponding steps in its corresponding method claim, claim 8. Additionally, the rationale and motivation to combine Yan in view of Salehi presented in the rejection of claim 8 apply to this claim.

Regarding claim 16, claim 16 recites a system with elements corresponding to the steps recited in claim 9. Therefore, the recited elements of this claim are mapped to Yan in view of Salehi in the same manner as the corresponding steps in its corresponding method claim, claim 9. Additionally, the rationale and motivation to combine Yan in view of Salehi presented in the rejection of claim 9 apply to this claim.

Regarding claim 17, claim 17 recites a system with elements corresponding to the steps recited in claim 10. Therefore, the recited elements of this claim are mapped to Yan in view of Salehi in the same manner as the corresponding steps in its corresponding method claim, claim 10. Additionally, the rationale and motivation to combine Yan in view of Salehi presented in the rejection of claim 10 apply to this claim.

Regarding claim 18, claim 18 recites a system with elements corresponding to the steps recited in claim 11. Therefore, the recited elements of this claim are mapped to Yan in view of Salehi in the same manner as the corresponding steps in its corresponding method claim, claim 11. Additionally, the rationale and motivation to combine Yan in view of Salehi presented in the rejection of claim 11 apply to this claim.

Regarding claim 19, claim 19 recites a system with elements corresponding to the steps recited in claim 12. Therefore, the recited elements of this claim are mapped to Yan in view of Salehi in the same manner as the corresponding steps in its corresponding method claim, claim 12. Additionally, the rationale and motivation to combine Yan in view of Salehi presented in the rejection of claim 12 apply to this claim.

Claims 13 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al. ("SAM: Self-supervised Learning of Pixel-wise Anatomical Embeddings in Radiological Images"; full reference on PTO-892; hereafter, Yan) in view of Salehi et al. (US 20260017921 A1; hereafter, Salehi) in further view of Novosad et al. (US 20220176157 A1; hereafter, Novosad).

Regarding claim 13, Yan in view of Salehi discloses the subject matter of claim 7. Yan in view of Salehi does not disclose expressly applying segmentation to both the medical image and the template image. Novosad discloses: The computer-implemented method of claim 7, wherein assigning cluster labels comprises applying segmentation masks to both the medical image ([0062], the current image, i.e. the medical image, is segmented) and the template image ([0060], a reference image is segmented). Novosad is combinable with Yan in view of Salehi because it is from the related field of endeavor of medical image segmentation (Novosad, [0002]). It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the segmentation of Novosad with the invention of Yan in view of Salehi. The motivation for doing so would have been that "The disclosed network architecture includes two or more encoders and one decoder. One encoder acts to encode information contained in the daily MRI, while the other encoder encodes information contained in the reference image and segmentation pair. The encoded information is combined, and then decoded by the decoder branch, yielding the predicted (estimated) segmentation for the current fraction. This approach provides high modelling capacity and conditions the output segmentation on an example reference fraction contour which avoids the disadvantage of possibly inaccurate deformable image registration" (Novosad, [0055], emphasis added). Therefore, it would have been obvious to combine Novosad with Yan in view of Salehi to obtain the invention as specified in claim 13.
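The claim 13 mapping reads "assigning cluster labels" as applying segmentation masks to both images. A minimal sketch of turning a per-pixel cluster-label map into one boolean mask per region (names and sizes illustrative):

```python
# Sketch: convert a cluster-label map into per-region segmentation masks.
import numpy as np

def masks_from_labels(label_map):
    """label_map: (H, W) int cluster ids. Returns {id: (H, W) bool mask}."""
    return {int(k): label_map == k for k in np.unique(label_map)}

labels = np.random.default_rng(3).integers(0, 4, size=(64, 64))
masks = masks_from_labels(labels)   # e.g. masks[2] selects cluster 2's region
```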
Regarding claim 20, claim 20 recites a system with elements corresponding to the steps recited in claim 13. Therefore, the recited elements of this claim are mapped to Yan in view of Salehi in further view of Novosad in the same manner as the corresponding steps in its corresponding method claim, claim 13. Additionally, the rationale and motivation to combine Yan in view of Salehi in further view of Novosad presented in the rejection of claim 13 apply to this claim.

Conclusion

Applicant's amendment necessitated the new grounds of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US 20140029857 A1, Kompalli et al., discloses a system which registers two digital items, image or document, by clustering features in the digital items and performing a registration based on the clusters of features.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOSHUA B CROCKETT, whose telephone number is (571) 270-7989. The examiner can normally be reached Monday-Thursday, 8am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, John M Villecco, can be reached at (571) 272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JOSHUA B. CROCKETT/
Examiner, Art Unit 2661

/JOHN VILLECCO/
Supervisory Patent Examiner, Art Unit 2661

Prosecution Timeline

Sep 27, 2023: Application Filed
Oct 30, 2025: Non-Final Rejection — §103
Jan 15, 2026: Response Filed
Mar 09, 2026: Final Rejection — §103 (current)

Precedent Cases

Applications granted by the same examiner involving similar technology

Patent 12592060: ARTIFICIAL INTELLIGENCE DEVICE AND 3D AGENCY GENERATING METHOD THEREOF (granted Mar 31, 2026; 2y 5m to grant)
Patent 12587704: VIDEO DATA TRANSMISSION AND RECEPTION METHOD USING HIGH-SPEED INTERFACE, AND APPARATUS THEREFOR (granted Mar 24, 2026; 2y 5m to grant)
Patent 12567150: EDITING PRESEGMENTED IMAGES AND VOLUMES USING DEEP LEARNING (granted Mar 03, 2026; 2y 5m to grant)
Patent 12561839: SYSTEMS AND METHODS FOR CALIBRATING IMAGE SENSORS OF A VEHICLE (granted Feb 24, 2026; 2y 5m to grant)
Patent 12529639: METHOD FOR ESTIMATING HYDROCARBON SATURATION OF A ROCK (granted Jan 20, 2026; 2y 5m to grant)

Based on the examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 72%
Grant Probability with Interview: 99% (+27.5%)
Median Time to Grant: 3y 0m
PTA Risk: Moderate

Based on 18 resolved cases by this examiner. Grant probability is derived from the career allow rate.
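As a sanity check, these headline figures are consistent with simple arithmetic on the examiner stats above, assuming the dashboard divides grants by resolved cases and adds the interview lift as percentage points (the exact formula is not stated):

\[
\frac{13 \text{ granted}}{18 \text{ resolved}} \approx 72.2\%, \qquad 72.2\% + 27.5\% \approx 99.7\%,
\]

which matches the displayed 72% and, after rounding or capping, the displayed 99% with interview.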
