DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendments
The amendments overcome the 35 USC 112 and 35 USC 101 rejections. They move limitations from dependent claims 2 and 12 into amended independent claims 1 and 11, respectively. However, the amendments do not overcome the 35 USC 102 rejections (see the Response to Arguments section below for details).
Response to Arguments
Applicant's arguments pertaining to the prior art rejections have been fully considered but they are not persuasive. Beginning at the bottom of page 7 of Applicant’s remarks, Applicant argues that:
[media_image1.png: Applicant's argument reproduced from the remarks (image not rendered in this text)]
Examiner respectfully disagrees. Consider the claim limitations in question. Claim 1 recites, inter alia, “generate, using an artificial neural network (ANN), a feature representation that represents the image of the person and the textual description as an associated pair; identify one or more images from an image repository based on at least the feature representation generated using the ANN”. The broadest reasonable interpretation of these limitations is that the ANN associates image and text information, and this association is used in identifying images from a repository. Notably, Applicant states that, “The aforementioned claim features allow image search and retrieval to be performed not based on an image or a textual description alone, but based on the two as a pair” (emphasis added). The asserted interpretation may be allowed by the claim language, but it is not required. That is, this claim language does not require that image search and retrieval be performed based on an image and textual description as a pair. The broad claim language also allows for the interpretation that image search and retrieval can be performed based on an image or a textual description alone – based on the image-text association established by the ANN – as taught by Tizhoosh. No other interpretation is required by the claims.
Beginning in the third paragraph on page 8 of Applicant’s remarks, Applicant argues that:
[media_image2.png: Applicant's argument reproduced from the remarks (image not rendered in this text)]
Examiner respectfully disagrees that Tizhoosh does not teach the claim limitations. Applicant’s characterization of Tizhoosh, above, falls within the scope of the broadest reasonable interpretation of the claims. That is, this claim language does not require that image search and retrieval be performed based on an image and textual description as a pair. The broad claim language allows for the interpretation that image search and retrieval can be performed based on an image or a textual description alone – based on the image-text association established by the ANN – as taught by Tizhoosh. No other interpretation is required by the claims.
Beginning in the last paragraph on page 8 of Applicant’s remarks, Applicant argues that:
[media_image3.png: Applicant's argument reproduced from the remarks (image not rendered in this text)]
Examiner respectfully disagrees that Tizhoosh does not teach the claim limitations. The claims do not recite “query image” or “query description”. Instead, the claim recites, “generate, using an artificial neural network (ANN), a feature representation that represents the image of the person and the textual description as an associated pair; identify one or more images from an image repository based on at least the feature representation generated using the ANN”. This feature representation can be interpreted as an association generated by training the ANN to determine the correspondence between the two data types, thereby allowing for cross-domain searching. No alternative interpretation is required by the claim limitations.
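As an editorial illustration of the cross-domain association discussed above (not part of the record of either the claims or Tizhoosh): the following minimal sketch, assuming a CLIP-style contrastive setup with hypothetical encoder names and dimensions, shows how an ANN trained on image-text pairs can establish a correspondence that later supports retrieval from either modality alone.

```python
import torch
import torch.nn.functional as F

# Hypothetical projection heads standing in for any image/text backbones;
# names and dimensions are illustrative assumptions, not from the record.
image_encoder = torch.nn.Linear(2048, 256)  # pooled image features -> joint space
text_encoder = torch.nn.Linear(768, 256)    # pooled text features  -> joint space

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss: matched image-text pairs are pulled together
    in a shared embedding space; mismatched pairs are pushed apart."""
    img = F.normalize(image_encoder(image_feats), dim=-1)
    txt = F.normalize(text_encoder(text_feats), dim=-1)
    logits = img @ txt.t() / temperature   # pairwise similarity matrix
    targets = torch.arange(len(img))       # the i-th image matches the i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

After such training, a query from either modality can be embedded on its own and compared against repository embeddings; the image-text association resides in the learned joint space, consistent with the interpretation set forth above.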
Regarding the 35 USC 103 rejections of claims 4 and 14, Applicant argues that these dependent claims are allowable because the independent claims are alleged to be allowable; this assertion fails in view of the Office’s foregoing responses.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1, 3, 5-11, 13, and 15-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US 2024/0194328 A1 (Tizhoosh).
As per claim 1, Tizhoosh teaches an apparatus, comprising:
one or more processors configured to (Tizhoosh: paras 136-149;
[media_image4.png: excerpt of Tizhoosh (image not rendered in this text)]
):
obtain an image of a person (Tizhoosh:
para 2: “and accessing, with the computer system, second input data of a second modality (e.g., image data corresponding to image information)”;
[media_image5.png and media_image6.png: excerpts of Tizhoosh (images not rendered in this text)]
);
obtain a textual description associated with the image (Tizhoosh:
para 2: “The method includes accessing, with a computer system, first input data of a first modality (e.g., text data corresponding to textual information)”;
Fig. 11 (shown above): “examples of paired first input data (text) and second input data (images)”
Para 64 (shown above): “the first input data includes morphological descriptions and diagnoses for a wide variety of tissue types and staining. The dataset images (i.e., the second input data) include histopathology images (e.g., whole slide images) corresponding to the morphological descriptions.”);
generate, using an artificial neural network (ANN), a feature representation that represents the image of the person and the textual description as an associated pair (Tizhoosh: abstract: “The first and second input data are then input to a dual attention network that has been trained on training data to extract feature data from cross-modal input data. Feature data are generated as outputs by inputting the first and second input data to the dual attention network. The feature data include feature representations of the first modality and the second modality.”;
para 2: “A dual attention network trained on training data to extract feature data from cross-modal input data is also accessed with the computer system. The first input data and the second input data are input to the dual attention network by the computer system, generating outputs as feature data comprising feature representations of the first modality and the second modality”;
Para 48: “refining the feature representation regarding the shared information between two modalities”;
[media_image7.png: excerpt of Tizhoosh (image not rendered in this text)]
Para 26: “apply images and text at the same time… the disclosed systems and methods are capable of fusing image-text pairs (or other cross-modality data pairs or groups)”;
para 28: “The cross-attention modules 110 are employed to extract important, or otherwise relevant, segments from the first and second input data (e.g., images and text), considering their relevance to the other modality (e.g., text and images, respectively). Additionally, an iterative matching scheme using a gated memory block 112 is applied to refine the extracted features for each modality.”;
[media_image8.png through media_image15.png: excerpts of Tizhoosh (images not rendered in this text)]
);
identify one or more images from an image repository based on at least the feature representation generated using the ANN (Tizhoosh: para 24: “the dual attention network can be configured for cross-modal information retrieval in histopathology archives (e.g., the first input data may be histopathology images and the second input data may be text).”:
[media_image16.png through media_image18.png: excerpts of Tizhoosh (images not rendered in this text)]
Para 83: “dual attention network approach was demonstrated to function as a bi-directional retriever for both GRH and LC25000 datasets”;
Fig. 10 (shown above): 1008;
[media_image19.png and media_image20.png: excerpts of Tizhoosh (images not rendered in this text)]
); and
provide an indication regarding the one or more images identified from the image repository (Tizhoosh: para 25: “the disclosed systems and methods are capable of generating diagnostic reports and retrieving images based on symptom description”;
para 73: “the feature data may be processed to generate short diagnostic reports based on the input text and images. As another example, the feature data may be processed to retrieve images based on a symptom description”;
para 102: “The top three results from the text-to-image (t2i) retrieval process are displayed in FIG. 16.”:
[media_image21.png: excerpt of Tizhoosh (image not rendered in this text)]
Para 110: “the central goal of the model is to effectively locate and pair corresponding images and texts”;
[media_image22.png: excerpt of Tizhoosh (image not rendered in this text)]
),
wherein the ANN includes at least a first neural network, a second neural network, and a cross-attention module (Tizhoosh:
[media_image8.png and media_image10.png, reproduced from above (images not rendered in this text)]
para 27: “a dual attention network 100 includes an image subnetwork 120 and a text subnetwork 140”;
[media_image9.png, reproduced from above (image not rendered in this text)]
),
wherein the first neural network is configured to extract features from the image of the person (Tizhoosh:
[media_image23.png and media_image24.png: excerpts of Tizhoosh (images not rendered in this text)]
),
wherein the second neural network is configured to extract features from the textual description associated with the image of the person (Tizhoosh:
[media_image25.png and media_image26.png: excerpts of Tizhoosh (images not rendered in this text)]
), and
wherein the cross-attention module is configured to establish a relationship between the features extracted from the image and the features extracted from the textual description (Tizhoosh: Fig. 1 (shown above): 110;
para 28: “The cross-attention modules 110 are employed to extract important, or otherwise relevant, segments from the first and second input data (e.g., images and text), considering their relevance to the other modality (e.g., text and images, respectively). Additionally, an iterative matching scheme using a gated memory block 112 is applied to refine the extracted features for each modality.”;
[media_image27.png and media_image28.png: excerpts of Tizhoosh (images not rendered in this text)]
Para 48: “refining the feature representation regarding the shared information between two modalities”).
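To aid the reader, the following editorial sketch (not Tizhoosh's implementation; all module choices, names, and dimensions are assumptions) illustrates the claimed arrangement mapped above: a first neural network extracting image features, a second neural network extracting text features, and a cross-attention module establishing a relationship between the two, pooled into a single feature representation of the associated pair.

```python
import torch
import torch.nn as nn

class DualAttentionSketch(nn.Module):
    """Illustrative only: two unimodal encoders plus a cross-attention
    module that relates image features to text features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # First neural network: extracts features from the image
        self.image_net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        # Second neural network: extracts features from the textual description
        self.text_net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        # Cross-attention module: establishes a relationship between modalities
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens, text_tokens):
        img = self.image_net(image_tokens)   # (B, Ni, dim) image features
        txt = self.text_net(text_tokens)     # (B, Nt, dim) text features
        # Image features attend to text features (query=image, key/value=text)
        fused, _ = self.cross_attn(img, txt, txt)
        # Pool into a single feature representation of the associated pair
        return fused.mean(dim=1)             # (B, dim)
```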
As per claim 3, Tizhoosh teaches the apparatus of claim 1, wherein at least one of the first neural network or the second neural network includes a transformer neural network (Tizhoosh: See arguments and citations offered in rejecting claim 1 above;
Para 27: “the feature representation data (e.g., image feature representation data 102, text feature representation data 104) can be generated using a suitable transformer model”;
Para 29: “the image feature representation data 102 and text feature representation data 104 input to the dual attention network 100 may be the outputs of individual transformer models. For example, a transformer architecture can be employed for both image and text encoding backbones, with the self-attention modules of the dual attention network being leveraged to highlight key aspects, or features, of images and text.”;
Para 30: “The modular architecture of a transformer model enables the processing of different modalities (e.g., images, videos, text, and voice) leveraging similar processing blocks. Transformer models can scale efficiently to large capacity networks for complex tasks and can perform well with massive datasets, such as WSIs. As noted above, transformer architectures can be used to extract feature representations from the first and second input data (e.g., text and images) in the disclosed systems and methods due to these advantages.”).
As per claim 5, Tizhoosh teaches the apparatus of claim 1, wherein the second neural network is configured to implement a machine-learning (ML) language model that is pre-trained to extract the features from the textual description and generate an embedding that represents the extracted features (Tizhoosh: See arguments and citations offered in rejecting claim 1 above;
Para 10: “FIG. 6 is an example robustly optimized BERT (Bidirectional Encoder Representations from Transformers) pre-training approach (RoBERTa) network.”;
[media_image29.png: excerpt of Tizhoosh (image not rendered in this text)]
).
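As an editorial illustration of a pre-trained language model extracting features from a textual description and generating an embedding, the sketch below assumes the Hugging Face transformers library and the public roberta-base checkpoint; the example description string is hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Pre-trained RoBERTa checkpoint (assumed for illustration)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# Hypothetical textual description, for demonstration only
description = "adenocarcinoma, moderately differentiated, gastric biopsy"
inputs = tokenizer(description, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token states into one fixed-size embedding of the description
embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
```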
As per claim 6, Tizhoosh teaches the apparatus of claim 1, wherein the feature representation is generated by conditioning the features extracted from the image of the person on the features extracted from the textual description, or by combining the features extracted from the image of the person with the features extracted from the textual description (Tizhoosh: See arguments and citations offered in rejecting claim 1 above: paras 56, 62, 66, 72 (all shown above);
para 28: “The cross-attention modules 110 are employed to extract important, or otherwise relevant, segments from the first and second input data (e.g., images and text), considering their relevance to the other modality (e.g., text and images, respectively). Additionally, an iterative matching scheme using a gated memory block 112 is applied to refine the extracted features for each modality.”).
As per claim 7, Tizhoosh teaches the apparatus of claim 1, wherein the one or more images from the image repository are tagged with respective textual descriptions, and wherein the one or more processors are configured to identify the one or more images further based on the respective textual descriptions used to tag the one or more images (Tizhoosh: See arguments and citations offered in rejecting claim 1 above;
Para 20: “FIG. 16 illustrates image retrieval using real-world medical descriptions extracted from the WHO dataset, which wasn't included in the original training set. Under each description, the associated primary diagnosis is displayed. To the right, the top three images retrieved by the model are presented alongside their corresponding primary diagnoses. This illustration showcases the model's ability to retrieve and match images based on textual medical descriptions accurately.”;
para 31: “The choice of feature extraction technique may be chosen depending on the dataset and the availability of annotations for the object detection task.”;
para 84: “In relation to the PatchGastricADC22 dataset, which is characterized by paired image-description entries at the WSI level, the provided descriptions were employed as text inputs.”;
para 109: “In the processing of the PatchGastricADC22 dataset, patches of 300 by 300 pixels in size were utilized. The description of each extracted patch was linked to the corresponding WSI it was obtained from.”;
para 116: “The dataset under consideration contains authentic captions penned by pathologists”).
As per claim 8, Tizhoosh teaches the apparatus of claim 7, wherein the textual description associated with the image of the person differs, on a verbatim basis, from at least one of the textual descriptions used to tag the one or more images (Tizhoosh: See arguments and citations offered in rejecting claim 7 above;
Para 20: “FIG. 16 illustrates image retrieval using real-world medical descriptions extracted from the WHO dataset, which wasn't included in the original training set. Under each description, the associated primary diagnosis is displayed. To the right, the top three images retrieved by the model are presented alongside their corresponding primary diagnoses. This illustration showcases the model's ability to retrieve and match images based on textual medical descriptions accurately.”
Para 101: “The dataset used for this experiment was sourced from the WHO (World Health Organization), which was not part of the original training set but provided real medical reports all related to the same primary diagnosis.”).
As per claim 9, Tizhoosh teaches the apparatus of claim 7, wherein the image of the person includes a medical scan image that depicts an anatomical structure of the person, wherein the textual description associated with the image of the person indicates an abnormality of the anatomical structure, and wherein at least one of the one or more images identified from the image repository depicts the anatomical structure of a different person with a substantially similar abnormality (Tizhoosh: See arguments and citations offered in rejecting claim 7 above;
para 63: “The image modality may be associated with histopathology images (e.g., whole slide images, patches extracted from whole slide images), medical images (e.g., MR images, CT images, ultrasound images, PET images, etc.), or the like.”).
As per claim 10, Tizhoosh teaches the apparatus of claim 1, wherein the one or more processors being configured to provide the indication regarding the one or more images identified from the image repository comprises the one or more processors being configured to provide a ranking of the one or more images based on respective relevance of the one or more images to the image of the person (Tizhoosh: See arguments and citations offered in rejecting claim 1 above;
para 102: “The top three results from the text-to-image (t2i) retrieval process are displayed in FIG. 16.”;
[media_image21.png, reproduced from above (image not rendered in this text)]
Para 105: “top K retrieved items”).
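As an editorial illustration of ranking repository images by relevance, the sketch below assumes cosine similarity over the feature representations; the scoring choice and function name are assumptions, not taken from Tizhoosh.

```python
import torch
import torch.nn.functional as F

def rank_repository(query_embedding, repository_embeddings, k=3):
    """Score every repository image by cosine similarity to the query's
    feature representation; return the K most relevant (index, score) pairs."""
    q = F.normalize(query_embedding, dim=-1)        # (1, dim) query
    r = F.normalize(repository_embeddings, dim=-1)  # (N, dim) repository
    scores = (r @ q.t()).squeeze(-1)                # (N,) similarity per image
    top = torch.topk(scores, k)
    return list(zip(top.indices.tolist(), top.values.tolist()))
```

Under these assumptions, a "top three results" display such as that referenced in Tizhoosh para 102 would correspond to k=3.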
As per claims 11, 13, and 15-19, the arguments made in rejecting claims 1, 3, 5-7, 9, and 8, respectively, are analogous. Considering claim 11, Tizhoosh also teaches identifying one or more images from an image repository based on at least the feature representation that represents the image of the person and the textual description as the associated pair (Tizhoosh: See arguments and citations offered in rejecting claim 1 above).
As per claim 20, Tizhoosh teaches a non-transitory computer-readable medium comprising instructions that, when executed by a processor included in a computing device, cause the processor to implement the method of claim 11 (Tizhoosh: paras 136-149; Fig. 19).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Tizhoosh as applied to claims 3 and 13 above, and further in view of US 20240185588 A1 (Kumari).
As per claim 4, Tizhoosh teaches the apparatus of claim 3, wherein the features extracted from the textual description are provided to the cross-attention module (Tizhoosh: See arguments and citations offered in rejecting claim 3 above;
Fig. 8 and associated text (both shown above)).
Tizhoosh is silent regarding “in one or more key matrices and one or more value matrices … in one or more query matrices”.
Kumari teaches the features extracted from the textual description are provided to the cross-attention module in one or more key matrices and one or more value matrices, and wherein the features extracted from the image of the person are provided to the cross-attention module in one or more query matrices (Kumari:
[media_image30.png through media_image33.png: excerpts of Kumari (images not rendered in this text)]
Para 111 (shown below): “Q mappings 1030 transform pixel features 1005 into a query vector. K mappings 1020 may transform text features 1010 into a key vector, and V mappings 1024 may transform the text features 1025 into a value vector. Projection mappings 1015 may include weights that are trained, enabling cross-attention between pixel features 1005 and text features 1010. In some embodiments, only K mappings 1020 and V mappings 1025 are trained during fine-tuning”:
[media_image34.png and media_image35.png: excerpts of Kumari (images not rendered in this text)]
).
Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the teachings of Kumari into Tizhoosh. Both Tizhoosh and Kumari suggest a practical solution that uses cross-attention to map an input text condition to image features, wherein the text condition is determined using a text transformer. Kumari additionally teaches providing the features extracted from the textual description to the cross-attention module in one or more key matrices and one or more value matrices, and providing the features extracted from the image of the person to the cross-attention module in one or more query matrices, so that “The cross-attention layers modify the latent features of the network according to the input text condition” (Kumari: para 86). Furthermore, one of ordinary skill in the art could have combined the elements as claimed by known methods, and, in combination, each component functions the same as it does separately. One of ordinary skill in the art would have recognized that the results of the combination would be predictable.
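As an editorial illustration of the combination articulated above, the sketch below projects text features into key and value matrices and image features into query matrices; the projection names and sizes are assumptions, not taken from Kumari.

```python
import torch
import torch.nn as nn

dim = 256  # assumed shared feature width, for illustration only
w_q = nn.Linear(dim, dim, bias=False)  # Q projection applied to image features
w_k = nn.Linear(dim, dim, bias=False)  # K projection applied to text features
w_v = nn.Linear(dim, dim, bias=False)  # V projection applied to text features

def cross_attention(image_feats, text_feats):
    """Scaled dot-product cross-attention: queries from the image,
    keys/values from the textual description."""
    q = w_q(image_feats)                  # (B, Ni, dim) query matrices
    k = w_k(text_feats)                   # (B, Nt, dim) key matrices
    v = w_v(text_feats)                   # (B, Nt, dim) value matrices
    attn = torch.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)
    return attn @ v                       # text-conditioned image features
```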
As per claim 14, the arguments made in rejecting claim 4 are analogous.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
US 20230054096 A1 (ICHINOSE) also teaches all limitations of claims 1, 6, 11, 16, and 20.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Atiba Fitzpatrick, whose telephone number is (571) 270-5255. The examiner can normally be reached Monday through Friday, 10:00 am to 6:00 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Bee can be reached on (571) 270-5183. The fax phone number for Atiba Fitzpatrick is (571) 270-6255.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ATIBA O FITZPATRICK/
Primary Examiner, Art Unit 2677