DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendments
The amendments overcome the 35 USC 112 and 35 USC 101 rejections. They move limitations from dependent claims 2 and 12 into amended independent claims 1 and 11, respectively. However, the amendments do not overcome the 35 USC 102 rejections (see the Response to Arguments section below for details).
Response to Arguments
Applicant's arguments pertaining to the prior art rejections have been fully considered but they are not persuasive. Beginning at the bottom of page 7 of Applicant’s remarks, Applicant argues that:
[media_image1.png: Applicant's argument reproduced from the remarks (image not rendered in this text)]
Examiner respectfully disagrees. Consider the claim limitations in question. Claim 1 recites, inter alia, “generate, using an artificial neural network (ANN), a feature representation that represents the image of the person and the textual description as an associated pair; identify one or more images from an image repository based on at least the feature representation generated using the ANN”. The broadest reasonable interpretation of these limitations is that the ANN associates image and text information, and this association is used in identifying images from a repository. Notably, Applicant states that, “The aforementioned claim features allow image search and retrieval to be performed not based on an image or a textual description alone, but based on the two as a pair” (emphasis added). The asserted interpretation may be allowed by the claim language, but it is not required. That is, this claim language does not require that image search and retrieval be performed based on an image and textual description as a pair. The broad claim language also allows for the interpretation that image search and retrieval can be performed based on an image or a textual description alone – based on the image-text association established by the ANN – as taught by Tizhoosh. No other interpretation is required by the claims.
Beginning in the third paragraph on page 8 of Applicant’s remarks, Applicant argues that:
[media_image2.png: Applicant's argument reproduced from the remarks (image not rendered in this text)]
Examiner respectfully disagrees that Tizhoosh does not teach the claim limitations. Applicant’s characterization of Tizhoosh, above, falls within the scope of the broadest reasonable interpretation of the claims. That is, this claim language does not require that image search and retrieval be performed based on an image and textual description as a pair. The broad claim language allows for the interpretation that image search and retrieval can be performed based on an image or a textual description alone – based on the image-text association established by the ANN – as taught by Tizhoosh. No other interpretation is required by the claims.
Beginning in the last paragraph on page 8 of Applicant’s remarks, Applicant argues that:
[media_image3.png: Applicant's argument reproduced from the remarks (image not rendered in this text)]
Examiner respectfully disagrees that Tizhoosh does not teach the claim limitations. The claims do not recite “query image” or “query description”. Instead, the claim recites, “generate, using an artificial neural network (ANN), a feature representation that represents the image of the person and the textual description as an associated pair; identify one or more images from an image repository based on at least the feature representation generated using the ANN”. This feature representation can be interpreted as an association generated by training the ANN to determine the correspondence between the two data types, thereby allowing for cross-domain searching. No alternative interpretation is required by the claim limitations.
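As an editorial illustration of the cross-domain association discussed above (not part of the record of either the claims or Tizhoosh): the following minimal sketch, assuming a CLIP-style contrastive setup with hypothetical encoder names and dimensions, shows how an ANN trained on image-text pairs can establish a correspondence that later supports retrieval from either modality alone.

```python
import torch
import torch.nn.functional as F

# Hypothetical projection heads standing in for any image/text backbones;
# names and dimensions are illustrative assumptions, not from the record.
image_encoder = torch.nn.Linear(2048, 256)  # pooled image features -> joint space
text_encoder = torch.nn.Linear(768, 256)    # pooled text features  -> joint space

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss: matched image-text pairs are pulled together
    in a shared embedding space; mismatched pairs are pushed apart."""
    img = F.normalize(image_encoder(image_feats), dim=-1)
    txt = F.normalize(text_encoder(text_feats), dim=-1)
    logits = img @ txt.t() / temperature   # pairwise similarity matrix
    targets = torch.arange(len(img))       # the i-th image matches the i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

After such training, a query from either modality can be embedded on its own and compared against repository embeddings; the image-text association resides in the learned joint space, consistent with the interpretation set forth above.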
Regarding the 35 USC 103 rejections of claims 4 and 14, Applicant argues that these dependent claims are allowable because the independent claims are alleged to be allowable; this assertion fails in view of the Office’s foregoing responses.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1, 3, 5-11, 13, and 15-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US 2024/0194328 A1 (Tizhoosh).
As per claim 1, Tizhoosh teaches an apparatus, comprising:
one or more processors configured to (Tizhoosh: paras 136-149;
[media_image4.png: excerpt of Tizhoosh (image not rendered in this text)]
):
obtain an image of a person (Tizhoosh:
para 2: “and accessing, with the computer system, second input data of a second modality (e.g., image data corresponding to image information)”;
[media_image5.png and media_image6.png: excerpts of Tizhoosh (images not rendered in this text)]
);
obtain a textual description associated with the image (Tizhoosh:
para 2: “The method includes accessing, with a computer system, first input data of a first modality (e.g., text data corresponding to textual information)”;
Fig. 11 (shown above): “examples of paired first input data (text) and second input data (images)”
Para 64 (shown above): “the first input data includes morphological descriptions and diagnoses for a wide variety of tissue types and staining. The dataset images (i.e., the second input data) include histopathology images (e.g., whole slide images) corresponding to the morphological descriptions.”);
generate, using an artificial neural network (ANN), a feature representation that represents the image of the person and the textual description as an associated pair (Tizhoosh: abstract: “The first and second input data are then input to a dual attention network that has been trained on training data to extract feature data from cross-modal input data. Feature data are generated as outputs by inputting the first and second input data to the dual attention network. The feature data include feature representations of the first modality and the second modality.”;
para 2: “A dual attention network trained on training data to extract feature data from cross-modal input data is also accessed with the computer system. The first input data and the second input data are input to the dual attention network by the computer system, generating outputs as feature data comprising feature representations of the first modality and the second modality”;
Para 48: “refining the feature representation regarding the shared information between two modalities”;
[media_image7.png: excerpt of Tizhoosh (image not rendered in this text)]
Para 26: “apply images and text at the same time… the disclosed systems and methods are capable of fusing image-text pairs (or other cross-modality data pairs or groups)”;
para 28: “The cross-attention modules 110 are employed to extract important, or otherwise relevant, segments from the first and second input data (e.g., images and text), considering their relevance to the other modality (e.g., text and images, respectively). Additionally, an iterative matching scheme using a gated memory block 112 is applied to refine the extracted features for each modality.”;
[media_image8.png through media_image15.png: excerpts of Tizhoosh (images not rendered in this text)]
);
identify one or more images from an image repository based on at least the feature representation generated using the ANN (Tizhoosh: para 24: “the dual attention network can be configured for cross-modal information retrieval in histopathology archives (e.g., the first input data may be histopathology images and the second input data may be text).”:
[media_image16.png through media_image18.png: excerpts of Tizhoosh (images not rendered in this text)]
Para 83: “dual attention network approach was demonstrated to function as a bi-directional retriever for both GRH and LC25000 datasets”;
Fig. 10 (shown above): 1008;
[media_image19.png and media_image20.png: excerpts of Tizhoosh (images not rendered in this text)]
); and
provide an indication regarding the one or more images identified from the image repository (Tizhoosh: para 25: “the disclosed systems and methods are capable of generating diagnostic reports and retrieving images based on symptom description”;
para 73: “the feature data may be processed to generate short diagnostic reports based on the input text and images. As another example, the feature data may be processed to retrieve images based on a symptom description”;
para 102: “The top three results from the text-to-image (t2i) retrieval process are displayed in FIG. 16.”:
[media_image21.png: excerpt of Tizhoosh (image not rendered in this text)]
Para 110: “the central goal of the model is to effectively locate and pair corresponding images and texts”;
[media_image22.png: excerpt of Tizhoosh (image not rendered in this text)]
),
wherein the ANN includes at least a first neural network, a second neural network, and a cross-attention module (Tizhoosh:
[media_image8.png and media_image10.png, reproduced from above (images not rendered in this text)]
para 27: “a dual attention network 100 includes an image subnetwork 120 and a text subnetwork 140”;
[media_image9.png, reproduced from above (image not rendered in this text)]
),
wherein the first neural network is configured to extract features from the image of the person (Tizhoosh:
[media_image23.png and media_image24.png: excerpts of Tizhoosh (images not rendered in this text)]
),
wherein the second neural network is configured to extract features from the textual description associated with the image of the person (Tizhoosh:
[media_image25.png and media_image26.png: excerpts of Tizhoosh (images not rendered in this text)]
), and
wherein the cross-attention module is configured to establish a relationship between the features extracted from the image and the features extracted from the textual description (Tizhoosh: Fig. 1 (shown above): 110;
para 28: “The cross-attention modules 110 are employed to extract important, or otherwise relevant, segments from the first and second input data (e.g., images and text), considering their relevance to the other modality (e.g., text and images, respectively). Additionally, an iterative matching scheme using a gated memory block 112 is applied to refine the extracted features for each modality.”;
[media_image27.png and media_image28.png: excerpts of Tizhoosh (images not rendered in this text)]
Para 48: “refining the feature representation regarding the shared information between two modalities”).
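To aid the reader, the following editorial sketch (not Tizhoosh's implementation; all module choices, names, and dimensions are assumptions) illustrates the claimed arrangement mapped above: a first neural network extracting image features, a second neural network extracting text features, and a cross-attention module establishing a relationship between the two, pooled into a single feature representation of the associated pair.

```python
import torch
import torch.nn as nn

class DualAttentionSketch(nn.Module):
    """Illustrative only: two unimodal encoders plus a cross-attention
    module that relates image features to text features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # First neural network: extracts features from the image
        self.image_net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        # Second neural network: extracts features from the textual description
        self.text_net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        # Cross-attention module: establishes a relationship between modalities
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens, text_tokens):
        img = self.image_net(image_tokens)   # (B, Ni, dim) image features
        txt = self.text_net(text_tokens)     # (B, Nt, dim) text features
        # Image features attend to text features (query=image, key/value=text)
        fused, _ = self.cross_attn(img, txt, txt)
        # Pool into a single feature representation of the associated pair
        return fused.mean(dim=1)             # (B, dim)
```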
As per claim 3, Tizhoosh teaches the apparatus of claim 1, wherein at least one of the first neural network or the second neural network includes a transformer neural network (Tizhoosh: See arguments and citations offered in rejecting claim 1 above;
Para 27: “the feature representation data (e.g., image feature representation data 102, text feature representation data 104) can be generated using a suitable transformer model”;
Para 29: “the image feature representation data 102 and text feature representation data 104 input to the dual attention network 100 may be the outputs of individual transformer models. For example, a transformer architecture can be employed for both image and text encoding backbones, with the self-attention modules of the dual attention network being leveraged to highlight key aspects, or features, of images and text.”;
Para 30: “The modular architecture of a transformer model enables the processing of different modalities (e.g., images, videos, text, and voice) leveraging similar processing blocks. Transformer models can scale efficiently to large capacity networks for complex tasks and can perform well with massive datasets, such as WSIs. As noted above, transformer architectures can be used to extract feature representations from the first and second input data (e.g., text and images) in the disclosed systems and methods due to these advantages.”).
As per claim 5, Tizhoosh teaches the apparatus of claim 1, wherein the second neural network is configured to implement a machine-learning (ML) language model that is pre-trained to extract the features from the textual description and generate an embedding that represents the extracted features (Tizhoosh: See arguments and citations offered in rejecting claim 1 above;
Para 10: “FIG. 6 is an example robustly optimized BERT (Bidirectional Encoder Representations from Transformers) pre-training approach (RoBERTa) network.”;
[media_image29.png: excerpt of Tizhoosh (image not rendered in this text)]
).
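As an editorial illustration of a pre-trained language model extracting features from a textual description and generating an embedding, the sketch below assumes the Hugging Face transformers library and the public roberta-base checkpoint; the example description string is hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Pre-trained RoBERTa checkpoint (assumed for illustration)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# Hypothetical textual description, for demonstration only
description = "adenocarcinoma, moderately differentiated, gastric biopsy"
inputs = tokenizer(description, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token states into one fixed-size embedding of the description
embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
```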
As per claim 6, Tizhoosh teaches the apparatus of claim 1, wherein the feature representation is generated by conditioning the features extracted from the image of the person on the features extracted from the textual description, or by combining the features extracted from the image of the person with the features extracted from the textual description (Tizhoosh: See arguments and citations offered in rejecting claim 1 above: paras 56, 62, 66, 72 (all shown above);
para 28: “The cross-attention modules 110 are employed to extract important, or otherwise relevant, segments from the first and second input data (e.g., images and text), considering their relevance to the other modality (e.g., text and images, respectively). Additionally, an iterative matching scheme using a gated memory block 112 is applied to refine the extracted features for each modality.”).
As per claim 7, Tizhoosh teaches the apparatus of claim 1, wherein the one or more images from the image repository are tagged with respective textual descriptions, and wherein the one or more processors are configured to identify the one or more images further based on the respective textual descriptions used to tag the one or more images (Tizhoosh: See arguments and citations offered in rejecting claim 1 above;
Para 20: “FIG. 16 illustrates image retrieval using real-world medical descriptions extracted from the WHO dataset, which wasn't included in the original training set. Under each description, the associated primary diagnosis is displayed. To the right, the top three images retrieved by the model are presented alongside their corresponding primary diagnoses. This illustration showcases the model's ability to retrieve and match images based on textual medical descriptions accurately.”;
para 31: “The choice of feature extraction technique may be chosen depending on the dataset and the availability of annotations for the object detection task.”;
para 84: “In relation to the PatchGastricADC22 dataset, which is characterized by paired image-description entries at the WSI level, the provided descriptions were employed as text inputs.”;
para 109: “In the processing of the PatchGastricADC22 dataset, patches of 300 by 300 pixels in size were utilized. The description of each extracted patch was linked to the corresponding WSI it was obtained from.”;
para 116: “The dataset under consideration contains authentic captions penned by pathologists”).
As per claim 8, Tizhoosh teaches the apparatus of claim 7, wherein the textual description associated with the image of the person differs, on a verbatim basis, from at least one of the textual descriptions used to tag the one or more images (Tizhoosh: See arguments and citations offered in rejecting claim 7 above;
Para 20: “FIG. 16 illustrates image retrieval using real-world medical descriptions extracted from the WHO dataset, which wasn't included in the original training set. Under each description, the associated primary diagnosis is displayed. To the right, the top three images retrieved by the model are presented alongside their corresponding primary diagnoses. This illustration showcases the model's ability to retrieve and match images based on textual medical descriptions accurately.”
Para 101: “The dataset used for this experiment was sourced from the WHO (World Health Organization), which was not part of the original training set but provided real medical reports all related to the same primary diagnosis.”).
As per claim 9, Tizhoosh teaches the apparatus of claim 7, wherein the image of the person includes a medical scan image that depicts an anatomical structure of the person, wherein the textual description associated with the image of the person indicates an abnormality of the anatomical structure, and wherein at least one of the one or more images identified from the image repository depicts the anatomical structure of a different person with a substantially similar abnormality (Tizhoosh: See arguments and citations offered in rejecting claim 7 above;
para 63: “The image modality may be associated with histopathology images (e.g., whole slide images, patches extracted from whole slide images), medical images (e.g., MR images, CT images, ultrasound images, PET images, etc.), or the like.”).
As per claim 10, Tizhoosh teaches the apparatus of claim 1, wherein the one or more processors being configured to provide the indication regarding the one or more images identified from the image repository comprises the one or more processors being configured to provide a ranking of the one or more images based on respective relevance of the one or more images to the image of the person (Tizhoosh: See arguments and citations offered in rejecting claim 1 above;
para 102: “The top three results from the text-to-image (t2i) retrieval process are displayed in FIG. 16.”;
[media_image21.png, reproduced from above (image not rendered in this text)]
Para 105: “top K retrieved items”).
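As an editorial illustration of ranking repository images by relevance, the sketch below assumes cosine similarity over the feature representations; the scoring choice and function name are assumptions, not taken from Tizhoosh.

```python
import torch
import torch.nn.functional as F

def rank_repository(query_embedding, repository_embeddings, k=3):
    """Score every repository image by cosine similarity to the query's
    feature representation; return the K most relevant (index, score) pairs."""
    q = F.normalize(query_embedding, dim=-1)        # (1, dim) query
    r = F.normalize(repository_embeddings, dim=-1)  # (N, dim) repository
    scores = (r @ q.t()).squeeze(-1)                # (N,) similarity per image
    top = torch.topk(scores, k)
    return list(zip(top.indices.tolist(), top.values.tolist()))
```

Under these assumptions, a "top three results" display such as that referenced in Tizhoosh para 102 would correspond to k=3.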
As per claims 11, 13, and 15-19, the arguments made in rejecting claims 1, 3, 5-7, 9, and 8, respectively, are analogous. Considering claim 11, Tizhoosh also teaches identifying one or more images from an image repository based on at least the feature representation that represents the image of the person and the textual description as the associated pair (Tizhoosh: See arguments and citations offered in rejecting claim 1 above).
As per claim 20, Tizhoosh teaches a non-transitory computer-readable medium comprising instructions that, when executed by a processor included in a computing device, cause the processor to implement the method of claim 11 (Tizhoosh: paras 136-149; Fig. 19).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Tizhoosh as applied to claims 3 and 13 above, and further in view of US 20240185588 A1 (Kumari).
As per claim 4, Tizhoosh teaches the apparatus of claim 3, wherein the features extracted from the textual description are provided to the cross-attention module (Tizhoosh: See arguments and citations offered in rejecting claim 3 above;
Fig. 8 and associated text (both shown above)).
Tizhoosh is silent regarding “in one or more key matrices and one or more value matrices … in one or more query matrices”.
Kumari teaches the features extracted from the textual description are provided to the cross-attention module in one or more key matrices and one or more value matrices, and wherein the features extracted from the image of the person are provided to the cross-attention module in one or more query matrices (Kumari:
[media_image30.png through media_image33.png: excerpts of Kumari (images not rendered in this text)]
Para 111 (shown below): “Q mappings 1030 transform pixel features 1005 into a query vector. K mappings 1020 may transform text features 1010 into a key vector, and V mappings 1024 may transform the text features 1025 into a value vector. Projection mappings 1015 may include weights that are trained, enabling cross-attention between pixel features 1005 and text features 1010. In some embodiments, only K mappings 1020 and V mappings 1025 are trained during fine-tuning”:
[media_image34.png and media_image35.png: excerpts of Kumari (images not rendered in this text)]
).
Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the teachings of Kumari into Tizhoosh. Both Tizhoosh and Kumari suggest a practical solution that uses cross-attention to map an input text condition to image features, wherein the text condition is determined using a text transformer. Kumari additionally teaches providing the features extracted from the textual description to the cross-attention module in one or more key matrices and one or more value matrices, and providing the features extracted from the image of the person to the cross-attention module in one or more query matrices, so that “The cross-attention layers modify the latent features of the network according to the input text condition” (Kumari: para 86). Furthermore, one of ordinary skill in the art could have combined the elements as claimed by known methods, and, in combination, each component functions the same as it does separately. One of ordinary skill in the art would have recognized that the results of the combination would be predictable.
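As an editorial illustration of the combination articulated above, the sketch below projects text features into key and value matrices and image features into query matrices; the projection names and sizes are assumptions, not taken from Kumari.

```python
import torch
import torch.nn as nn

dim = 256  # assumed shared feature width, for illustration only
w_q = nn.Linear(dim, dim, bias=False)  # Q projection applied to image features
w_k = nn.Linear(dim, dim, bias=False)  # K projection applied to text features
w_v = nn.Linear(dim, dim, bias=False)  # V projection applied to text features

def cross_attention(image_feats, text_feats):
    """Scaled dot-product cross-attention: queries from the image,
    keys/values from the textual description."""
    q = w_q(image_feats)                  # (B, Ni, dim) query matrices
    k = w_k(text_feats)                   # (B, Nt, dim) key matrices
    v = w_v(text_feats)                   # (B, Nt, dim) value matrices
    attn = torch.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)
    return attn @ v                       # text-conditioned image features
```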
As per claim 14, the arguments made in rejecting claim 4 are analogous.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
US 20230054096 A1 (ICHINOSE) also teaches all limitations of claims 1, 6, 11, 16, and 20.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Atiba Fitzpatrick, whose telephone number is (571) 270-5255. The examiner can normally be reached Monday through Friday, 10:00 am to 6:00 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Bee can be reached on (571) 270-5183. The fax phone number for Atiba Fitzpatrick is (571) 270-6255.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ATIBA O FITZPATRICK/
Primary Examiner, Art Unit 2677