DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Regarding claims 1 and 8 and their respective dependent claims, Applicant's arguments filed 03/26/2026 have been fully considered but they are not persuasive.
Regarding claim 1, Applicant asserts on pages 6-7 of the Remarks that claim 1, as amended, is novel over Huang for incorporating features of original claim 3 and part of original claim 4. Specifically, Applicant asserts that the method of Huang generates a single text match for an image using both local and global features, as opposed to matching a plurality of sets of visual tokens to a set of text tokens. Examiner respectfully disagrees with this assertion. In particular, as written, the claim language “a multimodal fusion model that matches each of sets of visual tokens to a corresponding set of text tokens characterizing the pathology” may be interpreted as performing a matching or comparison such as that conducted by Huang in order to determine the similarity of each text candidate feature set from a corresponding report to each set of image features representing each of the local image sub-regions. Examiner has interpreted the local image-text similarity computation of Huang as corresponding to the “matching” limitation of the claims, as the claim does not clearly specify that the output of the matching is one set of text tokens per set of visual tokens. The previous rejection of claim 1 over Huang is therefore maintained, as set forth below.
Regarding claim 8, Applicant indicates that claim 15 has been incorporated, but Examiner notes that the claim does not clearly specify that matching each set of visual tokens is performed such that an individual set of text tokens per set of visual tokens is the result or output of the matching, similar to the rationale set forth above with respect to independent claim 1. Additionally, the limitation of “providing an output based on the set of text tokens for each of the plurality of tiles” in amended claim 8 and “the output being provided according to the set of text tokens for each of the plurality of tiles” as recited in original claim 15 are interpreted differently by Examiner, as “based on” could mean that the individual results of matching/comparing each set of image features to a set of candidate text features could result in an aggregation of all the matching results into a final whole image classification, or a display of particular regions having corresponding associated words, such as disclosed by Huang (see Fig. 4), while an output being provided “according to the set of text tokens for each of the plurality of tiles” would require information for each set of text tokens corresponding to each of the tiles to be output, and not just particular regions of the image or a single classification. Claim 8 therefore does not include all features of claim 15 as originally written, and can be read as comparing each set of image patch embeddings to a candidate set of text tokens corresponding to the image, and producing a final output based on said comparison, which is covered by Huang, as set forth below in the rejection of claim 8.
All new grounds of rejection set forth below were necessitated by the amendment.
Claim Objections
Claims 1, 4, 8, 16 and 21 are objected to because of the following informalities:
In claim 1, Examiner suggests correcting “each of sets of visual tokens” to read as “each of the plurality of sets of visual tokens”;
In claim 4, Examiner suggests correction of “the user interface an output…” to read as “the user interface displays an output…”;
In claim 8, Examiner suggests changing the limitation “to provide a set of visual tokens” to read as “to generate a set of visual tokens”, correcting the limitation “representing the plurality of tiles” to read as “representing each of the plurality of tiles”, and removing the extra instance of “the” from the limitation “providing an output based on the the set of text tokens”;
In claim 16, Examiner suggests correcting “wherein matching each set of text tokens to the set of text tokens” to read as “wherein matching each set of visual tokens to the set of text tokens”;
In claim 21, Examiner suggests correction of “each of sets of visual tokens” to read as “each of the plurality of sets of visual tokens”, and correction of “representing the set of text token” to read as “representing the set of text tokens”.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1, 4-7, 13 and 24 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 1 recites the limitation "an output representing the second set of tokens". There is insufficient antecedent basis for the limitation “the second set of tokens” in the claim. Appropriate clarification and correction is required.
Claim 4 recites “The system of claim 3”, but claim 3 has been cancelled by Applicant’s amendment. Examiner believes the correct dependency to be from claim 1. Appropriate correction is required.
Claims 5-7 recite “The system of claim [[1]]”. The brackets indicate that the claim dependency has been removed from the claims. For purposes of examination, these claims will be interpreted to be dependent from claim 1. Appropriate correction is required.
Claims 7 and 24 recite the limitation “a contrastive objective component that aligns the first and second encoders”. There is insufficient antecedent basis for “the…second encoders” in the claims, because claims 1 and 21 each recite only “a first encoder”. Appropriate clarification and correction is required.
Claim 13 recites the limitation “generating the first set of tokens”. There is insufficient antecedent basis for “the first set of tokens” in the claim. Examiner believes the claim should recite “generating the plurality of sets of visual tokens” commensurate with the visual tokens recited in claim 8. Appropriate correction is required.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1, 7-8, 10-11 and 16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by “GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition” (hereinafter “Huang”; published 2021).
Regarding claim 1, Huang discloses a system comprising: a processor; and a non-transitory computer readable medium storing instructions executable by the processor to provide (Huang, Introduction, first through third paragraphs; Section 3, Method; Figs. 1-2; “deep learning and computer vision provide a promising solution for automating medical image analysis”):
an image interface that receives a received image representing a pathology and divides the received image into a plurality of tiles (Huang, Sections 3.2.3-3.3.1; “we propose to leverage both the global and local features for a more accurate retrieval. We use the attention-driven image-text matching-score Z(tli, vli) defined in Eq. 6 as the similarity metric for the local representations. In this way, the localized similarity between the query image and candidate sentences can be calculated base on the context-aware local representations”);
a first encoder that reduces each of the plurality of tiles to a set of visual tokens to provide a plurality of sets of visual tokens (Huang, Section 3.1.1; “We extract the local image features from an intermediate convolution layer and vectorize to get the C-dimensional features for each of M image sub-regions” using the image encoder Ev);
a multimodal fusion model that matches each of sets of visual tokens to a corresponding set of text tokens characterizing the pathology (Huang, Section 3.3.1; “We use the attention-driven image-text matching-score Z(tli, vli) defined in Eq. 6 as the similarity metric for the local representations. In this way, the localized similarity between the query image and candidate sentences can be calculated base on the context-aware local representations. Finally, the image-text retrieval task is completed based on the aggregated image-text similarity metric by averaging the global and local similarities as shown in Fig. 3.”),
the multimodal fusion model being trained on an pretraining dataset compiled from a plurality of pathology-related sources, a given training sample within the pretraining dataset comprising a data representing a pathology and data characterizing the data representing the pathology (Huang, Section 3, p.3944-3947, Fig. 2; “Given a pair of medical image and report, we first use the image encoder and text encoder to extract image and text features respectively. The global image-text representations are learned through the global contrastive loss. For learning local representations, we compute the similarity matrix based on the image sub-region features and word-level features to generate attention-weighted image representations. The local contrastive objective is based on the attention-weighted image representations and the corresponding word representations. The overall representation learning framework is trained end-to-end by jointly optimizing both local and global contrastive losses.” The training dataset comprises pairs of images and associated reports containing pathological findings in medical imaging examinations, of which the images are read here as the claimed “data representing a pathology” and the text phrases extracted from reports are read as the claimed “data characterizing the data representing the pathology”, and the trained model is used for zero-shot classification of an input image); and
a user interface that displays an output representing the second set of tokens (Huang, Fig. 4, Section 4.6; “well-trained attention weights should correctly identify significant image regions that correspond to a particular word…Fig. 4 demonstrates our attention model is able correctly identify significant image regions for a given word. For instance, the attention based on the word “Pneumonia” Fig. 4a (bottom) correctly localize regions of the right lower lobe containing heterogenous consolidative opacities indicative of pneumonia”).
Regarding claim 7, claim 1 is incorporated, and Huang further discloses wherein the multimodal fusion model is trained using an objective function having a contrastive objective component that aligns the first and second encoders by maximizing cosine-similarity scores between paired image and text embeddings and a captioning objective that maximizes the likelihood of generating the correct text conditioned on the image and previously generated text (Huang, Sections 3.2.2-3.2.3; “the global objective is formulated as minimizing the negative log posterior probability…where τ1 ∈ R is a scaling temperature parameter, and ⟨vgi, tgi⟩ represents the cosine similarity between the global image representation vgi and global text features” and “Similarly, due to the mutual correlation between the image and text pairs, we also maximize the posterior probability of the text given its corresponding image.”).
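For illustration only, the global contrastive objective quoted above (negative log posterior over temperature-scaled cosine similarities, computed in both the image-to-text and text-to-image directions) can be sketched as follows. This is a minimal sketch and not Huang's implementation: the function names, the default temperature value, and the averaging over the batch are assumptions introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_contrastive_loss(image_feats, text_feats, tau=0.1):
    """Symmetric contrastive loss over paired global image/text features.

    image_feats[i] and text_feats[i] are assumed to be a matched pair.
    tau (hypothetical default) plays the role of the scaling temperature.
    """
    n = len(image_feats)
    # Temperature-scaled cosine similarity for every image/text combination.
    sims = [[cosine(v, t) / tau for t in text_feats] for v in image_feats]
    loss_image_to_text = 0.0
    loss_text_to_image = 0.0
    for i in range(n):
        row = sims[i]                        # image i against all texts
        col = [sims[k][i] for k in range(n)]  # text i against all images
        # Negative log of the softmax probability assigned to the true pair.
        loss_image_to_text -= row[i] - math.log(sum(math.exp(s) for s in row))
        loss_text_to_image -= col[i] - math.log(sum(math.exp(s) for s in col))
    # Averaging over the batch is an assumption of this sketch.
    return (loss_image_to_text + loss_text_to_image) / n
```

Under this sketch, correctly paired features yield a lower loss than mismatched features, which is the alignment behavior the quoted passage describes.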
Regarding claim 8, Huang discloses a method (Huang, Fig. 2, Fig. 3, Section 3, Method) comprising:
receiving an input image representing a pathology (Huang, Section 3.3.2; “In zero-shot classification, we take an image xv as input and aim at predicting the corresponding label”);
dividing the input into a plurality of tiles; providing the plurality of tiles to a vision encoder to provide a set of visual tokens for each of the plurality of tiles (Huang, Section 3.1.1; “We extract the local image features from an intermediate convolution layer and vectorize to get the C-dimensional features for each of M image sub-regions” using the image encoder Ev);
matching each set of visual tokens representing the plurality of tiles to a set of text tokens at a multimodal fusion model trained on a pretraining dataset compiled from a plurality of pathology-related sources, a given training sample within the pretraining dataset comprising a data representing a pathology and text describing the image (Huang, Section 3, p.3944-3947, Fig. 2; “Given a pair of medical image and report, we first use the image encoder and text encoder to extract image and text features respectively. The global image-text representations are learned through the global contrastive loss. For learning local representations, we compute the similarity matrix based on the image sub-region features and word-level features to generate attention-weighted image representations. The local contrastive objective is based on the attention-weighted image representations and the corresponding word representations. The overall representation learning framework is trained end-to-end by jointly optimizing both local and global contrastive losses.” The training dataset comprises pairs of images and associated reports containing pathological findings in the medical imaging examinations, of which the images are read here as the claimed “data representing a pathology” and the text phrases extracted from radiology reports are read as the claimed “text describing the image”, and the trained model is used for zero-shot classification of an input image based on image-text similarity scores for the local representations); and
providing an output based on the set of text tokens for each of the plurality of tiles (Huang, Fig. 4, Section 4.6; “well-trained attention weights should correctly identify significant image regions that correspond to a particular word…Fig. 4 demonstrates our attention model is able correctly identify significant image regions for a given word. For instance, the attention based on the word “Pneumonia” Fig. 4a (bottom) correctly localize regions of the right lower lobe containing heterogenous consolidative opacities indicative of pneumonia”).
Regarding claim 10, claim 8 is incorporated, and Huang further discloses wherein the provided output is a class label associated with the input image (Huang, Section 3.3.1-3.3.2; “In the image-text retrieval task, a query image is used as the input to retrieve the closet matching text based on the similarities between their representations”).
Regarding claim 11, claim 8 is incorporated, and Huang further discloses wherein the provided output is a segmented representation of the input image (Huang, Fig. 4, Section 4.6; “well-trained attention weights should correctly identify significant image regions that correspond to a particular word…Fig. 4 demonstrates our attention model is able correctly identify significant image regions for a given word. For instance, the attention based on the word “Pneumonia” Fig. 4a (bottom) correctly localize regions of the right lower lobe containing heterogenous consolidative opacities indicative of pneumonia”).
Regarding claim 16, claim 8 is incorporated, and Huang further discloses wherein matching each set of text tokens to the set of text tokens at the multimodal fusion model comprises generating a similarity metric between the set of visual tokens for each of the plurality of tiles with the set of text tokens associated with the input image, the output being provided according to the similarity metric for each of the plurality of tiles (Huang, Section 3.2.3-3.3.1; “In the image-text retrieval task, a query image is used as the input to retrieve the closet matching text based on the similarities between their representations…we propose to leverage both the global and local features for a more accurate retrieval. We use the attention-driven image-text matching-score Z(tli, vli) defined in Eq. 6 as the similarity metric for the local representations. In this way, the localized similarity between the query image and candidate sentences can be calculated based on the context-aware local representations”).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Huang, as applied to claim 1 above, in view of “Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing” (hereinafter “Boecking”; published 2022).
Regarding claim 5, claim 1 is incorporated, and Huang does not expressly teach the limitations as further claimed, but, in an analogous field of endeavor, Boecking does as follows.
Boecking teaches wherein the multimodal fusion model provides a set of text tokens for the received image and a similarity metric for each tile for the set of text tokens, the output representing the similarity metric for each tile (Boecking, p.6-7, Section 2.2, Section 4.1, Fig. 3; “For each input image…we use the image encoder and projection module to obtain patch embeddings…for segmentation tasks… Probabilities for classes/regions can then be computed via a softmax over the cosine similarities between the image (or region) and prompt representations.”).
Boecking is considered analogous art because it pertains to biomedical vision-language data processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system taught by Huang to include outputting a heatmap representation of the similarity between the text embeddings and the image patch embeddings, as taught by Boecking, in order to achieve more accurate local alignment and visualization of corresponding text phrases to image regions (Boecking, p.7-8, Section 3).
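For illustration only, the patch-level computation quoted from Boecking (a softmax over cosine similarities between patch embeddings and prompt representations) can be sketched as follows. This is a minimal sketch under stated assumptions, not Boecking's implementation: the function names and the temperature parameter are hypothetical and are introduced here only for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_class_probabilities(patch_embeddings, prompt_embeddings, temperature=1.0):
    """Return, for each patch, a probability distribution over text prompts.

    temperature is a hypothetical scaling parameter, not taken from the reference.
    """
    probabilities = []
    for patch in patch_embeddings:
        logits = [cosine(patch, prompt) / temperature for prompt in prompt_embeddings]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(logit - peak) for logit in logits]
        total = sum(exps)
        probabilities.append([e / total for e in exps])
    return probabilities
```

Each row of the result corresponds to one image patch, so the rows can be arranged on the patch grid to form the kind of per-region similarity map discussed in the rejection.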
Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Huang, as applied to claims 1 and 8 above, in view of “Self-distillation Augmented Masked Autoencoders for Histopathological Image Understanding” (hereinafter “Luo”; published 2023).
Regarding claim 6, claim 1 is incorporated, and Huang does not expressly teach the limitations as further claimed, but, in an analogous field of endeavor, Luo does as follows.
Luo teaches wherein the first encoder is trained on a plurality of pathology images via a self-supervising learning algorithm using an objective function including a self-distillation loss and a masked image modeling loss (Luo, Section II.A-C, equation (4); the total loss includes a self-distillation loss and an MSE loss on masked patches).
Luo is considered analogous art because it pertains to biomedical image analysis using machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system taught by Huang to pretrain the image encoder using a loss function incorporating both a self-distillation loss and an MSE loss based on masked image patches, as taught by Luo, in order to obtain better self-supervised feature learning in the pretraining process (Luo, Section II.C).
Regarding claim 13, claim 8 is incorporated, and Huang further teaches wherein generating the first set of tokens comprises providing the input image to a vision encoder trained on a plurality of pathology images (Huang, Section 3, p.3944-3947, Fig. 2; “Given a pair of medical image and report, we first use the image encoder and text encoder to extract image and text features respectively. The global image-text representations are learned through the global contrastive loss. For learning local representations, we compute the similarity matrix based on the image sub-region features and word-level features to generate attention-weighted image representations. The local contrastive objective is based on the attention-weighted image representations and the corresponding word representations. The overall representation learning framework is trained end-to-end by jointly optimizing both local and global contrastive losses.” The training dataset comprises a plurality of pairs of images and associated reports containing pathological findings in the medical imaging examinations, and the resultant trained model is used for zero-shot classification of an input image).
Huang does not expressly teach the remaining limitations as further claimed, but, in an analogous field of endeavor, Luo does as follows.
Luo teaches a vision encoder trained on a plurality of pathology images via a self-supervising learning algorithm using an objective function including a self-distillation loss and a masked image modeling loss (Luo, Section II.A-C, equation (4); the total loss includes a self-distillation loss and an MSE loss on masked patches).
Luo is considered analogous art because it pertains to biomedical image analysis using machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method taught by Huang to pretrain the image encoder using a loss function incorporating both a self-distillation loss and an MSE loss based on masked image patches, as taught by Luo, in order to obtain better self-supervised feature learning in the pretraining process (Luo, Section II.C).
Allowable Subject Matter
Claim 4 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
Regarding claim 4, none of the cited prior art of record, either alone or in combination, expressly teaches “wherein the user interface [displays] an output representing the sets of text tokens for each of the plurality of tiles”. In particular, while Huang and Boecking above both teach calculating image patch-level embeddings for an input image and comparing them to text embeddings of a text phrase, neither expressly teaches or suggests displaying an output representing a set of text tokens for each of the plurality of tiles, as claimed.
Claims 21-23 are allowed.
The following is an examiner’s statement of reasons for allowance: The closest prior art of record, either alone or in combination, does not expressly teach the entire combination of limitations as recited in independent claim 21. In particular, while Huang and Boecking above both teach calculating image patch-level embeddings for an input image and comparing each patch-level embedding to text embeddings of a text phrase based on a calculated similarity, neither reference expressly teaches or suggests displaying an output representing the set of text tokens for each of the plurality of tiles, as claimed. Claims 22-23 are allowable by virtue of their dependency from claim 21.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Claim 24 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The additional references cited pertain generally to matching image to text features for medical image classification/interpretation.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SAMAH A BEG whose telephone number is (571)270-7912. The examiner can normally be reached M-F 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, HENOK SHIFERAW, can be reached at 571-272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SAMAH A BEG/
Primary Examiner, Art Unit 2676