Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
This action is in response to the amendments filed 09/05/2025. Claims 1, 8, 15, and 21 have been amended, and claim 22 has been added. Claims 1-22 are currently pending.
Response to Arguments
Applicant’s arguments and amendments regarding the 101 rejection have been fully considered and are considered persuasive. The claims recite particular technical operations related to image transformation operations that could not be interpreted as judicial exceptions; therefore, the 101 rejection of claims 1-22 is withdrawn.
Applicant’s arguments regarding the prior art rejection have been fully considered but are moot because of the new ground(s) of rejection. Applicant argues that the prior art fails to teach masking a common feature “so as to exclude at least part of the common feature from the set of features, wherein masking the common feature comprises tokenizing and modifying the common feature to perform attention masking. . .”. Examiner notes that the Park reference teaches tokenizing and modifying features as part of an attention masking technique in at least paragraphs [0101]-[0111], in combination with the Loo reference, which teaches a method to remove a common feature in at least paragraphs [0006]-[0008]. Applicant also argues that the prior art does not teach training a machine learning model “using the set of features that exclude at least part of the common feature in the at least one machine learning training forward propagation” to predict whether an input document is out-of-domain. Examiner notes that the Park reference teaches training a model on masked features in at least paragraphs [0050], [0082], and [0089], and that this model can be a feedforward model in at least paragraph [0080]. One of ordinary skill in the art would recognize that a feedforward model is trained using forward and back propagation operations. The Park reference is relied upon in combination with the Chen reference, which teaches training a model to identify out-of-domain documents in at least paragraph [0043], and the Loo reference, which teaches that common features can be removed from a training dataset in at least paragraphs [0006]-[0008].
Examiner also notes that the scope of claim 21 has changed and that claim 22 has been added. Claim 21 is now rejected by the combination of the Park, Chen, Loo, and Oberoi references. Claim 22 is rejected by the combination of the Park, Chen, and Loo references. The prior art rejections have been updated to include the amended limitations and to clarify the reasoning given for the limitations that were not amended.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-12, 14-20 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Park et al (US 20240290065 A1, herein Park) in view of Chen et al (US 20230410483 A1, herein Chen), in further view of Loo et al (US 20070009160 A1, herein Loo).
Regarding claim 1, Park teaches a system [for detecting out-of-domain documents], comprising: one or more processors; and memory storing computer-executable code that, as a result of execution by the one or more processors, cause the system (para. [0006] recites “Aspects of the present disclosure provide a method and system for accurately embedding multimodal data”) to at least:
identifying a common feature between both an out-of-domain machine learning training document and an in-domain machine learning training document (para. [0082] recites “the multimodal encoder refers to a module that encodes text-related features (i.e., token features) and image-related features (i.e., patch features) together to generate joint embeddings. The term "joint embeddings" may refer to embeddings in a joint (or shared) space and may include multiple embeddings (features) corresponding to the tokens of a text sample”. Para. [0087] recites “the embedding system 10 may input an embedding/feature 95 corresponding to a CLS (i.e., classification) token, among the joint embeddings 94, into a prediction layer 91 to generate a matching score for the two samples (i.e., to predict the matching status of the two samples). The prediction layer 91 may be implemented as a binary classification layer (e.g., a fully-connected layer) performing binary classification for matching status (i.e., a matching or nonmatching class), and a confidence score for the matching class or a processed value thereof may serve as the matching score” (i.e., determining shared, or common features between embeddings from in-domain and out-of-domain documents));
extract a set of features for at least one machine learning training forward propagation (Park para. [0080] recites "Each of the encoding layers 72 may be configured to include at least one self-attention layer 73 and at least one feedforward layer 74. The self-attention layer 73 may analyze the relationships between input embeddings 75, and the feedforward layer 74 may aggregate information on the input embeddings 75 based on the results of the analysis. The self-attention layer 73 and the feedforward layer 74 are already well known in the art". Para. [0103]-[0104] recite "Referring to FIG. 13, it is assumed that an attention map 134 is generated in an attention layer 131 of the multimodal encoder 90. As previously mentioned, the attention map 134 may be understood as being a map representing the relationships between token features 133 and patch features 132. The embedding system 10 may extract attention values for a region 137 of the attention map 134 that corresponds to the specific token" (i.e., extracting features as part of a feedforward, or forward propagation operation)),
wherein the masking the [common] feature comprises tokenizing and modifying the [common] features to perform attention masking on at least part of the set of features corresponding to the [common] feature (para. [0089] recites “Referring back to FIG. 3, in S36, the feature associated with a specific token in (within) the patch features may be softly masked. The embedding system 10 may select the specific token in accordance with a predefined condition (e.g., select a token that has been substituted with a mask token)”. Para. [0101]-[0102] recite “Referring to FIG. 12, to generate a soft mask, the embedding system 10 may acquire an attention map generated in an attention layer of a multimodal encoder (S121). The multimodal encoder may include at least one attention layer, such as a cross-attention layer, and the attention layer may generate an attention map by analyzing the relationships between patch features (or patches) and token features (or tokens). In S122, attention values for a specific token and patch features may be extracted from the attention map. The extracted attention values may be understood as representing the relationships between features of the specific token and the patch features”. Para. [0111] recites “Referring back to FIG. 12, in S123, a soft mask may be generated based on the attention values extracted in S122” (i.e., performing attention masking on a given feature by tokenizing and modifying the feature));
train, using the set of features that exclude at least part of the [common] feature in the at least one machine learning training forward propagation, a machine learning model to produce a trained machine learning model [that predicts whether an input document is out-of-domain] (para. [0050] recites “As illustrated in FIG. 1, a multimodal embedding system 10 may be a device/system capable of embedding given multimodal data using a deep learning model 11. For example, the multimodal embedding system 10 may train the deep learning model 11 by utilizing paired datasets 12 and may embed multimodal data using the trained deep learning model 11. The deep learning model 11 may also be referred to as an "embedding model" or "multimodal embedding model"”. Para. [0079]-[0080] recite “The embedding layer 71 may refer to a layer that receives each of multiple tokens 63 (e.g., receives the one-hot vectors of the respective tokens 63) and outputs token-level embeddings 75. Each of the encoding layers 72 may be configured to include at least one self-attention layer 73 and at least one feedforward layer 74. The self-attention layer 73 may analyze the relationships between input embeddings 75, and the feedforward layer 74 may aggregate information on the input embeddings 75 based on the results of the analysis”. Para. [0082] recites “Referring back to FIG. 3, in S34, token features and patch features may be input (fed) into a multimodal encoder to generate joint embeddings. Here, the multimodal encoder refers to a module that encodes text-related features (i.e., token features) and image-related features (i.e., patch features) together to generate joint embeddings”. Para. [0089] recites “Referring back to FIG. 3, in S36, the feature associated with a specific token in (within) the patch features may be softly masked” (i.e., training a feedforward machine learning model using the extracted features, wherein the at least one features is masked, or excluded));
However, while Park teaches a method for training a machine learning model (see at least para. [0050] of Park), Park does not explicitly teach training a model to predict whether an input document is out-of-domain and wherein the set of features includes features of the out-of-domain machine learning training document and an in-domain machine learning training document.
Chen teaches training a model to predict whether an input document is out-of-domain (para. [0043] recites “The supervised training process 160 fine-tunes the pre-trained image encoder 150 integrated with the image analysis model 170 to teach the image analysis model 170 to perform downstream vision tasks such as image segmentation tasks or image classification tasks. Each annotated MD medical image 204 includes a plurality of image voxels 206 each paired with a corresponding ground-truth label 208 indicating a class the corresponding image voxel 206 belongs to. Notably, the unannotated 3D images 202 in the first training data set 201 used to pre-train the image encoder 150 may be associated with a different medical domain than the annotated 3D images 204 in annotated second training data set 203” (i.e., using a trained model to predict whether documents are in a different class than a class, or domain used during training)).
and wherein the set of features includes features of the out-of-domain machine learning training document and an in-domain machine learning training document (para. [0043] recites “After the image encoder 150 is pre-trained via the self-supervised training process 200, the supervised training process 160 trains the image analysis model 170 on a second training data set 203 that includes the plurality of annotated MD medical images 204. the unannotated 3D images 202 in the first training data set 201 used to pre-train the image encoder 150 may be associated with a different medical domain than the annotated 3D images 204 in annotated second training data set 203. For instance, the first data set 201 may include chest CT scans while the second data set 203 may include abdominal CT scans or multimodal MRI scans of brain tumors.” (i.e., the set of features can include features from in-domain documents, such as the CT scans, and out-of-domain documents, such as the MRI scans));
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by applying the technique of determining shared features in training documents from Park to the image processing system from Chen. Chen and Park are both directed to image processing methods, and Chen notes that its techniques can precede image classification systems in at least paragraph [0036] and Park teaches that its methods can be applied to image captioning, or classification, in at least paragraph [0054]. As such, one of ordinary skill in the art would understand how to combine the methods of determining which features to omit from a training image document from Chen with the method of determining shared features in training documents from Park to obtain predictable results.
However, while Park teaches masking features (see at least para. [0051]) and identifying features that may be shared between training documents (see at least paragraph [0082]), the combination of Park and Chen does not explicitly teach masking a common feature so as to exclude at least part of the common feature from the set of features.
Loo teaches masking a common feature so as to exclude at least part of the common feature from the set of features (para. [0006] recites “in many applications where the interested response is weaker than the common characteristics, normalization to the common characteristics instead of the interested response reduces the discriminatory power of the spectra significantly. In addition, the large dimension of common characteristics and noise retained after normalization put extra burden on a feature extraction module and lower the processing speed of the pattern recognizer significantly”. Para. [0008] recites “The data analyzer comprises a data removal module for identifying and removing portions of the set of indexed data having insufficient discriminatory power based on the ensemble statistics of the set of indexed data. The data removal module functions to remove portions of the spectra from further processing if such portions do not have sufficient discriminatory power (namely if they cannot help classify the spectrum into an appropriate category). In this regard, the data removal module may comprise a common characteristic removal module which includes means for identifying and removing common characteristics of the set of indexed data based on the ensemble statistics of the set of indexed data” (i.e., removing, or masking a common characteristic, or feature)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by modifying the contrastive learning system from Chen (which modifies Park) to utilize the method of removing common characteristics, or features from Loo. Loo and Chen are both directed to data classification systems; Loo teaches in at least paragraph [0006] that retaining common characteristics can reduce the discriminatory power of the classifier. As such, one of ordinary skill would benefit from using the method from Loo to remove common characteristics from the data in the contrastive learning system from the combination of Park and Chen to increase the discriminatory power of the classifier.
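For illustration only, the common-characteristic removal quoted from Loo para. [0008] can be sketched as follows. This is a minimal sketch and not Loo's disclosed implementation: the quoted passage does not name the ensemble statistic, so per-feature variance across all samples is an assumption, and the function name `remove_common_features`, the `min_variance` threshold, and the sample values are hypothetical.

```python
from statistics import pvariance


def remove_common_features(samples, min_variance=1e-3):
    """Drop feature columns that barely vary across the whole ensemble.

    Assumption: "insufficient discriminatory power" is approximated here by
    low population variance over all samples, regardless of class.
    """
    columns = list(zip(*samples))  # one tuple of values per feature index
    keep = [i for i, col in enumerate(columns) if pvariance(col) >= min_variance]
    reduced = [[row[i] for i in keep] for row in samples]
    return keep, reduced


# Hypothetical feature vectors: indices 0 and 2 are identical in every
# document (a common characteristic); index 1 differs between the groups.
in_domain = [[0.9, 0.1, 0.5], [0.9, 0.2, 0.5]]
out_of_domain = [[0.9, 0.8, 0.5], [0.9, 0.7, 0.5]]

kept, reduced = remove_common_features(in_domain + out_of_domain)
print(kept)     # indices of the features that were retained
print(reduced)  # samples with the common features removed
```

In this sketch only the feature that differs between the two groups survives, which mirrors the stated rationale that retained common characteristics reduce discriminatory power.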
Regarding claim 2, the combination of Park, Chen, and Loo teaches the system of claim 1, wherein the computer-executable code that causes the system to produce the trained machine learning model includes executable code that causes the system to compare a first embedding associated with one or more in-domain documents to a second embedding associated with one or more OOD documents (Chen para. [0052] recites “the self-supervised MIM training process 200 adds positional embeddings 215 to the image patches 210. The image encoder 150 receives each masked image patch 210M, whereby each masked image patch may be replaced with a special masking embedding [M]”. Chen para. [0053] recites “the training process 200 teaches the encoder 150 to produce encoded feature representations 225 for the masked image patches 210M for use in generating predicted tokens 275 that match the visual tokens 240 obtained from the original 3D image 202. Here, the training process 200 may determine a training loss based on the predicted tokens 275 generated for the masked image patches 210M and the corresponding visual tokens from the sequence of discrete visual tokens 240 that are aligned (i.e., using the positional embeddings 215) with the masked image patches 210M”. Park para. [0087] recites “the embedding system 10 may input an embedding/feature 95 corresponding to a CLS (i.e., classification) token, among the joint embeddings 94, into a prediction layer 91 to generate a matching score for the two samples (i.e., to predict the matching status of the two samples). The prediction layer 91 may be implemented as a binary classification layer (e.g., a fully-connected layer) performing binary classification for matching status (i.e., a matching or nonmatching class), and a confidence score for the matching class or a processed value thereof may serve as the matching score” (i.e., comparing embeddings from in-domain and out-of-domain documents)).
Regarding claim 3, the combination of Park, Chen, and Loo teaches the system of claim 1, wherein the set of features includes in-domain data and OOD data (Chen para. [0040] recites “The self-supervised training process 200 trains the image encoder 150 on a first training data set 201 that includes the plurality of unannotated multi-dimensional (MD) images 202”. Para. [0043] recites “After the image encoder 150 is pre-trained via the self-supervised training process 200, the supervised training process 160 trains the image analysis model 170 on a second training data set 203 that includes the plurality of annotated MD medical images 204” (i.e., a set of features). Park para. [0062] recites “Referring to FIG. 3, the multimodal embedding method according to some embodiments of the present disclosure may begin with S31, which involves preparing paired data sets”. Park para. [0063] recites “The paired datasets may include positive pairs and/or negative pairs. The positive pairs may refer to pairs where text samples and image samples are matched, while the negative pairs may refer to pairs where text samples and image samples are not matched” (i.e., a training dataset may include features from in-domain and out-of-domain documents)).
Regarding claim 4, the combination of Park, Chen, and Loo teaches the system of claim 1, wherein the computer-executable code that causes the system to extract the set of features includes executable code that causes the system to: select the common feature to omit from one of either the OOD training document or an in-domain training document (Loo para. [0008] recites “The data analyzer comprises a data removal module for identifying and removing portions of the set of indexed data having insufficient discriminatory power based on the ensemble statistics of the set of indexed data. The data removal module functions to remove portions of the spectra from further processing if such portions do not have sufficient discriminatory power (namely if they cannot help classify the spectrum into an appropriate category). In this regard, the data removal module may comprise a common characteristic removal module which includes means for identifying and removing common characteristics of the set of indexed data based on the ensemble statistics of the set of indexed data”. Chen para. [0037] recites “The pre-trained image encoder may be integrated into an image analysis model and fine-tuned using annotated multi-dimensional medical images to perform a particular downstream vision task. The annotated multi-dimensional medical images used to fine-tune the pre-trained image encoder, and ultimate train the image analysis model to perform the particular vision task, may each include a plurality of image voxels each paired with a corresponding ground-truth label indicating a class the corresponding image voxel belongs to. 
In this way, implementations of the present disclosure are further directed toward executing a supervised training process to train the image segmentation model on the plurality of annotated multi-dimensional medical images to teach the image segmentation model to learn how to predict the corresponding ground-truth labels for the plurality of image voxels for each annotated multi-dimensional medical image, whereby the image segmentation model includes the pre-trained image encoder initialized on the unannotated multi-dimensional images via the self-supervised MIM training process and fine-tuned on the annotated multi-dimensional images via the supervised training process” (i.e., identifying common features between different training data sets, which can be masked, or selected for omission)).
Regarding claim 5, the combination of Park, Chen, and Loo teaches the system of claim 1, wherein the computer-executable code that causes the system to extract the set of the features includes executable code that causes the system to select portions of the set of features at a same location in at least two training forward propagations of a plurality of training forward propagations (Chen para. [0051] recites “The training process 200 further randomly masks the portion of the image patches 210 by using one of a central region masking strategy, a block-wise masking strategy, or a uniformly random masking strategy that uses different masked patch sizes and masking ratios” (i.e., selecting a central region or features at a same location). Park para. [0118] recites “Referring back to FIG. 3, in S39, a determination may be made as to whether termination conditions for training are satisfied. The termination conditions may be established based on factors such as the number of iterations (e.g., epochs), training time, magnitude of losses, and training status of all sample pairs, but the present disclosure is not limited thereto” (i.e., a termination condition based on a number of epochs implies at least two training iterations, or forward propagation steps)).
Regarding claim 6, the combination of Park, Chen, and Loo teaches the system of claim 1, wherein the computer-executable code that causes the system to extract the set of the features includes executable code that causes the system to: identify a particular feature that is repeatedly activated in encoding layers during training of the machine learning model as being associated with one or more classifications; and include the particular feature in the set of features (Park para. [0095] recites “FIG. 11 shows an image sample 111 with an activated feature region (i.e., a feature activated in encoding layers during training) associated with a specific token of "deer," on the left, and an image sample 112 with a hard mask applied thereto, on the right”. Park para. [0096] recites “a soft mask simply suppresses or weakens features (or information) of a specific region (or patch) of the image sample 111 because it masks each feature region based on the level of activation”. Park para. [0111] recites “Referring back to FIG. 12, in S123, a soft mask may be generated based on the attention values extracted in S122 (e.g., by inverting the extracted attention values)”. Park para. [0112] recites “Thereafter, in S124, the soft mask may be applied to the patch features. For example, the embedding system 10 may apply the soft mask to the patch features through an operation such as element-wise multiplication. As a result, only the patch features associated with the specific token may be selectively weakened” (i.e., a particular feature is selected or included to be omitted from a set of features in an input training document)).
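For illustration only, the soft-masking steps quoted from Park para. [0111]-[0112] (generate a mask by inverting attention values, then apply it by element-wise multiplication) can be sketched as follows. This is a minimal sketch under stated assumptions, not Park's implementation: attention values are assumed to lie in [0, 1], "inverting" is assumed to mean `1 - a`, and the function names and sample values are hypothetical.

```python
def soft_mask(attention_values):
    """Invert per-patch attention so strongly attended patches get small weights.

    Assumption: attention values are already normalized to [0, 1].
    """
    return [1.0 - a for a in attention_values]


def apply_soft_mask(patch_features, mask):
    """Scale every element of each patch feature vector by that patch's mask weight."""
    return [[m * x for x in patch] for patch, m in zip(patch_features, mask)]


# Hypothetical attention of one token over three image patches.
attention = [0.0, 0.5, 1.0]
patches = [[2.0, 4.0], [2.0, 4.0], [2.0, 4.0]]

masked = apply_soft_mask(patches, soft_mask(attention))
print(masked)  # unattended patch unchanged, fully attended patch suppressed
```

Unlike a hard mask, the partially attended patch is only weakened rather than removed, which matches the distinction drawn in Park para. [0096].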
Regarding claim 7, the combination of Park, Chen, and Loo teaches the system of claim 1, wherein the computer-executable code that causes the system to produce the trained machine learning model includes executable code that causes the system to train the machine learning model using one or both of: a confidence measure associated with information used to predict whether an input document is out-of-domain, or a distance metric associated with the information (Park para. [0087] recites “the embedding system 10 may input an embedding/feature 95 corresponding to a CLS (i.e., classification) token, among the joint embeddings 94, into a prediction layer 91 to generate a matching score for the two samples (i.e., to predict the matching status of the two samples). The prediction layer 91 may be implemented as a binary classification layer (e.g., a fully-connected layer) performing binary classification for matching status (i.e., a matching or nonmatching class), and a confidence score for the matching class or a processed value thereof may serve as the matching score” (i.e., using a confidence measure associated with predicting whether documents are non-matching, or out-of-domain)).
Claim 8 is a method claim and its limitation is included in claim 1. The only difference is that claim 8 requires a method (Chen para. [0004] recites “One aspect of the disclosure provides a computer implemented method that when executed on data processing hardware causes the data processing hardware to perform operations that include obtaining a first training data set including a plurality of unannotated multi-dimensional medical images and executing a self-supervised masked image modeling (MIM) training process to pre-train an image encoder on the first training data set”). Therefore, claim 8 is rejected for the same reasons as claim 1.
Regarding claim 9, the combination of Park, Chen, and Loo teaches the computer-implemented method of claim 8, wherein selecting the set of the features of training data is performed based, at least in part, on using a pseudorandom process (Chen para. [0034] recites “There are a number of random masking techniques, including but not limited to, a central region masking strategy, a complex block-wise masking strategy, and/or a uniformly random masking method at patch level using different masked patch sizes and masking ratios” (i.e., randomly masking features of a training document)).
Regarding claim 10, the combination of Park, Chen, and Loo teaches the computer-implemented method of claim 8, wherein selecting the set of the features includes: obtaining a constraint that specifies a size or number of portions of the set of features to be omitted; and omitting the set of features in accordance with the constraint (Chen para. [0034] recites “There are a number of random masking techniques, including but not limited to, a central region masking strategy, a complex block-wise masking strategy, and/or a uniformly random masking method at patch level using different masked patch sizes and masking ratios” (i.e., using a block-wise masking strategy would impose a constraint on the size of the features being omitted)).
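For illustration only, a uniformly random patch-level masking with a masking ratio, as named in the quoted Chen para. [0034], can be sketched as follows. This is a minimal sketch and not Chen's implementation: the function name `random_patch_mask`, the `seed` parameter, and the example patch count are hypothetical, and the masking ratio is treated here as the constraint on how many patches are omitted.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed):
    """Return a boolean list where True marks a patch selected for masking.

    The masking ratio fixes how many patches are omitted; a seeded
    pseudorandom generator decides which ones.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    rng = random.Random(seed)  # seeded, so the selection is reproducible
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


# Hypothetical 4x4 grid of patches with a quarter of them masked.
mask = random_patch_mask(num_patches=16, mask_ratio=0.25, seed=0)
print(sum(mask), "of", len(mask), "patches masked")
```

The count of masked patches is determined by the ratio alone, while the positions depend on the pseudorandom draw.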
Regarding claim 11, the combination of Park, Chen, and Loo teaches the computer-implemented method of claim 8, further comprising: receiving a document as an input to the trained machine learning model; receiving a classification of the document as an output of the trained machine learning model; and determining, based at least in part on the classification, that the document is OOD (Park para. [0062] recites “Referring to FIG. 3, the multimodal embedding method according to some embodiments of the present disclosure may begin with S31, which involves preparing paired data sets”. Park para. [0063] recites “The paired datasets may include positive pairs and/or negative pairs. The positive pairs may refer to pairs where text samples and image samples are matched, while the negative pairs may refer to pairs where text samples and image samples are not matched”. Park para. [0087] recites “the embedding system 10 may input an embedding/feature 95 corresponding to a CLS (i.e., classification) token, among the joint embeddings 94, into a prediction layer 91 to generate a matching score for the two samples (i.e., to predict the matching status of the two samples). The prediction layer 91 may be implemented as a binary classification layer (e.g., a fully-connected layer) performing binary classification for matching status (i.e., a matching or nonmatching class), and a confidence score for the matching class or a processed value thereof may serve as the matching score” (i.e., receiving input documents, classifying those documents with a machine learning model, and determining whether the documents match, or are in-domain, or do not match, or are out-of-domain)).
Regarding claim 12, the combination of Park, Chen, and Loo teaches the computer-implemented method of claim 8, wherein training the machine learning model includes generating a threshold of confidence measures associated with a plurality of training documents used in training the machine learning model (Park para. [0087] recites “the embedding system 10 may input an embedding/feature 95 corresponding to a CLS (i.e., classification) token, among the joint embeddings 94, into a prediction layer 91 to generate a matching score for the two samples (i.e., to predict the matching status of the two samples). The prediction layer 91 may be implemented as a binary classification layer (e.g., a fully-connected layer) performing binary classification for matching status (i.e., a matching or nonmatching class), and a confidence score for the matching class or a processed value thereof may serve as the matching score” (i.e., using a confidence measure)).
Regarding claim 14, the combination of Park, Chen, and Loo teaches the computer-implemented method of claim 8, wherein a training document, from which the set of features are extracted, includes at least one of: plaintext data, image data, or layout data (Chen para. [0036] recites “Implementations herein are directed toward executing a self-supervised masked image modeling (MIM) training process to pre-train an image encoder on a plurality of unannotated (e.g., unlabeled) multi-dimensional medical images” (i.e., features can be extracted from at least image data)).
Claim 15 is a non-transitory computer readable medium claim and its limitation is included in claim 1. The only difference is that claim 15 requires a non-transitory computer readable medium (Chen para. [0072] recites “The non-transitory memory may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by a computing device”). Therefore, claim 15 is rejected for the same reasons as claim 1.
Claim 16 is a non-transitory computer readable medium claim and its limitation is included in claim 14. Claim 16 is rejected for the same reasons as claim 14.
Regarding claim 17, the combination of Park, Chen, and Loo teaches the non-transitory computer-readable storage medium of claim 15, wherein the computer-executable instructions that cause the computer system to select the set of features include executable instructions that cause the computer system to determine which features of the set of features to omit by causing the computer system to at least: select the features according to a pseudorandom process, select the features from a same location in the training document and in other training documents that are used for training the machine learning model, or select the features that are associated with one or more weights by the machine learning model based at least in part on identifying one or more features that are activated and associated with one or more classifications, the one or more features being activated at a frequency that exceeds a threshold (Chen para. [0034] recites “There are a number of random masking techniques, including but not limited to, a central region masking strategy, a complex block-wise masking strategy, and/or a uniformly random masking method at patch level using different masked patch sizes and masking ratios” (i.e., randomly masking features of a training document)).
Regarding claim 18, the combination of Park, Chen, and Loo teaches the non-transitory computer-readable storage medium of claim 15, wherein the computer-executable instructions that cause the computer system to select the set of features include executable instructions that cause the computer system to determine the features to omit by causing the computer system to: select a first feature of the training document that is associated with a weight (Park para. [0132] recites “the embedding system 10 may calculate ITC loss (e.g., a type of focal loss), by reflecting a focal weight in the feature similarity 159. The focal weight may be determined to be smaller for a greater feature similarity 159 and larger for a smaller feature similarity 159. In this manner, a smaller weight may be assigned to easier sample pairs and a greater weight may be assigned to more challenging sample pairs” (i.e., a training document feature can be associated with a weight)); select a second feature of the training document pseudorandomly (Chen para. [0034] recites “There are a number of random masking techniques, including but not limited to, a central region masking strategy, a complex block-wise masking strategy, and/or a uniformly random masking method at patch level using different masked patch sizes and masking ratios” (i.e., randomly masking features of a training document)); and the features to omit include the first and second features (Chen para. [0051] recites “FIG. 2A shows the MIM training process 200 training the image encoder 150 having the MAE architecture by randomly masking a portion of the image patches 210 divided from a corresponding unannotated MD medical image 202. The training process 200 further randomly masks the portion of the image patches 210 by using one of a central region masking strategy, a block-wise masking strategy, or a uniformly random masking strategy that uses different masked patch sizes and masking ratios. The training process further generates, using an image tokenizer 230 configured to receive the unannotated MD medical image 202 as input, a sequence of discrete visual tokens 240 that characterize the corresponding unannotated MD medical image 202” (i.e., features to be omitted can include the randomly selected features; one of ordinary skill in the art would understand that the weighted features from Park could also be selected in this process)).
Regarding claim 19, the combination of Park, Chen, and Loo teaches the non-transitory computer-readable storage medium of claim 15, wherein the set of features is omitted from the at least the one machine learning training forward propagation by using a computer-generated shape to obfuscate the set of features within the computer-generated shape (Chen figs. 3-4 illustrate examples of training data samples that are masked by obfuscating features of the sample with a computer-generated shape).
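For illustration only, the following minimal sketch shows one way a uniformly random patch-level masking strategy of the kind described above could obfuscate features inside computer-generated square shapes. It is not taken from Chen or any other reference of record; the function name mask_patches, its parameters (patch, ratio, seed, fill), and the 8x8 sample image are assumptions introduced solely for this example.

```python
import random


def mask_patches(image, patch, ratio, seed=0, fill=0):
    """Return a masked copy of `image` and the corners of the masked patches.

    `image` is a list of equal-length rows. A fraction `ratio` of the
    non-overlapping patch x patch squares is chosen pseudorandomly and
    every value inside each chosen square is overwritten with `fill`.
    All names here are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    rows, cols = len(image), len(image[0])
    # Top-left corner of every non-overlapping patch.
    corners = [(r, c) for r in range(0, rows, patch)
               for c in range(0, cols, patch)]
    chosen = rng.sample(corners, int(len(corners) * ratio))
    masked = [row[:] for row in image]  # leave the input untouched
    for r0, c0 in chosen:
        for r in range(r0, min(r0 + patch, rows)):
            for c in range(c0, min(c0 + patch, cols)):
                masked[r][c] = fill
    return masked, sorted(chosen)


# An 8x8 image of ones split into sixteen 2x2 patches, a quarter of them masked.
image = [[1] * 8 for _ in range(8)]
masked, chosen = mask_patches(image, patch=2, ratio=0.25)
```

In this sketch the masked values are simply replaced by a fill value; an actual training process might instead drop the corresponding tokens or substitute a learned mask token.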
Claim 20 is a non-transitory computer readable medium claim and its limitation is included in claim 6. Claim 20 is rejected for the same reasons as claim 6.
Regarding claim 22, the combination of Park, Chen, and Loo teaches the system of claim 1, wherein the computer-executable code that, as a result of execution by the one or more processors, further causes the system to translate at least some of the set of features into dense vector embeddings that are used to train the machine learning model (Park para. [0050] recites “the multimodal embedding system 10 may train the deep learning model 11 by utilizing paired datasets 12 and may embed multimodal data using the trained deep learning model 11. The deep learning model 11 may also be referred to as an "embedding model" or "multimodal embedding model". For the convenience of explanation, the multimodal embedding system 10 will hereinafter be abbreviated as the "embedding system 10"”. Park para. [0079] recites “The embedding layer 71 may refer to a layer that receives each of multiple tokens 63 (e.g., receives the one-hot vectors of the respective tokens 63) and outputs token-level embeddings 75”. Park para. [0082] recites “Referring back to FIG. 3, in S34, token features and patch features may be input (fed) into a multimodal encoder to generate joint embeddings. Here, the multimodal encoder refers to a module that encodes text-related features (i.e., token features) and image-related features (i.e., patch features) together to generate joint embeddings” (i.e., using embedding vectors created from extracted features to train the model)).
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Park et al (US 20240290065 A1, herein Park) in view of Chen et al (US 20230410483 A1, herein Chen), in further view of Loo et al (US 20070009160 A1, herein Loo), in further view of Rodriguez-Serrano et al (US 20140355835 A1, herein Rodriguez-Serrano).
Regarding claim 13, the combination of Park, Chen, and Loo teaches the computer-implemented method of claim 8.
However, the combination of Park, Chen, and Loo does not teach wherein the training the machine learning model includes generating a distance metric using a Mahalanobis distance algorithm.
Rodriguez-Serrano teaches wherein the training the machine learning model includes generating a distance metric using a Mahalanobis distance algorithm (para. [0101] recites “a DxE matrix W is used to project the text representation e(t) into a common space of image representations (or vice versa with an ExD matrix), where D represents the number of elements in each text string representation and E represents the number of elements in each image representation”. Para. [0102] recites “This is strictly equivalent to projecting the image embedding x(I) in the space of text embeddings and then using the dot-product between x(I)^T·W and e(t)”. Para. [0104] recites “It will be appreciated that while the dot product is used herein as the similarity measure, any similarity measure suited to computing the similarity between the representations can be used. For example, the Manhattan distance, KL divergence, the Hellinger (HE) divergence, the Renyi divergence, the Euclidean distance, the Mahalanobis distance, the L1 distance, or the chi-squared similarity measure can be used” (i.e., computing a Mahalanobis distance)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by supplementing the confidence score calculation from Park (as modified by Chen and Loo) with the distance metric calculation from Rodriguez-Serrano. Rodriguez-Serrano states in at least paragraph [0105] that its calculated similarity metric, which includes the Mahalanobis distance metric, can be used as a confidence measure or otherwise processed to compute a confidence measure. Accordingly, one of ordinary skill in the art would understand how to modify the confidence score calculation from Park to incorporate the distance metric from Rodriguez-Serrano.
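For illustration only, the Mahalanobis distance referenced above is conventionally defined as the square root of (x - m)^T S^-1 (x - m) for a vector x, a mean m, and a covariance matrix S. The following minimal sketch computes that quantity using only the Python standard library. It is not drawn from Rodriguez-Serrano or Park; the helper names solve and mahalanobis and the sample values are assumptions introduced for this example.

```python
import math


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mahalanobis(x, mean, cov):
    """Return sqrt((x - mean)^T cov^-1 (x - mean))."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Solving cov @ y = diff avoids forming the inverse explicitly.
    y = solve(cov, diff)
    return math.sqrt(sum(d * yi for d, yi in zip(diff, y)))


# With an identity covariance the result reduces to the Euclidean distance.
d_identity = mahalanobis([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# A larger variance along the first axis shrinks distances along that axis.
d_scaled = mahalanobis([2.0, 0.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
```

How such a distance would be converted into a confidence measure is left open here, since the cited passages do not specify a particular mapping.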
Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over Park et al (US 20240290065 A1, herein Park) in view of Chen et al (US 20230410483 A1, herein Chen), in further view of Loo et al (US 20070009160 A1, herein Loo), in further view of Oberoi et al (US 20220253472 A1, herein Oberoi).
Regarding claim 21, the combination of Park, Chen, and Loo teaches the system of claim 1.
However, while the combination of Park, Chen, and Loo teaches an attention masking technique (see at least paragraphs [0101]-[0111] of Park) and Loo teaches masking a common feature (see at least paragraphs [0006]-[0008]), the combination of Park, Chen, and Loo does not explicitly teach wherein the attention masking comprises adding one or more padded tokens to the feature.
Oberoi teaches wherein the attention masking comprises adding one or more padded tokens to the feature (para. [0044] recites “Tokenize each block using the RoBERTa tokenizer to return IDs for each token in the text and attention masks that shows the model where padding was added to ensure uniform input size, regardless of text length”. Para. [0062] recites “the text processing system initiates the axiom recommendation process by parsing the incoming text block, step 410, into plain text and relevant metadata, such as paragraph type and location, in addition to other data that describes features of the block of input text”. Para. [0064] recites “The tokenizer may truncate or pad input to a maximum token length, e.g., 117 tokens, and returns an attention mask to prevent model attention on the padded indices, step 430” (i.e., adding one or more padded tokens to a feature during an attention masking operation)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by applying the token padding method from Oberoi to the attention masking operation taught by Park (as modified by Chen and Loo). Park and Oberoi are both directed to systems which apply attention masking on features of training data. While Park does not explicitly teach adding padded tokens to features as part of its attention masking technique, one of ordinary skill in the art would understand how the similar attention masking technique from Oberoi, which does include adding padded tokens to features, could modify the technique from Park.
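For illustration only, the following minimal sketch shows the general pattern described in the Oberoi passages quoted above: a token sequence is truncated or padded to a fixed length, and an attention mask marks which positions hold real tokens (1) and which hold padding (0). It does not reproduce the RoBERTa tokenizer or any code from Oberoi or Park; the function name pad_and_mask, the padding id PAD_ID = 0, and the sample token ids are assumptions introduced for this example.

```python
PAD_ID = 0  # hypothetical padding token id, chosen for illustration


def pad_and_mask(token_ids, max_len, pad_id=PAD_ID):
    """Truncate or pad `token_ids` to `max_len` and build an attention mask.

    Returns (ids, mask) where mask[i] is 1 for a real token and 0 for a
    padded position that the model should not attend to.
    """
    ids = list(token_ids)[:max_len]      # truncate if too long
    n_pad = max_len - len(ids)           # number of padded tokens to add
    mask = [1] * len(ids) + [0] * n_pad  # zeros cover the padded indices
    return ids + [pad_id] * n_pad, mask


# A three-token input padded to length five.
ids, mask = pad_and_mask([5, 6, 7], max_len=5)
```

In a transformer, positions whose mask value is 0 are typically excluded from the attention computation, for example by adding a large negative value to their attention scores before the softmax.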
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
“Multi-layer Learnable Attention Mask for Multimodal Tasks” (Barrios et al) teaches a Learnable Attention Mask (LAM) to globally regulate attention maps, prioritize critical tokens within the sequence, and capture associations between tokens.
“HINT: High-Quality INpainting Transformer With Mask-Aware Encoding and Enhanced Attention” (Chen et al) teaches a mask-aware pixel-shuffle down-sampling module (MPD) to preserve visible information extracted from a corrupted image and a Spatially-activated Channel Attention Layer (SCAL) self-attention mechanism to model the corrupted image at multiple scales.
“DAN: A Dual Adversarial Domain Adaption Network for Unsupervised Non-overlapping Cross-domain Recommendation” (Guo et al) teaches a domain adaption-based method to eliminate domain-specific features in the common feature space by using a dual generative adversarial network with a multi-target adversarial loss to model each domain separately.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEAH M FEITL whose telephone number is (571) 272-8350. The examiner can normally be reached on M-F 0900-1700 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo, can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/L.M.F./ Examiner, Art Unit 2147
/JAMES T TSAI/ Primary Examiner, Art Unit 2147