Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 12/29/2025 have been fully considered but they are not persuasive.
Regarding Applicant's arguments for 35 U.S.C. § 101 on pages 8-11, Applicant argued “Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The rejection is moot with respect to canceled claims 4-6 and 13-15. To the extent the Examiner believes the rejection applies to the amended claims, the Applicant traverses this rejection as follows. …
It may be seen that, in the fields of AI processing on the image and text data, the above steps may achieve the following technical effects: (1) by dividing a large data amount of the image/text modal data into finer grained image blocks/text symbols, the data processing load of the single semantic encoding process is significantly reduced, the efficiency of semantic encoding processing is effectively improved, the semantic encoding performance of the multimodal data is thus enhanced, and the accuracy of semantic representation is improved; (2) it is possible to achieve joint semantic representation learning of nonaligned image-text data based on the grounded dictionary, thereby enabling computers to effectively utilize and process the large-scale nonaligned image-text data.
In summary, the present application solves the issue of the prior art and therefore describes a specific solution to a technological problem, and the claims focus on specific improvements to the fields of artificial intelligence, deep learning, and natural language processing. The fact that none of the prior art documents teaches or suggests the subject matter of the amended independent claims also supports the position that the present claims recite a technological improvement. See MPEP § 2106.05(I)(A)(v). Applicant respectfully requests withdrawal of the § 101 rejection.” Applicant argues that the amended claim limitations solve a technical problem and that the amendments to the independent claims render the § 101 rejection moot. However, the arguments are directed to limitations newly added by amendment, which are addressed for the first time in the updated § 101 rejection below. The arguments are therefore moot in view of the new grounds of rejection and are not persuasive. See the updated § 101 rejection below.
Regarding Applicant's arguments for 35 U.S.C. § 103 on pages 12-19, Applicant argues “Amended independent claim 1 recites: … Corresponding features also may be found at independent claims 10 and 19. Applicants respectfully submits that none of the cited documents teach or suggest the above features 1-3 in amended claim 1.” Applicant amended independent claims 1, 10 and 19 and canceled claims 4-6 and 13-15. Applicant argues on page 15 “In a word, Lin fails to disclose or teach the above feature 1 "dividing the first modal data into a plurality of image blocks, wherein the plurality of image blocks comprise pieces of image pixel information, inputting the pieces of image pixel information to a visual transformer, encoding the pieces of image pixel information by a multi-layer attention mechanism of the visual transformer to obtain a plurality of image block tokens, and determining the plurality of image block tokens as a first token" in amended claim 1.” However, this is an amended limitation. Applicant further argues on page 15 “Therefore, Li fails to teach or suggest the above feature 2 "parsing the first token and the second token to obtain a grounded token ID, determining a grounded token matching the grounded token ID from a grounded dictionary as an initial grounded token, wherein the grounded dictionary comprises grounded token IDs and grounded tokens matching the grounded token IDs; obtaining an associated token by fusing and encoding the first token, the second token and the initial grounded token, wherein the associated token is a token having a similarity that satisfies a preset condition between the first token and the second token" in amended claim 1.” This limitation is likewise newly amended into claim 1. Applicant also argues on pages 18 and 19 “As reproduced below, Lin discloses that the token vectors are generated by the model Word2Vec, and the segmentation tensor for the image is generated, but fails to teach how to determine the shared token between the image modal data and the text modal data. Therefore, Lin fails to disclose or teach the above feature 3 "recognizing a target shared token between the first modal data and the second modal data based on the first token, the second token and the associated token" in amended claim 1.” The limitations identified as features 1, 2, and 3 are newly added by amendment and are addressed for the first time in the updated rejections below. The arguments directed to features 1, 2, and 3 are therefore moot in view of the new grounds of rejection and are not persuasive.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claims do not recite additional elements that amount to significantly more than the abstract idea. The subject matter eligibility analysis for products and processes is described below for claim 1 in view of the dependent claims.
Regarding claim 1:
Step 1: Is the claim to a process, machine, manufacture, or composition of matter?
Yes – Claim 1 recites a method, which falls within the statutory category of a process.
Step 2A Prong 1: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes – The claim recites the following:
“and determining the plurality of image block tokens as a first token;” - This limitation recites a mental process of determining the plurality of image block tokens as the first token (see MPEP 2106.04(a)(2)III).
“determining the plurality of text tokens as a second token;” - This limitation recites a mental process of determining the plurality of text tokens as the second token (see MPEP 2106.04(a)(2)III).
“determining a grounded token matching the grounded token ID from a grounded dictionary as an initial grounded token, wherein the grounded dictionary comprises grounded token IDs and grounded tokens matching the grounded token IDs;” - This limitation recites a mental process of determining a grounded token that matches the grounded token ID (see MPEP 2106.04(a)(2)III).
“obtaining an associated token by fusing and encoding the first token, [[and]] the second token, and the initial grounded token, wherein the associated token is a token having a similarity that satisfies a preset condition between the first token and the second token;” - This limitation recites a mental process of obtaining an associated token that has a similarity that satisfies a preset condition (see MPEP 2106.04(a)(2)III).
“and recognizing a target shared token between the first modal data and the second modal data based on the first token, the second token and the associated token.” - This limitation recites a mental process of recognizing a target shared token (see MPEP 2106.04(a)(2)III).
Step 2A Prong 2: Does the claim recite additional elements that integrate the judicial exception into a particular application? No –
The claim includes the additional element(s):
“A method for recognizing a token, performed by an electronic device, comprising:”
The additional elements fall under “apply it” as using a generic computer to implement a method to recognize a token. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).
“obtaining first modal data and second modal data, wherein the first modal is an image modal and the second modal is a text modal;”
The additional elements fall under Insignificant Extra-Solution Activity as mere data gathering by obtaining data for the modals. See MPEP 2106.05(g).
“dividing the first modal data into a plurality of image blocks, wherein the plurality of image blocks comprise pieces of image pixel information, inputting the pieces of image pixel information to a visual transformer, encoding the pieces of image pixel information by a multi- layer attention mechanism of the visual transformer to obtain a plurality of image block tokens,”
The additional elements fall under “apply it” as using a generic computer to divide the first modal data into blocks and using a transformer to produce a token. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).
“dividing the second modal data into a plurality of text symbols, inputting the plurality of text symbols into a text transformer, encoding the plurality of text symbols by a multi-layer attention mechanism of the text transformer to obtain a plurality of text tokens,”
The additional elements fall under “apply it” as using a generic computer to divide the second modal data into a plurality of text symbols and using a transformer to encode them to produce text tokens. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).
“parsing the first token and the second token to obtain a grounded token ID,”
The additional elements fall under “apply it” as using a generic computer to parse the first and second token to obtain a grounded token ID. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
No - The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As an ordered whole, the claim is directed to determining an association between modal tokens. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of dividing, encoding, and parsing amount to using a generic computer to apply the exception and to mere data gathering. The method does not improve the functioning of a computer, does not transform an article into a different state or thing, and is not applied with any particular machine, making the claim not patent eligible.
Regarding claim 2:
Step 2A Prong 1:
“recognizing the target shared token between the first modal data and the second modal data based on the first target token, the second target token, and the associated token.” – This limitation recites a mental process of recognizing the target shared token between the modal data (see MPEP 2106.04(a)(2)III).
Step 2A Prong 2, Step 2B: The additional element(s):
“wherein recognizing the target shared token between the first modal data and the second modal data comprises:
obtaining a first target token by processing the first token based on the associated token;
obtaining a second target token by processing the second token based on the associated token;”
The additional elements fall under Insignificant Extra-Solution Activity as mere data gathering by obtaining the target tokens based on the associated token. See MPEP 2106.05(g). The additional elements do not integrate the judicial exception into a practical application and do not provide an improvement, an inventive concept, or a practical application.
Regarding claim 3:
Step 2A Prong 1:
“wherein obtaining the first target token comprises: aligning the associated token and the first token, and determining the aligned first token as the first target token; and obtaining the second target token comprises: aligning the associated token and the second token, and determining the aligned second token as the second target token” – This limitation recites a mental process of aligning the tokens and determining the aligned tokens as the target tokens (see MPEP 2106.04(a)(2)III).
Step 2A Prong 2, Step 2B: The additional element(s):
No additional elements are recited. The judicial exception is not integrated into a practical application and no improvement is provided. The claim does not provide an inventive concept or a practical application.
Regarding claim 7:
Step 2A Prong 1:
“The method of claim [[6]] 1, wherein determining the initial grounded token comprises: determining cluster description information between the first token and the second token;”
This limitation recites a mental process of determining the cluster description information (see MPEP 2106.04(a)(2)III).
“determining a grounded token matching the cluster description information from a grounded dictionary as the initial grounded token;”
This limitation recites a mental process of determining a grounded token matching information in the grounded dictionary (see MPEP 2106.04(a)(2)III).
Step 2A Prong 2, Step 2B: The additional element(s):
“the grounded dictionary comprises: pieces of cluster description information and grounded tokens matching the pieces of cluster description information.”
The additional elements fall under Insignificant Extra-Solution Activity. See MPEP 2106.05(g).
The additional elements do not integrate the judicial exception into a practical application and do not provide an improvement, an inventive concept, or a practical application.
Regarding claim 8:
Step 2A Prong 1:
“and determining the similarity information as the cluster description information;”
This limitation recites a mental process of determining the cluster description information (see MPEP 2106.04(a)(2)III).
Step 2A Prong 2, Step 2B: The additional element(s):
“The method of claim 7, wherein determining the cluster description information between the first token and the second token, comprises: obtaining similarity information between a target image block token and a target text token,”
The additional elements fall under Insignificant Extra-Solution Activity as mere data gathering by obtaining similarity information between the image and text tokens. See MPEP 2106.05(g).
The additional elements do not integrate the judicial exception into a practical application and do not provide an improvement, an inventive concept, or a practical application.
“wherein the target image block token belongs to the plurality of image block tokens, the target text token belongs to the plurality of text tokens, and the target image block token and the target text token belong to the same data category obtained by clustering.”
The additional elements fall under Insignificant Extra-Solution Activity. See MPEP 2106.05(g).
The additional elements do not integrate the judicial exception into a practical application and do not provide an improvement, an inventive concept, or a practical application.
Regarding claim 9:
Step 2A Prong 1:
“determining fusion weight information based on the similarity information; and”
This limitation recites a mental process of determining a fusion weight based on the similarity information (see MPEP 2106.04(a)(2)III).
Step 2A Prong 2, Step 2B: The additional element(s):
“The method of claim 8, wherein obtaining the associated token by fusing and encoding the first token, the second token and the initial grounded token, comprises: obtaining the associated token by fusing and encoding the first token, the second token and the initial grounded token based on the fusion weight information.”
The additional elements fall under “apply it” as using a generic computer to obtain the associated token by fusing and encoding. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).
The additional elements do not integrate the judicial exception into a practical application and do not provide an improvement, an inventive concept, or a practical application.
Claims 10-18 recite a system and are analogous to the method of claims 1-9. Therefore, the rejections of claims 1-9 above apply to claims 10-18.
Claims 19-20 recite a computer readable medium product and are analogous to the method of claims 1-2. Therefore, the rejections of claims 1-2 above apply to claims 19-20.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 10, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Hu, Ronghang, and Amanpreet Singh. "Unit: Multimodal multitask learning with a unified transformer." Proceedings of the IEEE/CVF international conference on computer vision. 2021 (“Hu”) in view of Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu, Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning (2021) (“Huang”).
Regarding claim 1 and analogous claims 10 and 19, Hu teaches a method for recognizing a token, performed by an electronic device, comprising:
obtaining first modal data and second modal data, wherein the first modal is an image modal and the second modal is a text modal (Hu Page 3. 3. UniT: Unified Transformer across domains In this work, we jointly learn multiple tasks across different modalities with a unified single model. Our model, UniT, is built upon the transformer encoder-decoder architecture [59, 5], consisting of separate encoders for each input modality type followed by a decoder (per-task or shared) with simple task-specific heads. Figure 2 shows an overview of UniT.
We consider two input modalities: images and text. For our transformer-based encoder on image inputs, inspired by [5], we first apply a convolutional neural network backbone to extract a visual feature map, which is further encoded by a transformer encoder into a list of hidden states to incorporate global contextual information. For language inputs, we use BERT [14], specifically the 12-layer uncased version, to encode the input words (e.g. questions) into a sequence of hidden states from BERT’s last layer. After encoding input modalities into hidden state sequences, we apply the transformer decoder on either a single encoded modality or the concatenated sequence of both encoded modalities, depending on whether the task is uni-modal (i.e. vision-only or language-only) or multimodal. We explore either having separate (i.e. task-specific) or shared decoders among all tasks. Finally, the representation from the transformer decoder is passed to a task-specific head such as a simple twolayer classifier, which outputs the final predictions. Given the simplicity of UniT, it can be extended easily to more modalities and inputs [obtaining first modal data and second modal data, wherein the first modal is an image modal and the second modal is a text modal].
Page 5, 3.5 Training
We jointly train UniT on multiple tasks. At each iteration during training, we randomly select a task and a dataset to fill a batch of samples. We manually specify a sampling probability for each task based on the dataset size and empirical evidence. In our implementation, we train with a batch size of 64 on 64 Nvidia Volta V100-SXM2-32GB GPUs (batch size 1 per GPU) in a distributed fashion, using PyTorch [41] [A method for recognizing a token, performed by an electronic device].);
dividing the first modal data into a plurality of image blocks, wherein the plurality of image blocks comprise pieces of image pixel information, inputting the pieces of image pixel information to a visual transformer, encoding the pieces of image pixel information by a multi-layer attention mechanism of the visual transformer to obtain a plurality of image block tokens, and determining the plurality of image block tokens as a first token (Hu Page 3 Figure 2 Image Encoder
[Figure 2 of Hu (image encoder) reproduced]
[inputting the pieces of image pixel information to a visual transformer,]
Page 3 and 4, 3.1. Image Encoder,
The vision-only tasks (such as object detection) and vision-and-language tasks (such as visual question answering and visual entailment) require perceiving and understanding an image I as input. In our model, we encode the input image I with a convolutional neural network followed by a transformer encoder, into a list of encoded visual hidden states
h^v = {h^v_1, h^v_2, …, h^v_L}.
Our image encoding process is inspired by DETR [5]. First, a convolutional neural network backbone B is applied on the input image to extract a visual feature map xv of size
[equation for the visual feature map dimensions reproduced from Hu]
In our implementation, the backbone network B follows the structure of ResNet-50 [19] with dilation [66] applied to its last C5 block, and is pretrained on object detection in [5] [dividing the first modal data into a plurality of image blocks, wherein the plurality of image blocks comprise pieces of image pixel information,].
We apply a visual transformer encoder E^v with N^v layers and hidden size d^v_e on top of the feature map x^v to further encode it to visual hidden states h^v of size L x d^v_e (where L = H^v x W^v is the length of the encoded visual hidden states). In addition, given that different tasks (such as object detection and VQA) might require extracting different types of information, we also add a task embedding vector w^v_task into the transformer encoder to allow it to extract task-specific information in its output as follows.
[visual encoder equation reproduced from Hu]
P_b→e is a linear projection from visual feature dimension d^v_b to encoder hidden size d^v_e. The structure of the visual transformer encoder E^v follows DETR [5], where positional encoding is added to the feature map. The task token w_task is a learned parameter of dimension d^v_e, which is concatenated to the beginning of the flattened visual feature list P_b→e(x^v) and stripped from the output hidden states h^v [encoding the pieces of image pixel information by a multi-layer attention mechanism of the visual transformer to obtain a plurality of image block tokens, and determining the plurality of image block tokens as a first token;]);
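For illustration only, the following is a minimal, hypothetical Python sketch of the patch-encoding step discussed above: an image is divided into blocks of raw pixel information and the block sequence is encoded with a multi-layer self-attention encoder to obtain image block tokens (the claimed first token). The sketch is not taken from Hu or from the claims, and all names and parameter values are assumptions.

import torch
import torch.nn as nn

def encode_image_blocks(image, patch=16, d_model=256, n_layers=4, n_heads=8):
    # image: (3, H, W) tensor with H and W divisible by `patch`
    c, h, w = image.shape
    blocks = image.unfold(1, patch, patch).unfold(2, patch, patch)         # (3, H/p, W/p, p, p)
    blocks = blocks.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)  # pieces of pixel information
    project = nn.Linear(c * patch * patch, d_model)                        # embed each image block
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=n_layers)            # multi-layer attention
    tokens = encoder(project(blocks).unsqueeze(0)).squeeze(0)              # (num_blocks, d_model)
    return tokens                                                          # first token = image block tokens

first_token = encode_image_blocks(torch.rand(3, 224, 224))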
dividing the second modal data into a plurality of text symbols, inputting the plurality of text symbols into a text transformer, encoding the plurality of text symbols by a multi-layer attention mechanism of the text transformer to obtain a plurality of text tokens, and determining the plurality of text tokens as a second token (Hu Page 3 Figure 2. Text encoder
[Figure 2 of Hu (text encoder) reproduced]
[inputting the plurality of text symbols into a text transformer,]
Page 4 3.2. Text Encoder,
GLUE benchmark [60] tasks such as QNLI [46], MNLI [62], QQP [24], and SST-2 [51] as well as the joint vision-and-language reasoning tasks such as VQA and visual entailment provide a textual input. We encode the textual input using BERT [14] – a transformer encoder model pretrained on large corpora with masked language modeling and next sentence prediction tasks.
Given the input text (e.g. a sentence or a pair of sentences), we tokenize it in the same way as in BERT into a sequence of S tokens
{w_1, …, w_S}, with w_1 = [CLS] (the special pooling token in BERT for classification) [dividing the second modal data into a plurality of text symbols,]. The token sequence is then used as input to a pretrained BERT model to extract a sequence of textual hidden states h^t of size S x d^t_e, where d^t_e is the BERT hidden size. Similar to the image encoder, in the text encoder, we also add a learned task embedding vector w^t_task as part of the BERT input by prefixing it at the beginning of the embedded token sequence, and later stripping it from the output text hidden states as follows.
[text encoder equation reproduced from Hu]
However, we find that it works nearly equally well in practice to keep only the hidden vector corresponding to [CLS] in h^t as input to the decoder (which saves computation).
In our implementation, we use a pretrained BERT-base uncased model from the Huggingface's Transformers library [63], which has d^t_e = 768 and N^t = 12 layers [encoding the plurality of text symbols by a multi-layer attention mechanism of the text transformer to obtain a plurality of text tokens, and determining the plurality of text tokens as a second token]);
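For illustration only, a companion hypothetical Python sketch of the text-side step discussed above: the text modal data is divided into symbols, embedded, and encoded with a multi-layer attention encoder to obtain text tokens (the claimed second token). A whitespace split stands in for BERT-style tokenization; the names and values are assumptions, not Hu's implementation.

import torch
import torch.nn as nn

def encode_text_tokens(text, vocab, d_model=256, n_layers=4, n_heads=8):
    symbols = ["[CLS]"] + text.lower().split()                    # plurality of text symbols
    ids = torch.tensor([[vocab.get(s, 0) for s in symbols]])      # (1, S), 0 = unknown symbol
    embed = nn.Embedding(len(vocab) + 1, d_model)
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=n_layers)   # multi-layer attention
    return encoder(embed(ids)).squeeze(0)                         # second token = S text tokens

vocab = {"[CLS]": 1, "a": 2, "dog": 3, "juggling": 4}
second_token = encode_text_tokens("a dog juggling", vocab)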
However Hu does not explicitly teach parsing the first token and the second token to obtain a grounded token ID, determining a grounded token matching the grounded token ID from a grounded dictionary as an initial grounded token, wherein the grounded dictionary comprises grounded token IDs and grounded tokens matching the grounded token IDs;
obtaining an associated token by fusing and encoding the first token, [[and]] the second token, and the initial grounded token, wherein the associated token is a token having a similarity that satisfies a preset condition between the first token and the second token;
and recognizing a target shared token between the first modal data and the second modal data based on the first token, the second token and the associated token.
However Huang teaches parsing the first token and the second token to obtain a grounded token ID, determining a grounded token matching the grounded token ID from a grounded dictionary as an initial grounded token, wherein the grounded dictionary comprises grounded token IDs and grounded tokens matching the grounded token IDs (Huang Page 3 Figure 2,
[Figure 2 of Huang reproduced]
[parsing the first token and the second token to obtain a grounded token ID]
Page 4 3.2. Visual Dictionary,
The visual feature V extracted by visual feature encoder is more diverse and dense than language word tokens, which will bring difficulty to the learning of cross-modal understanding. To bridge its representation gap from language tokens, we propose a visual dictionary (VD) to tokenize the visual features by aggregating similar visual semantic into the same image feature.
Page 8 4.4. Visualization of Visual Dictionary
To share insights on what the proposed Visual Dictionary (VD) learned, we visualize some representative VD indices in Figure 3. As introduce in Sec 3.2, a VD index is correlated with many visual features, where each visual feature corresponds to an image patch. We randomly sample some indices from VD and visualize their corresponding image patches. As shown in Figure 3, the VD groups meaningful and consistent image patches into different indices, which reflects an abstraction of visual semantics. The visualization shows the strong capability of the learned VD. More cases can be found in supplementary materials.
[determining a grounded token matching the grounded token ID from a grounded dictionary as an initial grounded token,]
Figure 3,
[Figure 3 of Huang reproduced]
[wherein the grounded dictionary comprises grounded token IDs and grounded tokens matching the grounded token IDs;]);
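For illustration only, a minimal, hypothetical Python sketch of the dictionary lookup mapped above (analogous to Huang's visual dictionary): a token is parsed into a grounded token ID by nearest-neighbor matching against the dictionary entries, and the grounded token stored under that ID is returned as the initial grounded token. All names are illustrative assumptions.

import torch

class GroundedDictionary:
    def __init__(self, num_entries=2048, dim=256):
        # row indices serve as grounded token IDs; rows are the matching grounded tokens
        self.grounded_tokens = torch.randn(num_entries, dim)

    def parse(self, token):
        # token: (dim,) -> ID of the closest dictionary entry
        distances = torch.cdist(token.unsqueeze(0), self.grounded_tokens)  # (1, num_entries)
        return distances.argmin().item()

    def lookup(self, grounded_token_id):
        return self.grounded_tokens[grounded_token_id]

dictionary = GroundedDictionary()
grounded_id = dictionary.parse(torch.randn(256))          # grounded token ID
initial_grounded_token = dictionary.lookup(grounded_id)   # matching grounded token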
obtaining an associated token by fusing and encoding the first token, [[and]] the second token, and the initial grounded token, wherein the associated token is a token having a similarity that satisfies a preset condition between the first token and the second token (Huang Page 3, Figure 2,
[Figure 2 of Huang (excerpt) reproduced]
[obtaining an associated token by fusing and encoding the first token, [[and]] the second token, and the initial grounded token,]
Page 4 3.3. Pre-Training Pipeline para 1, We apply a multi-layer Transformer to learn cross-modal representations with the fusion of visual and language features. In order to learn a universal representation for vision and language-related tasks, we apply the self-supervised method to pre-train the model on a large aggregated dataset. We follow the existing works [7, 22, 27, 36, 39, 50] to adopt Masked Language Modeling (MLM) and Image-Text Matching (ITM) pre-training tasks. Besides, we propose a novel Masked Visual Modeling (MVM) pre-training task based on the virtual visual semantic labels produced by the visual dictionary.
Page 5 3.3. Pre-Training Pipeline para 6, Image-Text Matching. To enhance the cross-modal matching, we adopt Image-Text Matching (ITM) task for pretraining as in previous works [7]. We apply a binary classifier
on the joint embedding feature of [CLS] token to predict whether the input image and text are matched or not [wherein the associated token is a token having a similarity that satisfies a preset condition between the first token and the second token;].);
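For illustration only, a minimal, hypothetical Python sketch of the fusion step mapped above: the image block tokens, text tokens, and an initial grounded token are concatenated and fused by a multi-layer attention encoder, and the fused token is treated as an associated token only when the resulting image/text similarity satisfies a preset condition. This is an illustrative sketch, not the pipeline of Huang; the names and the threshold are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_and_associate(first_token, second_token, grounded_token,
                       threshold=0.5, d_model=256, n_layers=2, n_heads=8):
    sequence = torch.cat([first_token, second_token, grounded_token.unsqueeze(0)], dim=0)
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
    fused = fusion(sequence.unsqueeze(0)).squeeze(0)                 # fuse and encode all tokens
    image_part = fused[: first_token.shape[0]].mean(dim=0)
    text_part = fused[first_token.shape[0]:-1].mean(dim=0)
    similarity = F.cosine_similarity(image_part, text_part, dim=0)   # preset similarity condition
    return fused[-1] if similarity > threshold else None             # associated token (or none)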
and recognizing a target shared token between the first modal data and the second modal data based on the first token, the second token and the associated token (Huang Page 1 Figure 1,
[Figure 1 of Huang reproduced]
Page 2, 2. Pre-training for Vision-Language, VLPT works are 1) SOHO adopts a simple VLPT pipeline. Our vision backbone only uses ImageNet pre-trained parameters, and achieves even higher performance than existing VLPT works using VG features on five downstream tasks. 2) SOHO uses the least annotations to achieve SOTA performances. 3) SOHO enriches visual semantics by directly optimizing visual inputs for target language tasks.
Page 4, 3.3. Pre-training Pipeline para 3, Masked Language Modeling. We follow [7] and adopt Masked Language Modeling (MLM) to encourage the model to build the mapping between language tokens and visual contents. The goal of MLM is to predict the masked word tokens based on other word tokensWni and all image features f(V) by minimizing the negative log-likelihood. The learning target can be formulated as:
[MLM loss equation reproduced from Huang]
where D indicate hereinafter the whole training dataset. We adopt the same masking strategy used in BERT [10].
Page 5, Image-Text Matching. To enhance the cross-modal matching, we adopt Image-Text Matching (ITM) task for pretraining as in previous works [7]. We apply a binary classifier
on the joint embedding feature of [CLS] token to predict whether the input image and text are matched or not [and recognizing a target shared token]).
Hu and Huang are considered to be analogous to the claimed invention because they are in the same field of machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Hu in view of Huang to incorporate a dictionary to determine grounded IDs. Doing so would facilitate cross-modal understanding (Huang Abstract line 15-26, In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. In particular, SOHO achieves absolute gains of 2.0% R@1 score on MSCOCO text retrieval 5k test split, 1.5% accuracy on NLVR2 test-P split, 6.7% accuracy on SNLI-VE test split, respectively.).
Regarding claim 3 and analogous claim 12, Hu and Huang teach the method of claim 1 and analogous claims 10 and 19.
Hu and Huang are combined in the same rationale as set forth above with respect to claim 1 and analogous claims 10 and 19.
Hu does not explicitly teach wherein obtaining the first target token comprises: aligning the associated token and the first token, and determining the aligned first token as the first target token; and obtaining the second target token comprises: aligning the associated token and the second token, and determining the aligned second token as the second target token.
However Huang teaches wherein obtaining the first target token comprises: aligning the associated token and the first token, and determining the aligned first token as the first target token; and obtaining the second target token comprises: aligning the associated token and the second token, and determining the aligned second token as the second target token (Huang
Page 1,
[figure from page 1 of Huang reproduced]
Page 6, 4.2.1 Task I: Image-Text Retrieval
Image-text retrieval requires a model to retrieve the most relevant caption from candidate images, or vice versa. It is one of the most typical tasks in the field of vision-language learning which enables a broad range of applications (e.g., image searching). Image-text retrieval includes two subtasks of image-to-text retrieval (TR) [aligning the associated token and the first token] and text-to-image retrieval (IR [aligning the associated token and the second token]). During training, we construct aligned and unaligned pairs inside of a mini-batch like most image-text retrieval models. We randomly sample t aligned image caption pairs from ground truth annotations to form a minibatch. All the other t-1 captions are used as the unaligned captions for each image. To encourage the model to predict the right labels for both the aligned and unaligned pairs … In our implementation, we use the joint embedding representation of the [CLS] token from Transformers to predict whether an image-caption pair is aligned or not. Since the objective of image-text retrieval task is consistent with the image-text matching (ITM) task in pre-training stage, the pre-trained parameters can well be inherited for fine-tuning. We adopt AdamW optimizer with 1e-4 learning rate and 1e-2 weight decay. The mini-batch size t is set to 24. We train 20 epochs until convergence and decay the learning rate by half at 3rd, 5th, 9th and 13th epoch empirically. (i.e. both image and text tokens will become the target tokens based on the alignment process)).
Claim(s) 2, 11, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hu in view of Huang and further in view of Lin Sun, Jiquan Wang, Kai Zhang, Yindu Su, Fangsheng Weng, RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER (2021), (“Sun”).
Regarding claim 2 and analogous claims 11 and 20, Hu and Huang teach the method of claim 1.
Hu and Huang are combined in the same rationale as set forth above with respect to claim 1 and analogous claims 10 and 19.
Hu does not explicitly teach wherein recognizing the target shared token between the first modal data and the second modal data comprises: obtaining a first target token by processing the first token based on the associated token; obtaining a second target token by processing the second token based on the associated token; and recognizing the target shared token between the first modal data and the second modal data based on the first target token, the second target token, and the associated token.
However Sun teaches wherein recognizing the target shared token between the first modal data and the second modal data comprises: obtaining a first target token by processing the first token based on the associated token; obtaining a second target token by processing the second token based on the associated token; and recognizing the target shared token between the first modal data and the second modal data based on the first target token, the second target token, and the associated token (Sun Page 13862,
[figure from page 13862 of Sun reproduced]
[obtaining a first target token by processing the first token based on the associated token; obtaining a second target token by processing the second token based on the associated token;]
Page 13863, Multitask Training for MNER para 7, Combining Task#1 and Task#2, the complete training procedure of RpBERT for MNER is illustrated in Algorithm 1. θRpBERT , θResNet, θFCs, θbiLSTM, and θCRF represent the parameters of RpBERT, ResNet, FCs, biLSTM, and CRF, respectively. In each epoch, the procedure first performs Task#1 to train the text-image relation on the TRC dataset and then performs Task#2 to train the model on MNER dataset. In the test stage, we execute lines 8-10 of Algorithm 1 and decode the valid sequence of labels using Viterbi algorithm (Lafferty, McCallum, and Pereira 2001).
Page 13863,
[figure from page 13863 of Sun reproduced]
[recognizing the target shared token between the first modal data and the second modal data based on the first target token, the second target token, and the associated token]).
Hu and Sun are considered to be analogous to the claimed invention because they are in the same field of machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Hu in view of Sun to incorporate obtaining target tokens based on the associated token. Doing so would better leverage visual features to enhance the context of the text (Sun Page 13864, Result of MNER Para line 10-16, The results show that the best “+ RpBERTGs” achieves increases of 4.5% and 7.3% compared to “biLSTM-CRF” on the Fudan Univ. and Snap Res. datasets, respectively. In terms of the role of visual features, the increase of “+ RpBERTGs” achieves approximately 2.3% compared to “+ BERT”, which is larger than those of the biLSTM-CRF based multimodal models such as Zhang et al. (2018) and Lu et al. (2018) compared to biLSTM-CRF. This indicates that the RpBERT model can better leverage visual features to enhance the context of tweets.).
Claim(s) 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Hu in view of Huang and further in view of Xiujun Li et al., "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks," 2020, pgs. 121-137 ("Li").
Regarding claim 7 and analogous claim 16, Hu and Huang teach the method of claim 1.
Hu and Huang are combined in the same rationale as set forth above with respect to claim 1 and analogous claims 10 and 19.
Huang does not explicitly teach wherein determining the initial grounded token comprises: determining cluster description information between the first token and the second token; determining a grounded token matching the cluster description information from [[a]] the grounded dictionary as the initial grounded token; wherein the grounded dictionary comprises: pieces of cluster description information and grounded tokens matching the pieces of cluster description information.
However Li teaches wherein determining the initial grounded token comprises: determining cluster description information between the first token and the second token (Li Page 123 Figure. 2(c),
[Figure 2(c) of Li reproduced]
(i.e. each cluster has determined cluster description information));
determining [[a]] the grounded token matching the cluster description information from a grounded dictionary as the initial grounded token; wherein the grounded dictionary comprises: pieces of cluster description information and grounded tokens matching the pieces of cluster description information (Li Page 3,
[figure from page 3 of Li reproduced]
(Li Page 124-125, Humans perceive the world through many channels. Even though any individual channel might be incomplete or noisy, important factors are still perceivable since they tend to be shared among multiple channels (e.g., dog can be described visually and verbally, as in Fig. 2). With this motivation, we propose a new … vector (i.e., R = 4 or 6). We concatenate v′ and z to form a position-sensitive region feature vector, which is further transformed into v using a linear projection to ensure that it has the same vector dimension as that of word embeddings. Meanwhile, the same Faster R-CNN is used to detect a set of high precision object tags. q is the sequence of word embeddings of the object tags [determining a grounded token matching the cluster description information from a grounded dictionary as the initial grounded token].
Page 125 Pre-Training Objective The Oscar input can be viewed from two different perspectives as
[input-view equation reproduced from Li]
where x is the modality view to distinguish the representations between a text and an image; while x′ is the dictionary view to distinguish the two different semantic spaces, in which the input is represented. The two-view perspective allows us to design a novel pre-training objective.
Page 125, A semantic space can be viewed as a vector space defined by a dictionary, which maps an input to a vector representation in the semantic space. For example, BERT can be viewed as a dictionary that defines a linguistic semantic space. BERT maps an input word or word sequence into a feature vector in the semantic space [wherein the grounded dictionary comprises: pieces of cluster description information and grounded tokens matching the pieces of cluster description information]).
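For illustration only, a minimal, hypothetical Python sketch of the Oscar-style input construction quoted above: a detected region feature is concatenated with its position vector and projected into the word-embedding space, while the detected object tag is looked up in a word-embedding dictionary that plays the role of the shared semantic space. The dimensions and IDs below are assumptions, not Li's exact values.

import torch
import torch.nn as nn

word_embeddings = nn.Embedding(30522, 768)          # dictionary mapping tag IDs to embeddings
project = nn.Linear(2048 + 6, 768)                  # region feature + 6-d position vector -> word space

region_feature = torch.randn(2048)                  # visual feature of a detected region
position = torch.randn(6)                           # e.g., normalized box coordinates
v = project(torch.cat([region_feature, position]))  # position-sensitive region feature
tag_id = torch.tensor([2084])                       # detected object tag ID (hypothetical, e.g., "dog")
q = word_embeddings(tag_id)                         # grounded token matching the tag description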
Hu and Li are considered to be analogous to the claimed invention because they are in the same field of machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Hu in view of Li to incorporate the use of a grounded token to obtain an associated token. Doing so allows faster convergence and faster training of the model (Li Page 12, (iii) Ground-truth Tags: The ground-truth tags from COCO dataset are utilized to serve as a performance "upper bound" for our method. The experiments are conducted with the same BERT base model on three representative tasks, including VQA, image retrieval, and image captioning. As shown in Fig. 6, the learning curves for fine-tuning with object tags converges significantly faster and better than the VLP method without tags on all tasks. On the VQA and retrieval tasks, training using tags only takes half of the training time to achieve the final performance of the baseline, showing that Oscar is a more practical and efficient scheme for VLP. With more accurate object detectors developed in the future, Oscar can achieve even better performance, closing the gap demonstrated by using the ground-truth tags).
Claim(s) 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Hu in view of Huang and Li and further in view of Chaudhary, C., Goyal, P., Tuli, S. et al. A novel multimodal clustering framework for images with diverse associated text. Multimed Tools Appl 78, 17623–17652 (2019) (“Chaudhary”).
Regarding claim 8 and analogous claim 17, Hu in view of Huang and Li teaches the method of claim 7 and analogous claim 16.
Hu and Huang are combined in the same rationale as set forth above with respect to claim 1 and analogous claims 10 and 19.
Hu and Li are combined in the same rationale as set forth above with respect to claim 7 and analogous claim 16.
Hu does not explicitly teach wherein determining the cluster description information between the first token and the second token, comprises: obtaining similarity information between a target image block token and a target text token, and determining the similarity information as the cluster description information; wherein the target image block token belongs to the plurality of image block tokens, the target text token belongs to the plurality of text tokens, and the target image block token and the target text token belong to the same data category obtained by clustering.
However Chaudhary teaches wherein determining the cluster description information between the first token and the second token, comprises: obtaining similarity information between a target image block token and a target text token (Chaudhary Page 17630-17631 3 Proposed Method Para 4-5, We use a bipartite graph G to model the relations between a set T = {t1, t2,..,tM} of M textual features and a set I = {i1,i2,…,iN} of N images. G is denoted by G = (T, I, E) where E ⊆T×I. Figure 3a shows a sample bipartite graph linking nine textual features (left vertices) with four images (right vertices). The keywords representing the textual features depend on the type of dataset used. An edge e = (tp, iq) ∈ E connects a text [a target text token] tp of textual features and an image iq when tp appears in the surrounding text of iq, or iq is clicked for the query containing tp, or tp is tagged for iq [a target image block token]. Every edge is associated with the similarity weights TW for textual features and VW for visual features),
and determining the similarity information as the cluster description information; wherein the target image block token belongs to the plurality of image block tokens, the target text token belongs to the plurality of text tokens, and the target image block token and the target text token belong to the same data category obtained by clustering (Chaudhary Page 17635 Case 2: Clustering of images with tags (CIT), We now consider a dataset which consists of images and their tags. An image may be tagged by many users. Tags hold valuable information about the image and often the tags/keywords reflect the objects and events of significance and can thus be considered as textual features. However, the tags can be noisy (as different users have different perceptions), misspelled, ambiguous etc. A bipartite graph is constructed using images and tags. If an image is tagged, an edge is created between them. We introduce the concept of neighborhood voting, which is used to calculate factors X in TW(tp, iq) and Y in VW(tp, iq) [and determining the similarity information as the cluster description information] (i.e. the tag and image information becomes the edge cluster description information))
Page 17629, Fig. 2
[Figure 2 of Chaudhary reproduced]
[wherein the target image block token belongs to the plurality of image block tokens, the target text token belongs to the plurality of text tokens, and the target image block token and the target text token belong to the same data category obtained by clustering]).
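For illustration only, a minimal, hypothetical Python sketch of the clustering idea mapped above: image block tokens and text tokens are clustered together, and for tokens that fall into the same data category (cluster), their similarity is taken as the cluster description information. This is an illustrative sketch under assumed names, not Chaudhary's bipartite-graph method.

import numpy as np
from sklearn.cluster import KMeans

def cluster_description(image_tokens, text_tokens, n_clusters=8):
    # image_tokens: (N, d) array, text_tokens: (S, d) array
    all_tokens = np.concatenate([image_tokens, text_tokens], axis=0)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(all_tokens)
    img_labels, txt_labels = labels[: len(image_tokens)], labels[len(image_tokens):]
    descriptions = {}
    for c in range(n_clusters):
        imgs = image_tokens[img_labels == c]     # target image block tokens in category c
        txts = text_tokens[txt_labels == c]      # target text tokens in the same category
        if len(imgs) and len(txts):
            a, b = imgs.mean(axis=0), txts.mean(axis=0)
            # cosine similarity used as the cluster description information
            descriptions[c] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return descriptions

info = cluster_description(np.random.randn(50, 64), np.random.randn(20, 64))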
Hu and Chaudhary are considered to be analogous to the claimed invention because they are in the same field of machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Hu in view of Chaudhary to incorporate clustering by obtaining similarity information. Doing so would improve the quality of image clusters (Chaudhary Abstract line 9-15, The proposed framework can be applied to a wide variety of image datasets with different characteristics, viz., search results with noisy surrounding text, and tagged images. It can also cluster image search queries and their corresponding clicked images. The respective datasets used include image search results, Flicker (NUS-WIDE), and Clickture (Bing querylog). The proposed framework is shown to be versatile on Clickture dataset, which has not been examined by any of the previous approaches. The experimental results show that MHCI significantly improves the quality of image clusters as compared to existing methods.).
Claim(s) 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Hu in view of Huang , Li, Chaudhary and further in view of Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, James Glass, Cross-Modal Discrete Representation Learning (2021) (“Liu”).
Regarding claim 9 and analogous claim 18, Hu in view of Huang, Li, and Chaudhary teaches the method of claim 8 and analogous claim 17.
Hu and Huang are combined in the same rationale as set forth above with respect to claim 1 and analogous claims 10 and 19.
Hu and Li are combined in the same rationale as set forth above with respect to claim 7 and analogous claim 16.
Hu and Chaudhary are combined in the same rationale as set forth above with respect to claim 8 and analogous claim 17.
Hu does not explicitly teach wherein obtaining the associated token by fusing and encoding the first token, the second token and the initial grounded token, comprises: determining fusion weight information based on the similarity information; and obtaining the associated token by fusing and encoding the first token, the second token and the initial grounded token based on the fusion weight information.
However Liu teaches wherein obtaining the associated token by fusing and encoding the first token, the second token and the initial grounded token, comprises: determining fusion weight information based on the similarity information; and obtaining the associated token by fusing and encoding the first token, the second token and the initial grounded token based on the fusion weight information (Liu Page 2 Figure 1,
This distribution is essentially the normalized frequency of codeword usage for a given sequence of fine-grained representations. Next, for a pair of cross-modal data (x^A_i, x^B_j), we define their code similarity as the negative symmetric cross entropy of probability distribution over the codebook [based on the similarity information]
[code similarity equation reproduced from Liu]
Finally, we propose the Cross-Modal Code Matching (CMCM) objective using code similarity as
[Cross-Modal Code Matching objective equation reproduced from Liu]
Intuitively, the proposed objective encourages the model to represent the input (x^A_i, x^B_j) with similar codewords for positive pairs (i = j) and non-matching codewords for negative pairs (i ≠ j). As a consequence, each codeword is expected to be a modality invariant representation of a more fine-grained concept, action, or word that can be discovered from cross-modal data. For example, a codeword could correspond to both the visual scene of a man juggling, and also the spoken word “juggling,” as we demonstrate in our experimental results in Table 2 and Figure 4.
The full objective of our proposed cross-modal representation learning framework is the combination of objectives at different levels
[combined objective equation reproduced from Liu]
where the weighting hyperparameter controls the weight between the two terms. Empirically, we found a value of 0.1 worked well across different settings [determining fusion weight information]
Liu Page 2,
[figure from page 2 of Liu reproduced]
2 Methodology, Figure 1 provides an overview of the proposed framework. We begin by describing the two-branch cross-modal representation learning paradigm in Section 2.1 (the blue and yellow regions). Next, we introduce our shared discrete embedding space in Section 2.2 (the green region). Finally, in Section 2.3 and Figure 2, we introduce the Cross-Modal Code Matching objective which guides the model to learn semantically meaningful representations through the shared discrete embedding space [by fusing and encoding the first token, the second token and the initial grounded token based on the fusion weight information].
Page 8,
[figure from page 8 of Liu reproduced]
[obtaining the associated token]).
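For illustration only, a minimal, hypothetical Python sketch of the weighting idea mapped above: each modality's tokens are assigned to a shared codebook, the normalized codeword-usage distributions are compared with a symmetric cross entropy, and a small weighting hyperparameter combines the resulting term with the rest of the objective. The exact formulation is in Liu; the names and values below are assumptions.

import numpy as np

def codeword_usage(tokens, codebook):
    # assign each token to its nearest codeword and return the normalized usage histogram
    dists = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (T, K)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(codebook))
    return counts / counts.sum()

def code_similarity(p, q, eps=1e-9):
    # negative symmetric cross entropy between two codeword-usage distributions
    return float(np.sum(p * np.log(q + eps)) + np.sum(q * np.log(p + eps)))

codebook = np.random.randn(32, 64)
p = codeword_usage(np.random.randn(50, 64), codebook)   # e.g., image-side tokens
q = codeword_usage(np.random.randn(20, 64), codebook)   # e.g., text-side tokens
similarity = code_similarity(p, q)
weight = 0.1   # weighting hyperparameter combining this term with the retrieval objective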
Hu and Liu are considered to be analogous to the claimed invention because they are in the same field of machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Hu in view of Liu to determine an associated token based on similarity information between tokens by determining a fusion weight. Doing so allows a self-supervised learning framework to learn representations between different modalities (Liu Abstract, Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work we present a self-supervised learning framework that is able to learn a representation that captures finer levels of granularity across different modalities such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space such that cross-modal objects/actions localization can be performed without direct supervision. In our experiments we show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALFREDO CAMPOS whose telephone number is (571)272-4504. The examiner can normally be reached 7:00 - 4:00 pm M - F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J. Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ALFREDO CAMPOS/Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129