DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The disclosure is objected to because of the following informalities:
In para 43, “with the teach image embedding 224” should read “with the teacher image embedding 224”
In para 44, “with the teacher text embedding 226 a masked distillation loss” should read “with the teacher text embedding 226 using a masked distillation loss”
In para 96, “The acts of FIGS. 9-11 are be performed as part of a method” should read “The acts of FIGS. 9-11 are to be performed as part of a method”
Appropriate correction is required.
Claim Objections
Claims 5-6 are objected to because of the following informalities: In claim 5, “extracting the text embedding…to a dimensionality of the pretrained large language model into” should read “generating the text embedding comprises…to a dimensionality of the pretrained large language model”. In claim 6, “extracting the image embedding comprises” should read “generating the image embedding comprises”. Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-11 and 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Yang et al. (cited in IDS - Yang, C., An, Z., Huang, L., Bi, J., Yu, X., Yang, H., ... & Xu, Y. (2023). CLIP-KD: An Empirical Study of CLIP Model Distillation. arXiv preprint arXiv:2307.12732.), hereinafter Yang, in view of Dong et al. (cited in IDS - Dong, X., Bao, J., Zheng, Y., Zhang, T., Chen, D., Yang, H., ... & Yu, N. (2023). Maskclip: Masked self-distillation advances contrastive language-image pretraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10995-11005)), hereinafter Dong, in further view of Singh et al. (Singh, P., De Clercq, O., Lefever, E. (2023). Distilling Monolingual Models from Large Multilingual Transformers. Electronics 2023, 12, 1022. https://doi.org/10.3390/electronics12041022), hereinafter Singh.
Regarding claim 1, Yang teaches a computer-implemented method (Yang, pg. 5, Training details: “All experiments are ran over 8 V100 GPUs”) comprising:
generating, utilizing a vision encoder of a vision-language model (Yang, Student visual encoder of the Masked Feature Distillation, MFD, model in FIG 1c attached below), an image embedding in a unified embedding space of the vision-language model (Yang, see image embedding in FIG 1c; section 3.1 on pg. 2: “CLIP performs an image-text alignment task to push the paired image-text close and unpaired ones apart in the feature embedding space”) from a masked digital image comprising a digital image with one or more masked patches (Yang, masked image input into MFD in FIG 1c; section 3.2.3 on pg. 3-4: “Masked Feature Distillation (MFD) uses masked images as the input to a student”);
generating, utilizing a text encoder of the vision-language model (Yang, Student text encoder of the MFD model in FIG 1c), a text embedding in the unified embedding space (Yang, see text embedding in FIG 1c; see also section 3.1 on pg. 2 cited above) from a text phrase comprising a text description of the digital image (Yang, text input into MFD in FIG 1c);
generating, utilizing a pretrained model, a teacher text embedding of the text description (Yang, section 4.3 on pg. 6: “Given the pretrained teacher CLIP model, we distill several light-weight student CLIP models with various architectures.”; see Teacher text encoder and embedding in FIG 1c); and
modifying parameters of the vision-language model according to a loss between the teacher text embedding and the text embedding generated by the text encoder (Yang, section 3.2.3 on pg. 4: “we utilize MSE loss to align the student’s and teacher’s visual and text embeddings.”; see sk difference value calculated in the total loss).
[media_image1.png: Yang, FIG. 1, greyscale]
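For illustration only, the MSE alignment Yang applies between student and teacher embeddings (section 3.2.3) can be sketched as follows; the array shapes and names are hypothetical and are not drawn from the reference:

```python
import numpy as np

def mse_align(student_emb, teacher_emb, proj=None):
    """MSE between student and teacher embeddings; an optional linear
    projection matches dimensions when they differ (hypothetical API)."""
    if proj is not None:
        student_emb = student_emb @ proj  # project student to teacher dim
    return float(np.mean((student_emb - teacher_emb) ** 2))

# toy batch: 4 student embeddings (dim 8) distilled toward a teacher (dim 8)
rng = np.random.default_rng(0)
s = rng.normal(size=(4, 8))
t = rng.normal(size=(4, 8))
loss = mse_align(s, t)  # positive; zero only when embeddings already match
```

Minimizing this loss with respect to the student parameters is what drives the student embedding toward the teacher embedding.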
Yang fails to explicitly teach (1) wherein the text phrase is a masked text phrase comprising a text description of the digital image with one or more masked tokens; and (2) wherein the pretrained model is a pretrained large language model (emphasis added). Additionally, since the text phrase is not masked, Yang fails to explicitly teach (3) wherein the loss is a masked distillation loss.
However, Dong similarly teaches a CLIP (Contrastive Language Image Pre-training) model (Dong, MaskCLIP). Dong teaches wherein the text phrase is a masked text phrase comprising a text description of the digital image with one or more masked tokens (Dong, Masked text description in FIG 1d, attached below; section 3.3 on pg. 10998: “masked text tokens”) and modifying parameters of the vision-language model according to a masked distillation loss (Dong, Token-wise distillation loss in FIG 1d; see loss value in section 3.3 on pg. 10997). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the masked text description and associated loss value, as taught by Dong, in the MFD CLIP model in the method of Yang in order to improve the model by jointly learning representations of the image and their corresponding masked descriptions, described by Dong as follows (Dong, last para on pg. 10995: “the learned representation for local patches shall possess semantic meanings, being consistent with the global representation receiving semantic text supervision”; 3rd para on pg. 10996: “we argue that local semantic supervision on the text branch is also helpful for the text encoder and eventually beneficial for zero-shot performance. So we introduce the same mask-data-modeling format supervision into the text branch as well”).
[media_image2.png: Dong, FIG. 1, greyscale]
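Dong's token-wise distillation over masked text positions can be sketched in the same spirit; the hidden-state values below are toy placeholders, not Dong's implementation:

```python
import numpy as np

def token_distill_loss(student_states, teacher_states, masked):
    """Token-wise MSE between student and teacher hidden states, taken
    only at the masked positions (toy version of a token-wise loss)."""
    diff = (student_states - teacher_states) ** 2
    return float(diff[masked].mean())

# toy sequence of 6 tokens (state dim 4) with 2 masked positions
student = np.zeros((6, 4))
teacher = np.ones((6, 4))
masked = np.array([False, True, False, False, True, False])
loss = token_distill_loss(student, teacher, masked)  # unit gap at each mask
```

Only the masked positions contribute, so the student must infer the teacher's representation of the missing tokens from the surrounding context.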
Lastly, Singh teaches a method wherein a pretrained large language model is utilized as a teacher for a smaller model (Singh, abstract: “apply knowledge distillation techniques to filter language-specific information from a large multilingual model into a small, fast monolingual model that can often outperform the teacher model”; see Large Language Models in the last para on pg. 3). Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have utilized a pretrained large language model to distill knowledge into a smaller model, as taught by Singh, in the method of Yang in view of Dong in order to refine the text encoder of the MFD CLIP model using the extensive language knowledge of a large language model (Singh, 1st para in section 7 on pg. 15: “The experimental results confirmed that language-distillation is viable, especially in low-resourced settings, and the resulting students were often able to outperform the teacher multilingual models while being up to four times smaller and six times faster for inference than their respective teachers”). Distilling knowledge from a large language model (LLM) for a student text embedding provides an advantage due to the extensive language knowledge of the LLM, as opposed to pretrained models less focused on language.
Regarding claim 2 (dependent on claim 1), Yang in view of Dong and Singh teaches further comprising modifying the parameters of the vision-language model according to an additional masked distillation loss (Yang, section 3.2.3 on pg. 4: “we utilize MSE loss to align the student’s and teacher’s visual and text embeddings.”; see vk difference value calculated in the total loss) between the image embedding generated by the vision encoder and a teacher image embedding generated by a pretrained vision foundation model (Yang, section 4.3 on pg. 6: “Given the pretrained teacher CLIP model, we distill several light-weight student CLIP models with various architectures.”; see Teacher visual encoder and embedding in FIG 1c).
Regarding claim 3 (dependent on claim 2), Yang in view of Dong and Singh teaches wherein modifying the parameters of the vision-language model according to the additional masked distillation loss comprises distilling features learned by the pretrained vision foundation model into the vision encoder of the vision-language model to encourage the vision encoder to learn to replicate the teacher image embedding of the pretrained vision foundation model from the masked digital image (Yang, section 3.2.3 on pg. 3-4: “The core idea is to recover the masked regions using contextual information modeling by a vision transformer… In the scenario of distillation, the teacher is a good supervisor that could provide valuable information to help the student recover the visual semantics given the masked image as input”).
Regarding claim 4 (dependent on claim 1), Yang in view of Dong and Singh teaches wherein modifying the parameters of the vision-language model according to the masked distillation loss comprises distilling features learned by the pretrained large language model into the text encoder of the vision-language model to encourage the text encoder to learn to replicate the teacher text embedding of the pretrained large language model from the masked text phrase (Performed with the student/teacher text embeddings of Yang, see section 3.2.3 on pg. 3-4 citation in claim 3 above; Further supported by the implementation of text masking in Dong, 3rd para on pg. 10996: “So we introduce the same mask-data-modeling format supervision into the text branch as well”).
Regarding claim 5 (dependent on claim 1), Yang in view of Dong and Singh teaches wherein: generating the text embedding comprises utilizing the text encoder of the vision-language model to project features from the unified embedding space of the vision encoder and the text encoder to a dimensionality of the pretrained large language model (Yang, section 3.2.2 on pg. 3: “when the embedding sizes between the teacher and student are different, we apply a linear projection head to student embeddings to match the dimension”; LLM taught in combination with Singh in claim 1); and modifying the parameters of the vision-language model is based on projecting the features (Projection affects the distillation loss value, thus modification of model parameters is based on the projection).
It is recognized that the citations and evidence provided above are derived from potentially different embodiments of a single reference (section 3.2.2 of Yang describes the Feature Distillation approach, not the Masked Feature Distillation). Nevertheless, it would have been obvious, before the effective filing date of the claimed invention, to a person having ordinary skill in the art to which the claimed invention pertains to employ combinations and sub-combinations of these complementary embodiments, because both feature distillation methods described by Yang utilize the same encoding architectures and loss values. Additionally, the projecting of features to a dimensionality of a teacher model is a known technique in the art.
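The linear projection head Yang describes in section 3.2.2 amounts to a single matrix multiply mapping the student's dimensionality to the teacher's; the dimensions below are hypothetical:

```python
import numpy as np

def project_to_teacher(student_emb, W):
    """Linear projection head mapping student embeddings (dim d_s) to the
    teacher's dimensionality (d_t) so a distillation loss can be computed."""
    return student_emb @ W  # (batch, d_s) @ (d_s, d_t) -> (batch, d_t)

# student dim 4, teacher dim 6 (toy sizes); W would be learned in training
W = np.zeros((4, 6))
s = np.ones((3, 4))
out = project_to_teacher(s, W)  # shape (3, 6), now comparable to teacher
```

Because the loss is computed on the projected embeddings, gradients flow through W and the student encoder alike, which is why modifying the model parameters is based on the projection.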
Regarding claim 6 (dependent on claim 1), Yang in view of Dong and Singh teaches wherein: generating the image embedding comprises utilizing the vision encoder of the vision-language model to project features from the unified embedding space of the vision encoder and the text encoder to a dimensionality of a pretrained vision foundation model (Yang, section 3.2.2 on pg. 3: “when the embedding sizes between the teacher and student are different, we apply a linear projection head to student embeddings to match the dimension”); and modifying the parameters of the vision-language model is based on projecting the features (Projection affects the distillation loss value, thus modification of model parameters is based on the projection).
It is recognized that the citations and evidence provided above are derived from potentially different embodiments of a single reference (section 3.2.2 of Yang describes the Feature Distillation approach, not the Masked Feature Distillation). Nevertheless, it would have been obvious, before the effective filing date of the claimed invention, to a person having ordinary skill in the art to which the claimed invention pertains to employ combinations and sub-combinations of these complementary embodiments, because both feature distillation methods described by Yang utilize the same encoding architectures and loss values. Additionally, the projecting of features to a dimensionality of a teacher model is a known technique in the art.
Regarding claim 7 (dependent on claim 1), Yang in view of Dong and Singh teaches further comprising modifying the parameters of the vision-language model to learn a projection from multilingual text embeddings of the pretrained large language model to text input embeddings of the text encoder (Yang teaches wherein the student text encoder learns a projection from the teacher encoder using a projection head – see linear projection head in section 3.2.2 on pg. 3, while the combination with Singh teaches wherein the teacher may be a multilingual large language model).
It is recognized that the citations and evidence provided above are derived from potentially different embodiments of a single reference (section 3.2.2 of Yang describes the Feature Distillation approach, not the Masked Feature Distillation). Nevertheless, it would have been obvious, before the effective filing date of the claimed invention, to a person having ordinary skill in the art to which the claimed invention pertains to employ combinations and sub-combinations of these complementary embodiments, because both feature distillation methods described by Yang utilize the same encoding architectures and loss values. Additionally, the projecting of features to a dimensionality of a teacher model is a known technique in the art.
Regarding claim 8, Yang teaches a non-transitory computer readable medium storing executable instructions which, when executed by a processing device (Yang, instructions to execute the training and test the machine-learned models, pg. 5, Training details: “All experiments are ran over 8 V100 GPUs”), cause the processing device to perform operations comprising:
generating, utilizing a vision encoder of a vision-language model (Yang, Student visual encoder of the Masked Feature Distillation, MFD, model in FIG 1c), an image embedding in a unified embedding space of the vision-language model (Yang, see image embedding in FIG 1c; section 3.1 on pg. 2: “CLIP performs an image-text alignment task to push the paired image-text close and unpaired ones apart in the feature embedding space”) from a masked digital image comprising a digital image with one or more masked patches (Yang, masked image input into MFD in FIG 1c; section 3.2.3 on pg. 3-4: “Masked Feature Distillation (MFD) uses masked images as the input to a student”);
generating, utilizing a text encoder of the vision-language model (Yang, Student text encoder of the MFD model in FIG 1c) and from a text phrase comprising a text description of the digital image (Yang, text input into MFD in FIG 1c), a text embedding in the unified embedding space (Yang, see text embedding in FIG 1c; see also section 3.1 on pg. 2 cited above) by projecting features from the unified embedding space to a dimensionality of a pretrained model (Yang, section 3.2.2 on pg. 3: “when the embedding sizes between the teacher and student are different, we apply a linear projection head to student embeddings to match the dimension”); and
modifying parameters of the vision-language model based on projecting the features (Projection affects the distillation loss value, thus modification of model parameters is based on the projection).
Regarding the projection of features, it is recognized that the citations and evidence provided above are derived from potentially different embodiments of a single reference (section 3.2.2 of Yang describes the Feature Distillation approach, not the Masked Feature Distillation). Nevertheless, it would have been obvious, before the effective filing date of the claimed invention, to a person having ordinary skill in the art to which the claimed invention pertains to employ combinations and sub-combinations of these complementary embodiments, because both feature distillation methods described by Yang utilize the same encoding architectures and loss values. Additionally, the projecting of features to a dimensionality of a teacher model is a known technique in the art.
Yang fails to explicitly teach (1) wherein the text phrase is a masked text phrase comprising a text description of the digital image with one or more masked tokens; and (2) wherein the pretrained model is a pretrained large language model (emphasis added).
However, Dong similarly teaches a CLIP (Contrastive Language Image Pre-training) model (Dong, MaskCLIP). Dong teaches wherein the text phrase is a masked text phrase comprising a text description of the digital image with one or more masked tokens (Dong, Masked text description in FIG 1d; section 3.3 on pg. 10998: “masked text tokens”). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the masked text description, as taught by Dong, in the MFD CLIP model in the method of Yang in order to improve the model by jointly learning representations of the image and their corresponding masked descriptions, described by Dong as follows (Dong, last para on pg. 10995: “the learned representation for local patches shall possess semantic meanings, being consistent with the global representation receiving semantic text supervision”; 3rd para on pg. 10996: “we argue that local semantic supervision on the text branch is also helpful for the text encoder and eventually beneficial for zero-shot performance. So we introduce the same mask-data-modeling format supervision into the text branch as well”).
Additionally, Singh teaches a method wherein a pretrained large language model is utilized as a teacher for a smaller model (Singh, abstract: “apply knowledge distillation techniques to filter language-specific information from a large multilingual model into a small, fast monolingual model that can often outperform the teacher model”; see Large Language Models in the last para on pg. 3). Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have utilized a pretrained large language model to distill knowledge into a smaller model, as taught by Singh, in the method of Yang in view of Dong in order to refine the text encoder of the MFD CLIP model using the extensive language knowledge of a large language model (Singh, 1st para in section 7 on pg. 15: “The experimental results confirmed that language-distillation is viable, especially in low-resourced settings, and the resulting students were often able to outperform the teacher multilingual models while being up to four times smaller and six times faster for inference than their respective teachers”). Distilling knowledge from a large language model (LLM) for a student text embedding provides an advantage due to the extensive language knowledge of the LLM, as opposed to pretrained models less focused on language.
Regarding claim 9 (dependent on claim 8), Yang in view of Dong and Singh teaches wherein the operations further comprise: generating the image embedding by utilizing the vision encoder of the vision-language model to project features from the unified embedding space to a dimensionality of a pretrained vision foundation model (Yang, section 3.2.2 on pg. 3: “when the embedding sizes between the teacher and student are different, we apply a linear projection head to student embeddings to match the dimension”); and modifying the parameters of the vision-language model based on projecting the features to the pretrained vision foundation model (Projection affects the distillation loss value, thus modification of model parameters is based on the projection).
Regarding claim 10 (dependent on claim 8), Yang in view of Dong and Singh teaches wherein the operations further comprise: generating, utilizing the vision-language model to process the masked digital image and the masked text phrase, a predicted text embedding of the text description in a unified embedding space of the vision encoder and the text encoder (Yang, operation of CLIP, section 1 on pg. 1: “applies contrastive learning to (image, text) pairs. It guides the model to predict the correct (image, text) pair among the candidate image and text samples”; section 3.1 on pg. 2: “CLIP performs an image-text alignment task to push the paired image-text close and unpaired ones apart in the feature embedding space”; Masking of the text phrase taught in combination with Dong in claim 8); and modifying the parameters of the vision-language model based on a masked distillation loss between the predicted text embedding and a teacher text embedding generated by the pretrained large language model (Masked distillation loss taught in combination with Dong in the same way as described in claim 1– Yang teaches a loss between the predicted student and teacher text embeddings, see sk difference value in section 3.2.3, while Dong teaches applying this loss for masked text, see Token-wise distillation loss in FIG 1d and loss value in section 3.3 on pg. 10997; LLM as the teacher taught by Singh in claim 8).
Regarding claim 11 (dependent on claim 8), Yang in view of Dong and Singh teaches wherein the operations further comprise: generating, utilizing the vision-language model to process the masked digital image and the masked text phrase, a predicted image embedding of the digital image in a unified embedding space of the vision encoder and the text encoder (Yang, operation of CLIP, section 1 on pg. 1: “applies contrastive learning to (image, text) pairs. It guides the model to predict the correct (image, text) pair among the candidate image and text samples”; section 3.1 on pg. 2: “CLIP performs an image-text alignment task to push the paired image-text close and unpaired ones apart in the feature embedding space”; Masking of the text phrase taught in combination with Dong in claim 8); and modifying the parameters of the vision-language model based on a masked distillation loss between the predicted image embedding and a teacher image embedding generated by a pretrained vision foundation model (Yang teaches a loss between the predicted student and teacher visual embeddings, see vk difference value in section 3.2.3).
Regarding claim 13 (dependent on claim 8), Yang in view of Dong and Singh teaches wherein the operations further comprise: generating, utilizing the vision-language model to process the masked digital image and the masked text phrase, a predicted text embedding of the text description of the digital image (Taught by the text encoders of Yang in view of Dong, see claim 8); and modifying the parameters of the vision-language model using a contrastive loss to predict correctness of the predicted text embedding (Yang, contrastive loss for image-text alignment in section 3.1 on pg. 2; Further supported by the use of both a contrastive loss and distillation loss in Dong, see section 3.1 on pg. 10997).
Regarding claim 14 (dependent on claim 8), Yang in view of Dong and Singh teaches wherein the operations further comprise: generating, utilizing the vision-language model to process the masked digital image and the masked text phrase, a predicted image embedding of the digital image (Taught by the vision encoders of Yang in view of Dong, see claim 8); and modifying the parameters of the vision-language model using a contrastive loss to predict correctness of the predicted image embedding (Yang, contrastive loss for image-text alignment in section 3.1 on pg. 2; Further supported by the use of both a contrastive loss and distillation loss in Dong, see section 3.1 on pg. 10997).
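The contrastive image-text alignment cited from section 3.1 of Yang (and section 3.1 of Dong) can be sketched as a symmetric cross-entropy over batch similarities; the embeddings and temperature value below are hypothetical:

```python
import numpy as np

def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric contrastive loss: paired (image, text) rows are positives,
    all other pairings in the batch serve as negatives."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    labels = np.arange(len(img))                # correct pair is the diagonal
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return float((xent(logits) + xent(logits.T)) / 2)

aligned = clip_contrastive_loss(np.eye(4), np.eye(4))  # near zero: pairs match
```

Pushing the loss down pulls each paired image-text embedding together and pushes unpaired ones apart, which is the alignment behavior the quoted passages describe.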
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Dong, Singh, and Sakuma et al. (Sakuma, J., & Yoshinaga, N. (2019, November). Multilingual model using cross-task embedding projection. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) (pp. 22-32).), hereinafter Sakuma.
Regarding claim 12 (dependent on claim 8), Yang in view of Dong and Singh teaches wherein the operations further comprise modifying the parameters of the vision-language model to learn a projection from multilingual text embeddings of the pretrained large language model to text input embeddings of the text encoder (Yang teaches wherein the student text encoder learns a projection from the teacher encoder using a projection head – see linear projection head in section 3.2.2 on pg. 3, while the combination with Singh teaches wherein the teacher may be a multilingual large language model).
Since the model of Yang does not input multilingual training data, it is not explicitly taught that the projection would be learned without inputting multilingual training data into the vision-language model. However, Sakuma teaches a cross-lingual mapping from one embedding to another embedding of a trained model without inputting multilingual training data into the trained model (Sakuma, see FIG 1 attached below and the cross-task projection based on dimension reduction at the end of pg. 24 into pg. 25; bottom left of pg. 28: “our projection successfully induced task-specific cross-lingual word embeddings”). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the embedding layer projection, taught by Sakuma, in the vision-language model of Yang in view of Dong and Singh in order to benefit from the multilingual capabilities of the LLM without requiring the training data that trained the LLM (Sakuma, section 6 on pg. 30: “The locally linear mapping assumes and preserves the local topology across the semantic spaces before and after the projection. Experimental results demonstrated that the locally linear mapping successfully obtains task-specific word embeddings of the target language, and the resulting fully task-specific multilingual model exhibited better model accuracy than the existing multilingual model that fixes its embedding layer to general word embeddings”).
[media_image3.png: Sakuma, FIG. 1, greyscale]
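Sakuma's cross-task embedding projection is, at its core, a learned linear map between two embedding spaces; the sketch below substitutes a single global least-squares map for the paper's locally linear mapping, using toy data:

```python
import numpy as np

def fit_embedding_projection(src, tgt):
    """Least-squares linear map W with src @ W ≈ tgt: a simplified
    stand-in for a cross-lingual/cross-task embedding projection (the
    reference uses a locally linear mapping; a global map is used here)."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

# toy anchor vocabulary represented in both spaces (synthetic data)
rng = np.random.default_rng(0)
src = rng.normal(size=(20, 5))          # source-space vectors, dim 5
true_W = rng.normal(size=(5, 3))
tgt = src @ true_W                      # target-space vectors, dim 3
W = fit_embedding_projection(src, tgt)  # recovers the underlying map
```

Once fitted on anchor pairs, the map can project any source-space vector into the target space without retraining the target model, mirroring the "without inputting multilingual training data" rationale above.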
Claims 15-16 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Sakuma and Singh.
Regarding claim 15, Yang teaches a system comprising: one or more memory devices; and one or more processors coupled to the one or more memory devices (Yang, storage and computing components allowing for the execution of the training and testing of the machine-learned models, pg. 5, Training details: “All experiments are ran over 8 V100 GPUs”), the one or more processors configured to cause the system to perform operations comprising:
processing a digital image utilizing a vision-language model comprising a vision encoder (Yang, Student visual encoder of the Masked Feature Distillation, MFD, model in FIG 1c) and a text encoder (Yang, Student text encoder of the MFD model in FIG 1c) trained to project features from a dimensionality of a model into a unified embedding space of the vision encoder and the text encoder (Yang, section 3.2.2 on pg. 3: “when the embedding sizes between the teacher and student are different, we apply a linear projection head to student embeddings to match the dimension”); and
generating, utilizing the vision-language model, a vision-language output by processing the digital image (Yang, see section 4.3 on pg. 6-7 describing output testing of the distilled smaller CLIP models).
Yang fails to teach wherein the text encoder is trained to project a lookup table of features from a dimensionality of a large language model trained on multilingual data into a unified embedding space of the vision encoder and the text encoder; and generating an output by utilizing the lookup table projected by the text encoder (emphasis added).
However, Sakuma teaches a cross-lingual mapping from one embedding to another embedding of a trained model (Sakuma, see FIG 1 attached above and the cross-task projection based on dimension reduction at the end of pg. 24 into pg. 25; bottom left of pg. 28: “our projection successfully induced task-specific cross-lingual word embeddings”), thus teaching to project a lookup table of features (Sakuma, the implemented cross-lingual embedding layer of the trained model) from a dimensionality of a cross-lingual embedding trained on multilingual data (Sakuma, the general cross-lingual word embedding, see FIG 1, contains English and French words) into another embedding space (Sakuma, see FIG 1). It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have combined the embedding layer projection, taught by Sakuma, in the system of Yang in order to implement multilingual capabilities in the vision-language model (Sakuma, section 6 on pg. 30: “The locally linear mapping assumes and preserves the local topology across the semantic spaces before and after the projection. Experimental results demonstrated that the locally linear mapping successfully obtains task-specific word embeddings of the target language, and the resulting fully task-specific multilingual model exhibited better model accuracy than the existing multilingual model that fixes its embedding layer to general word embeddings”). In the combination of Yang in view of Sakuma, the student encoder of the vision-language model can utilize the projected lookup table as claimed.
Additionally, Singh teaches a method wherein a large language model trained on multilingual data is utilized as a teacher for a smaller model (Singh, abstract: “apply knowledge distillation techniques to filter language-specific information from a large multilingual model into a small, fast monolingual model that can often outperform the teacher model”; see Large Language Models in the last para on pg. 3). Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have utilized the multilingual large language model to distill knowledge into a smaller model, as taught by Singh, in the method of Yang in view of Sakuma in order to refine the text encoder of the MFD CLIP model using the extensive language knowledge of a large language model (Singh, 1st para in section 7 on pg. 15: “The experimental results confirmed that language-distillation is viable, especially in low-resourced settings, and the resulting students were often able to outperform the teacher multilingual models while being up to four times smaller and six times faster for inference than their respective teachers”).
Regarding claim 16 (dependent on claim 15), Yang in view of Sakuma and Singh teaches wherein the one or more processors are further configured to cause the system to generate the vision-language output (Yang, section 4.3 on pg. 6-7: “Given the pretrained teacher CLIP model, we distill several lightweight student CLIP models with various architectures.”) by using the vision-language model to determine a classification of the digital image (Yang, section 4.3 on pg. 6-7: “zero-shot ImageNet classification”).
Regarding claim 19 (dependent on claim 15), Yang in view of Sakuma and Singh teaches wherein the one or more processors are further configured to cause the system to generate the vision-language output (Yang, section 4.3 on pg. 6-7: “Given the pretrained teacher CLIP model, we distill several lightweight student CLIP models with various architectures.”) by using the vision-language model to retrieve, from a digital image database, one or more digital images corresponding to the digital image (Yang, section 4.3 on pg. 6-7: “cross-modal retrieval…text -> image retrieval”).
Regarding claim 20 (dependent on claim 15), Yang in view of Sakuma and Singh teaches wherein the one or more processors are further configured to cause the system to generate an additional vision-language output (Yang, section 4.3 on pg. 6-7: “Given the pretrained teacher CLIP model, we distill several lightweight student CLIP models with various architectures.”) by using the vision-language model to generate a caption that describes relational composition of objects depicted in the digital image (Yang, section 4.3 on pg. 6-7: “cross-modal retrieval…image -> text retrieval”).
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Sakuma, Singh, and Dong.
Regarding claim 17 (dependent on claim 15), Yang in view of Sakuma and Singh fails to explicitly teach wherein the one or more processors are further configured to cause the system to generate the vision-language output by using the vision-language model to determine segmentations of objects depicted within the digital image.
Dong teaches a vision-language output by using the vision-language model to determine segmentations of objects depicted within the digital image (Dong, last para in left column on pg. 10996: “We train our MaskCLIP on a subset of a publicly available image-text pairs dataset…semantic segmentation…detection and segmentation”). Yang teaches a system comprising a vision-language model. The use of vision-language models to segment objects within a digital image is a known technique (see MPEP 2143(I)(D)). The training and use of the CLIP model of Dong to detect and segment objects from surrounding pixels could be applied to the CLIP model of Yang in view of Sakuma and Singh. Therefore, a person having ordinary skill in the art, before the effective filing date of the claimed invention, could have applied the known technique, as taught by Dong, in the same way to the system of Yang in view of Sakuma and Singh and achieved predictable results of obtaining a computer vision model for segmentation tasks (See “Semantic segmentation on ADE20K” and “Object detection and instance segmentation on MS-COCO” on pg. 11000 of Dong).
Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Sakuma, Singh, and Goudar et al. (Goudar, R. H., Dhananjaya, G. M., Kambar, V. A., Kulkarni, A., Deshpande, S. L., & Rathod, V. (2023, November). Translingual Image-to-Text Conversion: Bridging Visual and Multilingual Semantic Representations. In 2023 IEEE North Karnataka Subsection Flagship International Conference (NKCon) (pp. 1-6). IEEE.), hereinafter Goudar.
Regarding claim 18 (dependent on claim 15), Yang in view of Sakuma and Singh fails to explicitly teach wherein the one or more processors are further configured to cause the system to generate the vision-language output by using the vision-language model to generate a non-English caption for the digital image.
Goudar teaches a vision-language output by using a vision-language model to generate a non-English caption for the digital image (Goudar, abstract: “translingual image-to-text conversion… the suggested method's efficiency in accurately converting images to multilingual text”; based on a multilingual embedding, section D on pg. 3: “The extracted visual features and linguistic embeddings are fused to create a unified multilingual semantic representation of the picture content”). Yang teaches a system comprising a vision-language model. The use of vision-language models to generate captions for a digital image is a known technique (see MPEP 2143(I)(D)). The training and use of the multilingual model of Goudar to generate non-English captions could be applied to the CLIP model of Yang in view of Sakuma and Singh. Therefore, a person having ordinary skill in the art, before the effective filing date of the claimed invention, could have applied the known technique, as taught by Goudar, in the same way to the system of Yang in view of Sakuma and Singh and achieved predictable results of obtaining a computer vision model for multilingual captioning.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Chen et al. (Chen, C., Zhong, A., Wu, D., Luo, J., & Li, Q. (2023). Contrastive Masked Image-Text Modeling for Medical Visual Representation Learning. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).) teaches a masked image input to an image encoder and a masked text description input to a text encoder.
[Greyscale image: media_image4.png]
Li et al. (Li, P., Liu, G., He, J., Zhao, Z., & Zhong, S. (2023). Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering. arXiv preprint arXiv:2307.05314.) teaches a masked image and text vision-language model using contrastive learning.
Gupta et al. (Gupta, K., Gautam, D., & Mamidi, R. (2022). cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation. arXiv preprint arXiv:2206.03354.) teaches a cross-lingual vision-language model using knowledge distillation.
Lu et al. (Lu, J., Zhang, D., Wu, X., Gao, X., Gan, R., Zhang, J., ... & Zhang, P. (2023). Ziya-visual: Bilingual large vision-language model via multi-task instruction tuning. arXiv preprint arXiv:2310.08166.) teaches a bilingual vision-language model.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EMMA E DRYDEN whose telephone number is (571)272-1179. The examiner can normally be reached M-F 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ANDREW BEE can be reached at (571) 270-5183. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/EMMA E DRYDEN/Examiner, Art Unit 2677
/ANDREW W BEE/Supervisory Patent Examiner, Art Unit 2677