Prosecution Insights
Last updated: April 19, 2026
Application No. 18/771,779

GENERATING MULTILINGUAL VISION LANGUAGE MODELS UTILIZING CONTRASTIVE LANGUAGE IMAGE PRETRAINING

Non-Final OA: §102, §103
Filed: Jul 12, 2024
Examiner: WOZNIAK, JAMES S
Art Unit: 2655
Tech Center: 2600 — Communications
Assignee: Adobe Inc.
OA Round: 1 (Non-Final)
Grant Probability: 59% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 3y 7m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 59% (227 granted / 385 resolved; -3.0% vs TC avg)
Interview Lift: +40.1% (strong; resolved cases with interview vs. without)
Avg Prosecution: 3y 7m (42 currently pending)
Total Applications: 427 across all art units

Statute-Specific Performance

§101: 18.1% (-21.9% vs TC avg)
§103: 40.1% (+0.1% vs TC avg)
§102: 18.4% (-21.6% vs TC avg)
§112: 16.1% (-23.9% vs TC avg)
Based on career data from 385 resolved cases; Tech Center average used as the baseline estimate.

Office Action

Rejections under §102 and §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner Notes on Patent Subject Matter Eligibility Under 35 U.S.C. 101

Independent claims 1, 8, and 14 recite a process for "training a multilingual large language model" that uses a combination of a vision encoder and a multilingual LLM to determine similarity metrics and adjust parameters based upon a contrastive loss function. While certain steps could be performed by a human under the broadest reasonable interpretation (BRI), such as determining pairings between images and text by mentally evaluating the text and images to judge relationships between the multimodal data, the claim represents a technical process for improving a specific multilingual language model and does not involve organizing any human behavior. Moreover, while certain steps may involve mathematics, such as a contrastive loss function, the process is not purely mathematical. Accordingly, since the independent claims, and their dependents by virtue of their dependency, do not fall within the categories of abstract ideas, claims 1-20 are found to be directed towards patent-eligible subject matter under Step 2A, Prong 1.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 
102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-4, 8-9, and 14-16 are rejected under 35 U.S.C. 102(a)(1)/(a)(2) as being anticipated by Chen, et al. (U.S. PG Publication: 2024/0153239 A1).

With respect to Claim 1, Chen discloses: A computer-implemented method comprising: training a multilingual large language model to embed text into an embedding space of a vision language model, the vision language model comprising a text encoder for a first language and a vision encoder (training of a "transformer model" having an encoder that associates natural language processing text in a first language (e.g., English) with image objects in a computer vision application, wherein text embeddings (225) are placed into a shared embedding space (232) with a vision/object embedding (215), and wherein the inputs to such a model space include a text encoder and a vision/object encoder, Paragraphs 0024-0027, 0033, and 0052-0053; Fig. 2A; note that the "transformer model" that handles natural language inputs in a deep learning context constitutes a large language model), by:

determining pairings between images and text corresponding to the images, the text being in languages other than the first language (determining an image and "associated text" in a pairing, Paragraphs 0023 and 0031; see "Assoc." in Fig. 2A; the text may be in a language other than a first language such as English, Paragraph 0025);

generating, utilizing the vision encoder, image embeddings for the images (generating "object embeddings 215" that are "derived from...the image," Paragraph 0024; Fig. 2A, Element 210);

generating, utilizing the multilingual large language model, text embeddings for the text (multilingual transformer generates "text embeddings" for multilingual text inputs, Paragraph 0025; Fig. 2, Element 220);

determining similarity metrics between the image embeddings for the images and the text embeddings for the text (after the embeddings (i.e., object and text) are projected into the shared space, a "similarity measure" between such embeddings is determined, Paragraphs 0026 and 0033); and

adjusting parameters of the multilingual large language model to reduce an output of a contrastive loss function based on the similarity metrics without adjusting parameters of the vision encoder (weight parameters for the transformer models are trained based upon a loss function so that text embeddings are closer to "positive object embeddings than to negative object embeddings," wherein the use of loss in relation to positive and negative examples constitutes a contrastive loss function, Paragraphs 0017, 0023, 0026, 0033, and 0060; specifically note the discussion at Paragraph 0033, where it is described that the training pertains to the text transformer encoder and that the text embeddings are guided in the shared space to be closer to the appropriate object embeddings; accordingly, the object detector/encoder weights are not adjusted in the similarity-based training operation). 
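The claim 1 training scheme mapped above (similarity metrics between image and text embeddings, a contrastive loss over positive and negative pairs, and a frozen vision encoder) is essentially CLIP-style contrastive training. A minimal NumPy sketch of the symmetric contrastive (InfoNCE) loss, assuming the embeddings have already been computed, might look like:

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss.

    Row i of each matrix is a matched image/text pair; every other
    row in the batch serves as an in-batch negative. In the claimed
    setup only the text side (the multilingual LLM) would receive
    gradient updates; the vision encoder stays frozen.
    """
    # Normalize so dot products are cosine similarities (the claim's
    # "similarity metrics").
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Lowering this loss pulls each matched pair together and pushes mismatched pairs apart, which is the behavior the rejection reads onto Chen's positive/negative object embeddings.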
With respect to Claim 2, Chen further discloses: The computer-implemented method of claim 1, further comprising combining the multilingual large language model with the vision encoder of the vision language model to create a multilingual vision language model for predicting text-image pairs (the NLP transformer model is combined with the object encoder and trained vector space to predict a most likely text-image pair having a highest similarity score, Paragraphs 0019-0020, 0023, and 0025-0026; Fig. 2A, Elements 210, 215, 220, and 225).

With respect to Claim 3, Chen further discloses: The computer-implemented method of claim 2, further comprising processing a query text (Paragraph 0019, user text query returning database results), in a language other than the first language, through the multilingual vision language model to determine one or more digital images corresponding to the query text (handling second-language (e.g., non-English) texts using the transformer/LLM to determine at least one image object of interest with a highest similarity, Paragraphs 0025-0026; Fig. 2A, Element 240). 
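At inference time, the claim 3 flow (a non-first-language query routed through the combined model to find matching images) reduces to nearest-neighbor search in the shared embedding space. A small illustrative sketch, assuming the query has already been embedded by the multilingual text encoder and the gallery by the vision encoder (the function name and inputs are hypothetical):

```python
import numpy as np

def retrieve(query_emb, image_embs, image_ids):
    """Rank gallery images by cosine similarity to a text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q            # one similarity score per gallery image
    order = np.argsort(-scores)  # highest similarity first
    return [image_ids[i] for i in order]
```

The top-ranked id plays the role of Chen's "image object of interest with a highest similarity."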
With respect to Claim 4, Chen discloses: determining the pairings between the images and the text comprises: determining a first pairing between a first image and a first text caption (a plurality of "text embeddings" of text captions associated with images are projected into a shared space with a plurality of "image embeddings," wherein a "similarity measure" between a text embedding and its positive/first object is determined, Paragraphs 0026 and 0033); and determining a second pairing between the first image and a second text caption (a similarity measure between another text embedding and the first object as a negative object is determined, Paragraphs 0026 and 0033);

determining the similarity metrics between the image embeddings and the text embeddings comprises: determining a first similarity metric for the first pairing (see the above determination of a "similarity measure" or "similarity score" and the first pairing described above, Paragraphs 0026 and 0033); and determining a second similarity metric for the second pairing (see the above determination of a "similarity measure" or "similarity score" and the second pairing described above, Paragraphs 0026 and 0033); and

adjusting the parameters of the multilingual large language model comprises: adjusting the parameters of the multilingual large language model to increase the first similarity metric and to reduce the second similarity metric (the transformer model weight parameters are adjusted to move the positive pairing closer together/increase similarity and to push apart negative examples/decrease similarity, Paragraphs 0026 and 0033).

Claim 8 is a system embodiment comprising one or more memory devices and one or more processors for carrying out the method of claim 1, and thus is rejected under similar rationale. Chen also discloses method implementation using the transformer model as program instructions stored in a memory along with a processor (Paragraphs 0062-0063). 
Note that claim 8 features an additional step/function relative to the claim 1 method that is also addressed via the teachings of Chen: process a query text in a language other than the first language through a combined model comprising the multilingual large language model and the vision encoder of the vision language model to determine one or more digital images corresponding to the query text (handling second-language (e.g., non-English) text queries using the transformer/LLM to determine at least one image object of interest with a highest similarity, Paragraphs 0019 and 0025-0026; Fig. 2A, Element 240).

Claim 9 contains subject matter similar to Claim 4, wherein the additional text caption/similarity metric of claim 9 corresponds to the second text embedding similarity with the first image embedding as a negative example, and thus is rejected under similar rationale.

Claim 14 is directed towards an embodiment of the invention in the form of a non-transitory computer-readable medium storing processor-executable instructions for carrying out the method of claim 1, and thus is rejected under similar rationale. Furthermore, Chen teaches method implementation as a computer program stored on a non-transitory computer-readable medium (Paragraphs 0070 and 0079).

Claim 15 contains subject matter similar to Claim 3 (incorporating the subject matter of claim 2 by virtue of its dependency), and thus is rejected under similar rationale.

Claim 16 contains subject matter similar to Claim 4, and thus is rejected under similar rationale. Moreover, the added limitation of "for a subsequent training iteration" is not an action positively recited as being performed when the program is executed by the processing device, and thus is not patentably limiting, though it should be noted that Chen describes a "repeated" training process at Paragraph 0033.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Chen, et al. in view of Oktay, et al. (PG Publication: 2025/0173613 A1).

With respect to Claim 5, Chen teaches the training method for the vision language model that generates image predictions from an input text as applied to Claim 1. Chen, however, does not explicitly teach the fine-tuning process set forth in claim 5. 
Oktay, however, discloses: generating translated text in a second language from supplemental text in the first language (augmenting document supplemental text via translation into a second natural language, Paragraphs 0010 and 0073); determining finetuning pairings between the translated text in the second language and finetuning images corresponding to the supplemental text in the first language (performing additional training of fine-tuned weights using pairs of image embeddings and second-language text passage embeddings, Paragraphs 0071, 0073, and 0077); determining finetuning similarity metrics between finetuning image embeddings for the finetuning images and finetuning text embeddings for the translated text (distance/similarity metric determined for weight fine-tuning, Paragraphs 0071, 0073, 0075, and 0077); and adjusting the parameters of the multilingual large language model to reduce the output of the contrastive loss function based on the finetuning similarity metrics without adjusting the parameters of the vision encoder (weight fine-tuning to minimize contrastive loss, wherein only the text LLM is trained independently, Paragraphs 0063, 0071, 0073, 0077, and 0150 (mentioning BERT, which is a specific type of LLM)).

Chen and Oktay are analogous art because they are from a similar field of endeavor in multi-lingual image prediction using transformer models. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date to utilize the iterative training taught by Oktay in the associative text and image training taught by Chen to provide a predictable result of gradually reducing prediction errors over iterations, thus improving model effectiveness (Oktay, Paragraph 0075).

Claim 10 contains subject matter similar to Claim 5, and thus is rejected under similar rationale. Moreover, note that the text in Oktay is a description or a caption of the image (Paragraph 0051). 
Claim 17 contains subject matter similar to Claim 5, and thus is rejected under similar rationale.

Claims 6, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Chen, et al. in view of Carlsson, et al. ("Cross-lingual and Multilingual CLIP," 2022).

With respect to Claim 6, Chen teaches the training method for the vision language model that generates image predictions from an input text as applied to Claim 1. Chen, however, does not teach the knowledge distillation training methodology that reduces an output of a mean-squared-error (MSE) loss function based on the text embeddings for the text and parallel text encodings of parallel text.

Carlsson, however, discloses: adjusting, utilizing knowledge distillation from the text encoder of the vision language model, the parameters of the multilingual large language model to reduce an output of a mean-squared-error loss function based on the text embeddings for the text and parallel text encodings of parallel text generated by the text encoder of the vision language model (creation of "language parallel data" via machine translation of a source text to generate matching embeddings, Section 3, Page 6849; then see knowledge distillation learning via teacher-student model learning that operates by "minimizing MSE" using parallel text encodings of "original texts" and "translated texts," Section 3.2, Page 6850; Equation 1).

Chen and Carlsson are analogous art because they are from a similar field of endeavor in multi-lingual image prediction using transformer models. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date to utilize the teacher-student knowledge distillation learning taught by Carlsson in the associative text and image training taught by Chen to provide a predictable result of achieving a smaller and more efficient student model specialized for a second language (Carlsson, Section 2.3, Page 6849). 
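The Carlsson-style distillation cited above trains a student text encoder to reproduce a frozen teacher's embeddings by minimizing MSE over parallel (original/translated) text. A toy NumPy sketch of that objective, modeling the student as a single linear layer purely for illustration (real encoders are transformers, and all names here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: X are student-side features for translated captions;
# teacher_emb are the frozen teacher's embeddings of the original
# captions (synthesized here as a noisy linear map of X).
X = rng.normal(size=(64, 8))
W_true = rng.normal(size=(8, 8))
teacher_emb = X @ W_true + 0.05 * rng.normal(size=(64, 8))

W = 0.1 * rng.normal(size=(8, 8))  # student parameters (the only thing trained)

def mse(a, b):
    return ((a - b) ** 2).mean()

# Plain gradient descent on the MSE distillation objective; the
# teacher targets are constants, mirroring the frozen CLIP text encoder.
for _ in range(300):
    pred = X @ W
    grad = 2 * X.T @ (pred - teacher_emb) / (X.shape[0] * teacher_emb.shape[1])
    W -= 0.5 * grad

final_loss = mse(X @ W, teacher_emb)
```

Only `W` is updated, so the student's embeddings drift toward the teacher's; on this synthetic data the loss settles near the injected noise floor.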
Claim 11 contains subject matter similar to Claim 6, and thus is rejected under similar rationale. Moreover, note that the text in Carlsson is a "caption" of the image (Section 3.2, Page 6850).

Claim 18 contains subject matter similar to Claim 6, and thus is rejected under similar rationale.

Claims 7, 12, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chen, et al. in view of Chen, et al. ("AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities," 2023; hereinafter Chen2).

With respect to Claim 7, Chen teaches the training method for the vision language model that generates image predictions from an input text as applied to Claim 1. Chen, however, does not teach the combination of data augmentation and reduction of training datasets as set forth in claim 7.

Chen2, however, discloses: augmenting a second-language batch of text in a second language by translating a set of text in the first language to generate translated text in the second language (see parallel sentences with a first source language (English) including a second target language (Chinese) and a third target language (Italian) in Fig. 1, wherein the parallel text is generated by augmenting a source language via machine translation, Sections 4.1 and 5.1, Pages 8668-8669); and reducing a third-language batch of text in a third language by omitting a subset of text of the third-language batch of text (note that the selection/collection of data "for each language" uses "the same amount of data," implying that the training data for a language such as Italian would have a reduced data selection to account for using "the same amount of data" for each language, Section 4.1, Pages 8668-8669).

Chen and Chen2 are analogous art because they are from a similar field of endeavor in multi-lingual image prediction using transformer models. 
Thus, it would have been obvious to one of ordinary skill in the art to utilize Chen2's data pre-processing of parallel text in the associative text and image training taught by Chen to provide a predictable result of adding additional training data for underrepresented languages and ensuring that certain languages are not overtrained and training remains balanced.

Claim 12 contains subject matter similar to Claim 7, and thus is rejected under similar rationale.

Claim 19 contains subject matter similar to Claim 7, and thus is rejected under similar rationale. Furthermore, note that the augmented text and reduced text are for text embeddings that are paired with images in Chen2 (Section 3.2, Page 8668).

With respect to Claim 20, Chen2 further discloses: The non-transitory computer-readable medium of claim 19, wherein the operations further comprise: determining an augmentation metric for the second language and a reduction metric for the third language; augmenting the second-language batch of text based on the augmentation metric; and reducing the third-language batch of text based on the reduction metric ("the same amount of data" for each language is used, wherein a language requiring machine-translation augmentation would have a number of samples to increase, whereas a language where only a subset is selected would have a number of training samples to decrease/not select, and wherein the operations of augmentation and subset selection leading to the same amount of data constitute the augmentation and reduction metrics, Sections 4.1 and 5.1, Pages 8668-8669; three languages shown in Fig. 1).

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Chen, et al. in view of Chen2, and further in view of Su (U.S. PG Publication: 2020/0320289 A1).

With respect to Claim 13, Chen in view of Chen2 teaches the multi-lingual transformer model training for image prediction based upon text as applied to Claim 12. 
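The "same amount of data for each language" preprocessing the rejection attributes to Chen2 can be sketched as a balancing pass: languages below the target count are augmented with machine-translated source samples, and languages above it are subsampled. The function name and the `translate` callback here are hypothetical stand-ins:

```python
import random

def balance_languages(batches, target, translate, source_lang="en"):
    """Return per-language sample lists, all of length `target`.

    batches   : dict mapping language code -> list of text samples
    target    : desired sample count per language
    translate : callback producing a machine translation of a
                source-language sample into the given language
    """
    source = batches[source_lang]
    balanced = {}
    for lang, samples in batches.items():
        if len(samples) < target:
            # Augment: translate source-language samples to fill the gap.
            needed = target - len(samples)
            extra = [translate(s, lang) for s in random.sample(source, needed)]
            balanced[lang] = samples + extra
        else:
            # Reduce: keep only a random subset of size `target`.
            balanced[lang] = random.sample(samples, target)
    return balanced
```

In this sketch, `needed / len(source)` roughly plays the role of claim 20's augmentation metric and `1 - target / len(samples)` the reduction metric, both driven by the common per-language target.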
While Chen2's disclosure of keeping "the same amount of data" implies that machine-translated data can increase one language's data pool by a certain percentage (e.g., going from 10 to 11 samples via an additional machine translation represents a 10% augmentation ratio, while eliminating 9 samples out of 20 to reach the same 11 samples represents a 45% reduction proportion for another language), Chen in view of Chen2 does not specifically mention the use of such resampling and reduction metrics based on the resampling ratio as set forth in claim 13.

Su, however, discloses a user-based percentage input that determines whether to augment the data by a certain percentage/ratio or to select only a certain percentage of text files (Paragraphs 0075-0078). Combined with the teachings of Chen2, where a "same amount" is reached for each language, this leads to a reduction metric directly related to the augmentation ratio, as both amounts attempt to reach the "same amount" for each language.

Chen, Chen2, and Su are analogous art because they are from a similar field of endeavor in data augmentation for machine learning model training. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date to apply the percentage metric to achieve "the same amount of data for each language" for Chen in view of Chen2, to provide the system with a specific input to reach the desired amount of training data that adds additional training data for underrepresented languages and ensures that certain languages are not overtrained and training remains balanced.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Li, et al. ("Translation-Enhanced Multilingual Text-to-Image Generation," 2023) - teaches a process for multilingual text-to-image generation using machine translation for generating target-language text (Section 3.1, Page 9176) and contrastive adversarial training (Section 3.3, Pages 9177-9178).

Lev-Tov, et al. 
(U.S. Patent: 10,445,431) - teaches training machine learning language models to minimize a distance between a language vector in a given language generated by a language model and an image vector generated by a vision model (Col. 4, Lines 37-49; Col. 6, Lines 22-47; and Col. 7, Lines 38-63).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAMES S WOZNIAK, whose telephone number is (571) 272-7632. The examiner can normally be reached 7-3, off alternate Fridays.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant may use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

JAMES S. WOZNIAK
Primary Examiner
Art Unit 2655

/JAMES S WOZNIAK/
Primary Examiner, Art Unit 2655

Prosecution Timeline

Jul 12, 2024
Application Filed
Mar 09, 2026
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597422
SPEAKING PRACTICE SYSTEM WITH RELIABLE PRONUNCIATION EVALUATION
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12586569
Knowledge Distillation with Domain Mismatch For Speech Recognition
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12511476
CONCEPT-CONDITIONED AND PRETRAINED LANGUAGE MODELS BASED ON TIME SERIES TO FREE-FORM TEXT DESCRIPTION GENERATION
Granted Dec 30, 2025 (2y 5m to grant)
Patent 12512100
AUTOMATED SEGMENTATION AND TRANSCRIPTION OF UNLABELED AUDIO SPEECH CORPUS
Granted Dec 30, 2025 (2y 5m to grant)
Patent 12475882
METHOD AND SYSTEM FOR AUTOMATIC SPEECH RECOGNITION (ASR) USING MULTI-TASK LEARNED (MTL) EMBEDDINGS
Granted Nov 18, 2025 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 59%
With Interview: 99% (+40.1%)
Median Time to Grant: 3y 7m
PTA Risk: Low
Based on 385 resolved cases by this examiner. Grant probability derived from career allow rate.
