Prosecution Insights
Last updated: April 19, 2026
Application No. 18/342,954

GENERATING TEXT PROMPTS FOR DIGITAL IMAGES UTILIZING VISION-LANGUAGE MODELS AND CONTEXTUAL PROMPT LEARNING

Status: Non-Final OA (§103)
Filed: Jun 28, 2023
Examiner: OPSASNICK, MICHAEL N
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Adobe Inc.
OA Round: 3 (Non-Final)
Grant Probability: 82% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 3m
With Interview: 92%

Examiner Intelligence

Career Allow Rate: 82% (737 granted / 900 resolved; +19.9% vs TC avg, above average)
Interview Lift: +10.5% for resolved cases with interview (moderate, roughly +10%)
Typical Timeline: 3y 3m average prosecution; 46 currently pending
Career History: 946 total applications across all art units

Statute-Specific Performance

§101: 17.7% (-22.3% vs TC avg)
§103: 33.0% (-7.0% vs TC avg)
§102: 29.9% (-10.1% vs TC avg)
§112: 6.3% (-33.7% vs TC avg)

Comparisons are against a Tech Center average estimate, based on career data from 900 resolved cases.
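The dashboard arithmetic above can be sanity-checked in a few lines of Python. We assume (this is our reading of the page, not a documented formula) that each "vs TC avg" delta is the examiner's rate minus the Tech Center average, and that the career allow rate is granted divided by resolved, rounded to a whole percent:

```python
# Sanity-check the dashboard figures under the assumptions stated above.
career = {"granted": 737, "resolved": 900}
allow_rate = 100 * career["granted"] / career["resolved"]
print(f"Career allow rate: {allow_rate:.1f}%")  # 81.9%, displayed as 82%

# Statute-specific rates and their displayed deltas vs the TC average.
statutes = {
    "101": (17.7, -22.3),
    "103": (33.0, -7.0),
    "102": (29.9, -10.1),
    "112": (6.3, -33.7),
}
for name, (rate, delta) in statutes.items():
    tc_avg = rate - delta  # recover the implied Tech Center average
    print(f"§{name}: examiner {rate}%, implied TC avg {tc_avg:.1f}%")

# Interview-adjusted grant probability: base rate plus the +10.5% lift.
print(f"With interview: {82 + 10.5:.1f}%")
```

Notably, all four statutes imply the same 40.0% Tech Center baseline, consistent with a single TC-wide estimate being used for every comparison; and 82% + 10.5% = 92.5%, which the page appears to round to the displayed 92%.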

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 02/13/2026 has been entered.

Allowable Subject Matter

Claim 3 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The following is a statement of reasons for the indication of allowable subject matter: As per dependent claim 3, the claim limitations toward selecting prompt context tokens from the distribution and utilizing the prompt context tokens as learnable parameters for the vision language model, in conjunction with the claim limitations in the independent claim, are not explicitly taught by the prior art of record.

Regarding the prior art of record, Khattak et al. (20240220722) teaches cross learning between the image representation vectors and the text representation vectors (para 0025; see para 0035, wherein the image prompt tokens are introduced to the language branch prompt learning framework; this process forces an alignment between the image tokens and text tokens, para 0036; see also the "coupling functions" as the alignment process, para 0029).
Li et al. (20220391755) teaches a multimodal encoder taking images and generating encoded image parameters, along with a separate text encoder generating text encoding embeddings, both of which are inserted into self-attention and cross-attention layers (see para 0073), using pretrained weights, generating token values representing the relationship between the image encoded information and the text encoding information, used in an MLP classifier (para 0074); see also para 0032 detailing the joint representation of the image/text pair. However, neither reference explicitly teaches the claim limitations of dependent claim 3 (with independent claim 1), as noted above.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, and 4-20 are rejected under 35 U.S.C. 103 as being unpatentable over Khattak et al. (20240220722) in view of Li et al. (20220391755).
As per claim 1, Khattak et al. (20240220722) teaches a computer-implemented method comprising: generating, utilizing an image encoder of a vision language machine learning model (as using vision language models, para 0025), image patch feature representations that represent patches from an input image (para 0025: the vision language models generate image-text pairs; the text portion is the patch feature representation; see further para 0034, wherein the image is split into "M" patch embeddings); initializing prompt context tokens from a distribution; generating alignment vectors by weighting the prompt context tokens according to the image patch feature representations (as weighting based on a loss function representing a differential between the prediction and ground-truth value, affecting the score; see para 0043 reflecting back on para 0042); generating, utilizing an attention layer of the vision language machine learning model, localized context tokens based on the alignment vectors and the prompt context tokens; and generating, utilizing a text encoder of the vision language machine learning model, a text caption describing the input image from the localized context tokens (as generating text from the image using text prompt tokens, para 0031).

Further to claim 1, Khattak et al. (20240220722) teaches an interaction between the image prompt tokens and the language prompt learning framework (see para 0035); the "coupling function" in para 0029 forces "an alignment" between the image tokens and text tokens (para 0036). However, Khattak et al. (20240220722) does not explicitly state the use of alignment vectors as a go-between among the image tokens and the text prompt tokens. Li et al. (20220391755) teaches a multimodal encoder taking images and generating encoded image parameters, along with a separate text encoder generating text encoding embeddings, both of which are inserted into self-attention and cross-attention layers (see para 0073), using pretrained weights, generating token values representing the relationship between the image encoded information and the text encoding information, used in an MLP classifier (para 0074); see also para 0032 detailing the joint representation of the image/text pair. Therefore, it would have been obvious to one of ordinary skill in the art of image/text encoding/pairing to modify the relationship calculations between text/image as taught in Khattak et al. (20240220722) with attention layer type calculations, as taught by Li et al. (20220391755), because it would advantageously add semantically similar image/texts to the image-text pair information (see Li et al. (20220391755), para 0061).

As per claim 2, Khattak et al. (20240220722) teaches the computer-implemented method of claim 1, wherein generating the image patch feature representations comprises: extracting the patches from the input image; generating, utilizing the image encoder, image patch feature vectors from the patches (as operating on patches from the images, para 0034, and generating patch feature vectors; see para 0037, patch embeddings and appended learnable class tokens); and generating, utilizing a neural network, conditional image patch tokens from the image patch feature vectors (generating image prompt tokens, para 0030).
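For readers less familiar with the claim language being mapped above, the claimed pipeline (patch feature representations from an image encoder, prompt context tokens initialized from a distribution, alignment vectors weighting those tokens against the patch features, and attention-derived localized context tokens) can be sketched with toy numpy arrays. This is an editorial illustration of the claim as we read it, not the applicant's or either reference's actual implementation; every shape, and the use of dot-product softmax attention, is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the claimed components (all shapes are illustrative).
M, D, K = 16, 32, 4   # M image patches, D-dim features, K prompt context tokens

# "Image encoder" output: one feature vector per patch of the input image.
patch_feats = rng.normal(size=(M, D))

# "Initializing prompt context tokens from a distribution" (here, Gaussian);
# per the claim these become learnable parameters of the vision-language model.
prompt_tokens = rng.normal(size=(K, D))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# "Alignment vectors": weight the prompt context tokens according to the
# image patch feature representations (plain dot-product attention scores).
align = softmax(patch_feats @ prompt_tokens.T)   # (M, K), rows sum to 1

# "Attention layer" producing localized context tokens: each patch mixes the
# prompt context tokens using its alignment weights.
localized = align @ prompt_tokens                # (M, D)

# A real system would feed these through a text encoder to emit a caption;
# here we only confirm the shapes line up.
print(align.shape, localized.shape)              # (16, 4) (16, 32)
```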
As per claim 4, Khattak et al. (20240220722) teaches the computer-implemented method of claim 3, wherein generating, utilizing the attention layer of the vision language machine learning model (see neural network layers, and the explanation of the attention layer, in the mapping of claim 1), the localized context tokens comprises generating context vectors for the patches from the input image by combining the alignment vectors and the prompt context tokens (as Khattak et al. (20240220722) cross-learns the transformed blocks across the vision/image and language/text blocks to align the image/vision and text/language representations, para 0025, and combines the word embeddings with the learnable text prompt tokens, para 0031; Li et al. (20220391755) teaches a multimodal encoder taking images and generating encoded image parameters, along with a separate text encoder generating text encoding embeddings, both of which are inserted into self-attention and cross-attention layers (see para 0073), using pretrained weights, generating token values representing the relationship between the image encoded information and the text encoding information, used in an MLP classifier (para 0074), as well as para 0032 detailing the joint representation of the image/text pair).

As per claim 5, Khattak et al. (20240220722) teaches the computer-implemented method of claim 4, wherein generating, utilizing the attention layer of the vision language machine learning model (see neural network layers, and the explanation of the attention layer, in the mapping of claim 1), the localized context tokens further comprises: combining the context vectors for the patches from the input image with the prompt context tokens to generate the localized context tokens (as combining the modified patch embedding with the text prompt tokens, para 0031).
As per claim 6, Khattak et al. (20240220722) teaches the computer-implemented method of claim 1, further comprising training the vision language machine learning model by: generating, utilizing the image encoder, an image feature vector of the input image (as generating image tokens from the patches, para 0029; see the "vision branch" with the image transformer layers, which generate tokens, para 0035 - para 0037); determining a measure of loss by comparing the text representation with the image feature vector (as generating loss functions, including cross entropy between the text prompt tokens and the image prompt tokens, para 0043); and modifying the prompt context tokens and weights of the attention layer of the vision language machine learning model based on the determined measure of loss (as backpropagating the learnable text/image tokens, para 0043).

As per claim 7, Khattak et al. (20240220722) teaches the computer-implemented method of claim 6, further comprising training the vision language machine learning model by generating the text representation, utilizing the text encoder of the vision language machine learning model, from the localized context tokens and a ground truth class corresponding to the input image (as comparing the image-text pairs to known ground-truth values, para 0043).

Claims 8-14 are system claims whose steps are performed by the method steps of claims 1-7 above; as such, claims 8-14 are similar in scope and content to claims 1-7 and are rejected under similar rationale as presented against claims 1-7 above. Furthermore, Khattak et al. (20240220722) teaches a processor executing instructions performing the disclosed steps (see para 0074 - 0079).
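The training described against claims 6 and 7 (generate an image feature vector, compare it with a text representation to obtain a measure of loss, then modify the prompt context tokens) can likewise be sketched as a single toy gradient step. The mean-pooled "text representation" and squared-error loss here are our simplifications for illustration; the cited art uses cross-entropy over image-text pairs (Khattak, para 0043).

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 4, 32

prompt_tokens = rng.normal(size=(K, D))   # learnable prompt context tokens
image_feat = rng.normal(size=D)           # image-encoder feature vector

def loss_fn(tokens):
    # Toy "text representation": mean of the context tokens.
    text_rep = tokens.mean(axis=0)
    # Measure of loss comparing the text representation with the image
    # feature vector (squared error in this sketch).
    return ((text_rep - image_feat) ** 2).sum()

# Closed-form gradient of the toy loss w.r.t. each token:
# d/d tokens[k, d] = 2 * (mean_d - image_feat[d]) / K.
grad = 2 * (prompt_tokens.mean(axis=0) - image_feat) / K
grad = np.broadcast_to(grad, (K, D))

before = loss_fn(prompt_tokens)
prompt_tokens = prompt_tokens - 0.1 * grad   # one gradient-descent update
after = loss_fn(prompt_tokens)
print(after < before)   # True: the update reduces the loss
```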
Further to claim 8, regarding generating the alignment vectors by weighting the prompt context tokens according to the image patch feature representations, see Khattak et al. (20240220722): weighting based on a loss function representing a differential between the prediction and ground-truth value, affecting the score (see para 0043 reflecting back on para 0042).

Claims 15-20 are non-transitory computer-readable-medium claims whose steps are performed by the method steps of claims 1-7 above; as such, claims 15-20 are similar in scope and content to claims 1-7 and are rejected under similar rationale as presented against claims 1-7 above. Furthermore, Khattak et al. (20240220722) teaches a processor executing instructions performing the disclosed steps (see para 0007). Likewise to claim 8 above, further to claim 15, regarding generating the alignment vectors by weighting the prompt context tokens according to the image patch feature representations, see Khattak et al. (20240220722): weighting based on a loss function representing a differential between the prediction and ground-truth value, affecting the score (see para 0043 reflecting back on para 0042).

Response to Arguments

Applicant's amendment to the abstract has been reviewed and accepted, and the objection is withdrawn. Applicant's arguments with respect to the claim(s) have been considered but are moot in view of the new grounds of rejection. The examiner notes the use of the Li et al. (20220391755) reference to address the further claim limitations toward attention layers used for alignment purposes between the context vectors and the image vectors. The majority of applicant's arguments are directed toward these features and the Khattak et al. (20240220722) reference. The examiner notes the indication of allowable subject matter in the claim limitations found in dependent claim 3. Furthermore, Fei et al. (20220383048) teaches self-attention layered networks as well (para 0042, 0054).
Gopalkrishna et al. (20230281963) teaches injection of an inferred text, as a caption, for the related image (see para 0034).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Please see related art listed on the PTO-892 form.

Mangla et al. (20240153258) teaches a multimodal network mapping image information to text/language (see Fig. 3).
Guo et al. (20240119257) teaches multimodal tying to image with text prompts (Fig. 3).
Gopalkrishna et al. (20230281963) teaches multilevel alignment for vision-language pretraining (abstract).
Li et al. (20220391755) teaches vision-language representation learning (abstract, Figure 2).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571) 272-7623, who is available Monday-Friday, 9am-5pm. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/Michael N Opsasnick/
Primary Examiner, Art Unit 2658
02/28/2026

Prosecution Timeline

Jun 28, 2023: Application Filed
Jun 05, 2025: Non-Final Rejection (§103)
Jul 02, 2025: Interview Requested
Aug 06, 2025: Examiner Interview Summary
Aug 06, 2025: Applicant Interview (Telephonic)
Aug 15, 2025: Response Filed
Nov 13, 2025: Final Rejection (§103)
Jan 09, 2026: Interview Requested
Jan 26, 2026: Applicant Interview (Telephonic)
Jan 29, 2026: Examiner Interview Summary
Feb 13, 2026: Request for Continued Examination
Feb 20, 2026: Response after Non-Final Action
Feb 28, 2026: Non-Final Rejection (§103)
Mar 24, 2026: Interview Requested
Apr 01, 2026: Applicant Interview (Telephonic)
Apr 01, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602554: SYSTEMS AND METHODS FOR PRODUCING RELIABLE TRANSLATION IN NEAR REAL-TIME (granted Apr 14, 2026; 2y 5m to grant)
Patent 12592246: SYSTEM AND METHOD FOR EXTRACTING HIDDEN CUES IN INTERACTIVE COMMUNICATIONS (granted Mar 31, 2026; 2y 5m to grant)
Patent 12586580: System For Recognizing and Responding to Environmental Noises (granted Mar 24, 2026; 2y 5m to grant)
Patent 12579995: Automatic Speech Recognition Accuracy With Multimodal Embeddings Search (granted Mar 17, 2026; 2y 5m to grant)
Patent 12567432: VOICE SIGNAL ESTIMATION METHOD AND APPARATUS USING ATTENTION MECHANISM (granted Mar 03, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 82% (92% with interview, a +10.5% lift)
Median Time to Grant: 3y 3m
PTA Risk: High

Based on 900 resolved cases by this examiner. Grant probability is derived from the career allow rate.
