Prosecution Insights
Last updated: April 19, 2026
Application No. 18/342,954

GENERATING TEXT PROMPTS FOR DIGITAL IMAGES UTILIZING VISION-LANGUAGE MODELS AND CONTEXTUAL PROMPT LEARNING

Status: Non-Final OA (§103)
Filed: Jun 28, 2023
Examiner: OPSASNICK, MICHAEL N
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Adobe Inc.
OA Round: 3 (Non-Final)
Grant Probability: 82% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 3m
With Interview: 92%

Examiner Intelligence

Career Allow Rate: 82% (737 granted / 900 resolved; +19.9% vs TC avg, above average)
Interview Lift: +10.5% for resolved cases with interview (moderate, roughly +10%)
Typical Timeline: 3y 3m average prosecution; 46 currently pending
Career History: 946 total applications across all art units

Statute-Specific Performance

§101: 17.7% (-22.3% vs TC avg)
§103: 33.0% (-7.0% vs TC avg)
§102: 29.9% (-10.1% vs TC avg)
§112: 6.3% (-33.7% vs TC avg)

Comparisons are against a Tech Center average estimate, based on career data from 900 resolved cases.
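The dashboard arithmetic above can be sanity-checked in a few lines of Python. We assume (this is our reading of the page, not a documented formula) that each "vs TC avg" delta is the examiner's rate minus the Tech Center average, and that the career allow rate is granted divided by resolved, rounded to a whole percent:

```python
# Sanity-check the dashboard figures under the assumptions stated above.
career = {"granted": 737, "resolved": 900}
allow_rate = 100 * career["granted"] / career["resolved"]
print(f"Career allow rate: {allow_rate:.1f}%")  # 81.9%, displayed as 82%

# Statute-specific rates and their displayed deltas vs the TC average.
statutes = {
    "101": (17.7, -22.3),
    "103": (33.0, -7.0),
    "102": (29.9, -10.1),
    "112": (6.3, -33.7),
}
for name, (rate, delta) in statutes.items():
    tc_avg = rate - delta  # recover the implied Tech Center average
    print(f"§{name}: examiner {rate}%, implied TC avg {tc_avg:.1f}%")

# Interview-adjusted grant probability: base rate plus the +10.5% lift.
print(f"With interview: {82 + 10.5:.1f}%")
```

Notably, all four statutes imply the same 40.0% Tech Center baseline, consistent with a single TC-wide estimate being used for every comparison; and 82% + 10.5% = 92.5%, which the page appears to round to the displayed 92%.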

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 02/13/2026 has been entered.

Allowable Subject Matter

Claim 3 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The following is a statement of reasons for the indication of allowable subject matter: As per dependent claim 3, the claim limitations toward selecting prompt context tokens from the distribution and utilizing the prompt context tokens as learnable parameters for the vision language model, in conjunction with the claim limitations in the independent claim, are not explicitly taught by the prior art of record.

Regarding the prior art of record, Khattak et al. (20240220722) teaches cross learning between the image representation vectors and the text representation vectors (para 0025; see para 0035, wherein the image prompt tokens are introduced to the language branch prompt learning framework; this process forces an alignment between the image tokens and text tokens, para 0036; see also the "coupling functions" as the alignment process, para 0029).
Li et al. (20220391755) teaches a multimodal encoder taking images and generating encoded image parameters, along with a separate text encoder generating text encoding embeddings, both of which are inserted into self-attention and cross-attention layers (see para 0073), using pretrained weights, generating token values representing the relationship between the image encoded information and the text encoding information, used in an MLP classifier (para 0074); see also para 0032 detailing the joint representation of the image/text pair. However, neither reference explicitly teaches the claim limitations of dependent claim 3 (with independent claim 1), as noted above.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, and 4-20 are rejected under 35 U.S.C. 103 as being unpatentable over Khattak et al. (20240220722) in view of Li et al. (20220391755).
As per claim 1, Khattak et al. (20240220722) teaches a computer-implemented method comprising: generating, utilizing an image encoder of a vision language machine learning model (as using vision language models, para 0025), image patch feature representations that represent patches from an input image (para 0025: the vision language models generate image-text pairs; the text portion is the patch feature representation; see further para 0034, wherein the image is split into "M" patch embeddings); initializing prompt context tokens from a distribution; generating alignment vectors by weighting the prompt context tokens according to the image patch feature representations (as weighting based on a loss function representing a differential between the prediction and ground-truth value, affecting the score; see para 0043 reflecting back on para 0042); generating, utilizing an attention layer of the vision language machine learning model, localized context tokens based on the alignment vectors and the prompt context tokens; and generating, utilizing a text encoder of the vision language machine learning model, a text caption describing the input image from the localized context tokens (as generating text from the image using text prompt tokens, para 0031).

Further to claim 1, Khattak et al. (20240220722) teaches an interaction between the image prompt tokens and the language prompt learning framework (see para 0035); the "coupling function" in para 0029 forces "an alignment" between the image tokens and text tokens (para 0036). However, Khattak et al. (20240220722) does not explicitly state the use of alignment vectors as a go-between among the image tokens and the text prompt tokens. Li et al. (20220391755) teaches a multimodal encoder taking images and generating encoded image parameters, along with a separate text encoder generating text encoding embeddings, both of which are inserted into self-attention and cross-attention layers (see para 0073), using pretrained weights, generating token values representing the relationship between the image encoded information and the text encoding information, used in an MLP classifier (para 0074); see also para 0032 detailing the joint representation of the image/text pair. Therefore, it would have been obvious to one of ordinary skill in the art of image/text encoding/pairing to modify the relationship calculations between text/image as taught in Khattak et al. (20240220722) with attention layer type calculations, as taught by Li et al. (20220391755), because it would advantageously add semantically similar image/texts to the image-text pair information (see Li et al. (20220391755), para 0061).

As per claim 2, Khattak et al. (20240220722) teaches the computer-implemented method of claim 1, wherein generating the image patch feature representations comprises: extracting the patches from the input image; generating, utilizing the image encoder, image patch feature vectors from the patches (as operating on patches from the images, para 0034, and generating patch feature vectors; see para 0037, patch embeddings and appended learnable class tokens); and generating, utilizing a neural network, conditional image patch tokens from the image patch feature vectors (generating image prompt tokens, para 0030).
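For readers less familiar with the claim language being mapped above, the claimed pipeline (patch feature representations from an image encoder, prompt context tokens initialized from a distribution, alignment vectors weighting those tokens against the patch features, and attention-derived localized context tokens) can be sketched with toy numpy arrays. This is an editorial illustration of the claim as we read it, not the applicant's or either reference's actual implementation; every shape, and the use of dot-product softmax attention, is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the claimed components (all shapes are illustrative).
M, D, K = 16, 32, 4   # M image patches, D-dim features, K prompt context tokens

# "Image encoder" output: one feature vector per patch of the input image.
patch_feats = rng.normal(size=(M, D))

# "Initializing prompt context tokens from a distribution" (here, Gaussian);
# per the claim these become learnable parameters of the vision-language model.
prompt_tokens = rng.normal(size=(K, D))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# "Alignment vectors": weight the prompt context tokens according to the
# image patch feature representations (plain dot-product attention scores).
align = softmax(patch_feats @ prompt_tokens.T)   # (M, K), rows sum to 1

# "Attention layer" producing localized context tokens: each patch mixes the
# prompt context tokens using its alignment weights.
localized = align @ prompt_tokens                # (M, D)

# A real system would feed these through a text encoder to emit a caption;
# here we only confirm the shapes line up.
print(align.shape, localized.shape)              # (16, 4) (16, 32)
```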
As per claim 4, Khattak et al. (20240220722) teaches the computer-implemented method of claim 3, wherein generating, utilizing the attention layer of the vision language machine learning model (see neural network layers, and the explanation of the attention layer, in the mapping of claim 1), the localized context tokens comprises generating context vectors for the patches from the input image by combining the alignment vectors and the prompt context tokens (as Khattak et al. (20240220722) cross-learns the transformed blocks across the vision/image and language/text blocks to align the image/vision and text/language representations, para 0025, and combines the word embeddings with the learnable text prompt tokens, para 0031; Li et al. (20220391755) teaches a multimodal encoder taking images and generating encoded image parameters, along with a separate text encoder generating text encoding embeddings, both of which are inserted into self-attention and cross-attention layers (see para 0073), using pretrained weights, generating token values representing the relationship between the image encoded information and the text encoding information, used in an MLP classifier (para 0074), as well as para 0032 detailing the joint representation of the image/text pair).

As per claim 5, Khattak et al. (20240220722) teaches the computer-implemented method of claim 4, wherein generating, utilizing the attention layer of the vision language machine learning model (see neural network layers, and the explanation of the attention layer, in the mapping of claim 1), the localized context tokens further comprises: combining the context vectors for the patches from the input image with the prompt context tokens to generate the localized context tokens (as combining the modified patch embedding with the text prompt tokens, para 0031).
As per claim 6, Khattak et al. (20240220722) teaches the computer-implemented method of claim 1, further comprising training the vision language machine learning model by: generating, utilizing the image encoder, an image feature vector of the input image (as generating image tokens from the patches, para 0029; see the "vision branch" with the image transformer layers, which generate tokens, para 0035 - para 0037); determining a measure of loss by comparing the text representation with the image feature vector (as generating loss functions, including cross entropy between the text prompt tokens and the image prompt tokens, para 0043); and modifying the prompt context tokens and weights of the attention layer of the vision language machine learning model based on the determined measure of loss (as backpropagating the learnable text/image tokens, para 0043).

As per claim 7, Khattak et al. (20240220722) teaches the computer-implemented method of claim 6, further comprising training the vision language machine learning model by generating the text representation, utilizing the text encoder of the vision language machine learning model, from the localized context tokens and a ground truth class corresponding to the input image (as comparing the image-text pairs to known ground-truth values, para 0043).

Claims 8-14 are system claims whose steps are performed by the method steps of claims 1-7 above; as such, claims 8-14 are similar in scope and content to claims 1-7 and are rejected under similar rationale as presented against claims 1-7 above. Furthermore, Khattak et al. (20240220722) teaches a processor executing instructions performing the disclosed steps (see para 0074 - 0079).
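The training described against claims 6 and 7 (generate an image feature vector, compare it with a text representation to obtain a measure of loss, then modify the prompt context tokens) can likewise be sketched as a single toy gradient step. The mean-pooled "text representation" and squared-error loss here are our simplifications for illustration; the cited art uses cross-entropy over image-text pairs (Khattak, para 0043).

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 4, 32

prompt_tokens = rng.normal(size=(K, D))   # learnable prompt context tokens
image_feat = rng.normal(size=D)           # image-encoder feature vector

def loss_fn(tokens):
    # Toy "text representation": mean of the context tokens.
    text_rep = tokens.mean(axis=0)
    # Measure of loss comparing the text representation with the image
    # feature vector (squared error in this sketch).
    return ((text_rep - image_feat) ** 2).sum()

# Closed-form gradient of the toy loss w.r.t. each token:
# d/d tokens[k, d] = 2 * (mean_d - image_feat[d]) / K.
grad = 2 * (prompt_tokens.mean(axis=0) - image_feat) / K
grad = np.broadcast_to(grad, (K, D))

before = loss_fn(prompt_tokens)
prompt_tokens = prompt_tokens - 0.1 * grad   # one gradient-descent update
after = loss_fn(prompt_tokens)
print(after < before)   # True: the update reduces the loss
```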
Further to claim 8, regarding generating the alignment vectors by weighting the prompt context tokens according to the image patch feature representations, see Khattak et al. (20240220722): weighting based on a loss function representing a differential between the prediction and ground-truth value, affecting the score (see para 0043 reflecting back on para 0042).

Claims 15-20 are non-transitory computer-readable-medium claims whose steps are performed by the method steps of claims 1-7 above; as such, claims 15-20 are similar in scope and content to claims 1-7 and are rejected under similar rationale as presented against claims 1-7 above. Furthermore, Khattak et al. (20240220722) teaches a processor executing instructions performing the disclosed steps (see para 0007). Likewise to claim 8 above, further to claim 15, regarding generating the alignment vectors by weighting the prompt context tokens according to the image patch feature representations, see Khattak et al. (20240220722): weighting based on a loss function representing a differential between the prediction and ground-truth value, affecting the score (see para 0043 reflecting back on para 0042).

Response to Arguments

Applicant's amendment to the abstract has been reviewed and accepted, and the objection is withdrawn. Applicant's arguments with respect to the claim(s) have been considered but are moot in view of the new grounds of rejection. The examiner notes the use of the Li et al. (20220391755) reference to address the further claim limitations toward attention layers used for alignment purposes between the context vectors and the image vectors. The majority of applicant's arguments are directed toward these features and the Khattak et al. (20240220722) reference. The examiner notes the indication of allowable subject matter in the claim limitations found in dependent claim 3. Furthermore, Fei et al. (20220383048) teaches self-attention layered networks as well (para 0042, 0054).
Gopalkrishna et al. (20230281963) teaches injection of an inferred text, as a caption, for the related image (see para 0034).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Please see related art listed on the PTO-892 form.

Mangla et al. (20240153258) teaches a multimodal network mapping image information to text/language (see Fig. 3).
Guo et al. (20240119257) teaches multimodal tying to image with text prompts (Fig. 3).
Gopalkrishna et al. (20230281963) teaches multilevel alignment for vision-language pretraining (abstract).
Li et al. (20220391755) teaches vision-language representation learning (abstract, Figure 2).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571) 272-7623, who is available Monday-Friday, 9am-5pm. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/Michael N Opsasnick/
Primary Examiner, Art Unit 2658
02/28/2026

Prosecution Timeline

Jun 28, 2023: Application Filed
Jun 05, 2025: Non-Final Rejection (§103)
Jul 02, 2025: Interview Requested
Aug 06, 2025: Examiner Interview Summary
Aug 06, 2025: Applicant Interview (Telephonic)
Aug 15, 2025: Response Filed
Nov 13, 2025: Final Rejection (§103)
Jan 09, 2026: Interview Requested
Jan 26, 2026: Applicant Interview (Telephonic)
Jan 29, 2026: Examiner Interview Summary
Feb 13, 2026: Request for Continued Examination
Feb 20, 2026: Response after Non-Final Action
Feb 28, 2026: Non-Final Rejection (§103)
Mar 24, 2026: Interview Requested
Apr 01, 2026: Applicant Interview (Telephonic)
Apr 01, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602554: SYSTEMS AND METHODS FOR PRODUCING RELIABLE TRANSLATION IN NEAR REAL-TIME (granted Apr 14, 2026; 2y 5m to grant)
Patent 12592246: SYSTEM AND METHOD FOR EXTRACTING HIDDEN CUES IN INTERACTIVE COMMUNICATIONS (granted Mar 31, 2026; 2y 5m to grant)
Patent 12586580: System For Recognizing and Responding to Environmental Noises (granted Mar 24, 2026; 2y 5m to grant)
Patent 12579995: Automatic Speech Recognition Accuracy With Multimodal Embeddings Search (granted Mar 17, 2026; 2y 5m to grant)
Patent 12567432: VOICE SIGNAL ESTIMATION METHOD AND APPARATUS USING ATTENTION MECHANISM (granted Mar 03, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 82% (92% with interview, a +10.5% lift)
Median Time to Grant: 3y 3m
PTA Risk: High

Based on 900 resolved cases by this examiner. Grant probability is derived from the career allow rate.
