Last updated: May 29, 2026

Application No. 18/397,688

METHOD AND APPARATUS FOR VISION-LANGUAGE UNDERSTANDING

Non-Final OA §102

Filed

Dec 27, 2023

Priority

Sep 22, 2022 — DE 20220100775 +2 more

Examiner

CRUZ, IRIANA

Art Unit

2681

Tech Center

2600 — Communications

Assignee

Samsung Electronics Co., Ltd.

OA Round

1 (Non-Final)

Interview Optional

Interview lift (+9.5%) is below the 15.0% threshold. A written response is recommended.

Based on 742 resolved cases, 2023–2026

Examiner Intelligence

CRUZ, IRIANA

Grants 81% — above average

Career Allowance Rate

604 granted / 742 resolved

+19.4% vs TC avg

Moderate +10% lift

+9.5%

Interview Lift

resolved cases with interview

Typical timeline

2y 9m

Avg Prosecution

18 currently pending

Career history

777

Total Applications

across all art units

Statute-Specific Performance

§101

1.7%

-38.3% vs TC avg

§103

79.7%

+39.7% vs TC avg

§102

15.0%

-25.0% vs TC avg

§112

1.9%

-38.1% vs TC avg

Based on career data from 742 resolved cases

Office Action

§102

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Election/Restrictions
Claims 9-15 are withdrawn from further consideration pursuant to 37 CFR 1.142(b), as being drawn to a nonelected species, there being no allowable generic or linking claim. Applicant timely traversed the restriction (election) requirement in the reply filed on 02/19/2026.
Significant search and consideration must already be performed for a single elected invention. The current application’s specification describes figure 3 as an example of how the ML model may be used/implemented (implying one of a plurality of instances). Creating, training (Figures 1-2), and implementing (figure 3) a machine learning model are each, in their own right, mutually exclusive. While they overlap in scope, the search and consideration are different. Combining them would exponentially increase the required time and complexity, hindering prosecution. The restriction is upheld.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-8 and 16-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Vu et al. (US 2024/0020546 A1).
With respect to Claim 1, Vu’546 shows a computer-implemented method for training a vision-language machine learning, ML, model (Figure 2, paragraphs [0092] and [0105] the model 230 can be an image processing and/or computer vision model) to classify images depicting novel or known classes (paragraph [0104] describes first training dataset 210 associated with a specific task such as a classification task for training a prompt), the method comprising: 
obtaining a first training dataset (figure 2, first training dataset 210) comprising a plurality of class names (paragraphs [0092] and [0104] classification of different object classes); and 
training the vision-language ML model by: 
generating, for each class name in the first training dataset (figure 2, first training dataset 210), at least one augmented textual prompt (paragraph [0103] prompt 202) to condition the vision-language ML model (figure 2 pre-trained machine learned model 230) to output a class name for an object detected in an image (paragraphs [0092] and [0196]); 
inputting the at least one augmented textual prompt (202) into a frozen text encoder (paragraph [0105]) of the vision-language ML model (Figure 2, 202 to 230); 
outputting, from the frozen text encoder, a first text embedding for each augmented textual prompt, the first text embedding representing the class name in the augmented textual prompt (Figure 3 and paragraphs [0111]-[0113] an arrangement in which embedding 308 is a generated target which is adjusted by a loss function 350 evaluation of the output 326 from the model 330); 
generating a plurality of first inputs by concatenating each learnable soft prompt from a plurality of learnable soft prompts, to each class name from the first training dataset (paragraph [0220] prompt tuning can be a more efficient and effective method for conditioning frozen models using tunable soft prompts. Similar to engineered text prompts, soft prompts can be concatenated to the input text); 
inputting the class names from the first training dataset (210) and the generated plurality of first inputs into the frozen text encoder of the vision-language ML model (230) (paragraph [0220] and figure 2); 
outputting (figure 2 first output 216), from the frozen text encoder of the vision-language ML model (230), a second text embedding (figure 2 second embedding 208) for each first input, the second text embedding representing the class name in each first input (paragraph [0107] the training loop can be repeated for a portion of the second training dataset 220 to generate the second embedding 208); and 
minimizing a cross-entropy text-to-text loss between the first text embeddings and the second text embeddings (Figure 3, loss function 350 with backpropagation to the target embedding (first 206 and second 208 embeddings); paragraph [0078] regarding cross-entropy loss).
With respect to Claim 2, Vu’546 shows the method as claimed in claim 1, wherein minimizing a cross-entropy text-to-text loss between the first text embeddings and the second text embeddings comprises adjusting the learnable soft prompts so that, for each class name, the second text embedding is similar to the first text embedding (paragraph [0108] first embedding 206 can be determined to be similar to the second embedding 208. The first prompt 202 can then be obtained from the prompt database 240 to initialize the training of the second prompt 204).
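The loss recited in claims 1 and 2 can be illustrated with a minimal sketch. It reflects one plausible reading of the claim language only, not the applicant's specification or the Vu reference: each second text embedding (soft prompt plus class name) is scored against the first text embeddings (augmented prompts) of every class, and cross-entropy rewards a match with its own class. The function names, the temperature value, and the toy two-dimensional embeddings are assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_text_to_text(first_embeddings, second_embeddings, temperature=0.07):
    """Mean cross-entropy over classes.

    first_embeddings[i]  : embedding of the augmented prompt for class i (target)
    second_embeddings[i] : embedding of soft prompt + class name i (to be adjusted)
    The temperature is an assumed hyperparameter, not taken from the source.
    """
    targets = [normalize(e) for e in first_embeddings]
    total = 0.0
    for i, emb in enumerate(second_embeddings):
        q = normalize(emb)
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the correct class
    return total / len(second_embeddings)

# Toy embeddings (hypothetical): two classes in a 2-D embedding space.
first = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # second embeddings close to their own class
swapped = [[0.1, 0.9], [0.9, 0.1]]   # second embeddings close to the wrong class

loss_aligned = cross_entropy_text_to_text(first, aligned)
loss_swapped = cross_entropy_text_to_text(first, swapped)
```

In an actual training loop the text encoder would stay frozen and only the soft prompts would receive gradients, so lowering this loss pulls each second embedding toward the first embedding of the same class, as claim 2 recites.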
With respect to Claim 3, Vu’546 shows the method as claimed in claim 2, wherein generating at least one augmented textual prompt comprises: selecting at least one manually-defined augmentation template (defined by the current application’s originally filed specification as a prompt in paragraph [0071] of published specification) from a plurality of augmentation templates, each augmentation template being a text phrase into which a class name is insertable (paragraphs [0176] and [0219] text prompts may be manual); and inserting a class name from the first training dataset into the selected at least one augmentation template, thereby generating at least one augmented textual prompt (Figure 3, paragraphs [0113]-[0114] prompt database 340 queried by target embedding 308 for initializing training of prompt 307).
With respect to Claim 4, Vu’546 shows the method as claimed in claim 3, wherein selecting at least one augmentation template comprises selecting at least one group of augmentation templates (Figure 3, paragraphs [0113]-[0114] prompt database 340 (group of prompts/augmented templates) queried by target embedding 308 for initializing training of prompt 307).
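The template step recited in claims 3 and 4 amounts to string substitution, which a short sketch makes concrete. The template phrases, the group name, and the function name below are hypothetical; neither the office action nor the cited reference lists specific templates.

```python
# Hypothetical groups of manually defined templates; "{}" marks where the class name goes.
TEMPLATE_GROUPS = {
    "photo": [
        "a photo of a {}.",
        "a cropped photo of a {}.",
        "a drawing of a {}.",
    ],
}

def augment_prompts(class_names, templates):
    """Insert each class name into each selected template."""
    return {name: [t.format(name) for t in templates] for name in class_names}

# Select one group of templates (claim 4) and fill in the class names (claim 3).
prompts = augment_prompts(["cat", "dog"], TEMPLATE_GROUPS["photo"])
```

Each class name yields one augmented textual prompt per selected template, and each of those prompts would then be passed to the frozen text encoder.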
With respect to Claim 5, Vu’546 shows the method as claimed in claim 4, further comprising: obtaining a second training dataset (figure 2 230) comprising a plurality of data pairs (222+224), each data pair comprising an image depicting an object and a class name for the object (paragraph [0196]); wherein training the vision-language ML model further comprises: generating a plurality of second inputs by concatenating each learnable soft prompt from the plurality of learnable soft prompts, to each class name from the data pairs in the second training dataset (220) (paragraph [0220]); inputting the class names from the second training dataset (220) and the generated plurality of second inputs into the frozen text encoder of the vision-language ML model (230) (paragraph [0220]); outputting (226), from the frozen text encoder of the vision-language ML model (230), a third text embedding (306) for each second input, the third text embedding representing the class name in each second input (paragraph [0114]); inputting into an image encoder of the vision-language ML model (230) the images in each data pair of the second training dataset (220) (figure 2 220 to 230); outputting (326), from the image encoder (230/330), an image embedding for the object in each input image (paragraph [0091] task for embedding visual input data and paragraph [0092] for image classification of objects); and minimizing a cross-entropy image-to-text loss (350) between the third text embeddings and the image embeddings (paragraph [0078]).
With respect to Claim 6, Vu’546 shows the method as claimed in claim 5, wherein training the vision-language ML model by minimizing a cross-entropy image-to-text loss (350) between the third text embeddings (306) and the image embeddings (paragraph [0091]) comprises fine-tuning layer normalisations (paragraph [0155] describes fine-tuning all model parameters, read as a description of normalization as supported by paragraph [0080] of applicant’s published specification) of the image encoder of the ML model (230), to thereby train the image encoder to, for each data pair (222+224), output image embeddings that are similar to the third text embeddings (306) (paragraph [0113]).
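The claim 6 limitation turns on which parameters are updated during training. The sketch below shows what restricting fine-tuning to layer-normalisation parameters could look like, as opposed to updating every parameter. The parameter names and the `layer_norm` naming convention are hypothetical and chosen only for illustration; real encoders name their parameters differently.

```python
# Hypothetical parameter names for an image encoder.
PARAM_NAMES = [
    "blocks.0.attn.qkv.weight",
    "blocks.0.layer_norm1.weight",
    "blocks.0.layer_norm1.bias",
    "blocks.0.mlp.fc1.weight",
    "final_layer_norm.weight",
]

def split_trainable(param_names, keyword="layer_norm"):
    """Return (trainable, frozen): only names containing the keyword are trained."""
    trainable = [n for n in param_names if keyword in n]
    frozen = [n for n in param_names if keyword not in n]
    return trainable, frozen

trainable, frozen = split_trainable(PARAM_NAMES)
```

Under this reading, the attention and MLP weights stay frozen while only the normalisation scales and biases are adjusted by the image-to-text loss.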
With respect to Claim 7, Vu’546 shows the method as claimed in claim 6, wherein training the vision-language ML model further comprises reducing an impact of data distribution shift by: learning an offset at the output of the text encoder for realigning the vision and text encoders; and adding the offset to weights of the frozen text encoder (paragraph [0215]).
With respect to Claim 8, Vu’546 shows the method as claimed in claim 7, wherein training the vision-language ML model by minimizing a cross-entropy image-to-text loss between the third text embeddings and the image embeddings comprises adjusting the learnable soft prompts used to generate the second inputs into the frozen text encoder (230), so that, for each data pair (222+224), the third text embedding (306) is similar to the image embedding (paragraph [0113]). 
With respect to Claim 16, rejections analogous to those presented for claim 1 are applicable.
With respect to Claim 17, rejections analogous to those presented for claim 2 are applicable.
With respect to Claim 18, rejections analogous to those presented for claim 3 are applicable.
With respect to Claim 19, rejections analogous to those presented for claim 4 are applicable.
With respect to Claim 20, rejections analogous to those presented for claim 5 are applicable.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Oktay et al. (US 2025/0173613 A1): paragraphs [0008]-[0010] show performing first training of the text model, wherein the first training comprises: inputting each first text passage into the text model in order to generate a respective value of a first text embedding, inputting each second text passage into the text model in order to generate a corresponding value of a second text embedding, and training the text model to minimize a measure of statistical difference between the value of the first text embedding and the corresponding value of the second text embedding over the plurality of text passage combinations.
Zheng et al. (US 2023/0230571 A1) shows in paragraphs [0100]-[0101] that the neural network model further includes a text encoder and a second classifier. In order to maximize information that can be shared between different objects, namely, the text encoder is shared by all the objects, an adversarial training mechanism is introduced into the text encoder, namely, a second classifier with a gradient reversal layer is added after the text encoder to prevent text encoding from capturing object information. The text sample is encoded by the text encoder to obtain a content embedding vector of the text sample, object prediction is performed on the content embedding vector of the text sample by the second classifier to obtain a predicted object of the text sample, a fourth loss function is constructed based on the predicted object of the text sample and an object tag of the object sample, and the fourth loss function is reversed to obtain a second loss function of the neural network model. By the adversarial training mechanism, the text encoding is prevented from capturing object information, so as to separate the text from the object information, decouple the text from the object information, improve the accuracy of the content embedding vector, and avoid coupling with other information.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to IRIANA CRUZ whose telephone number is (571)270-3246. The examiner can normally be reached 10-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Akwasi M. Sarpong can be reached at (571) 270-3438. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/IRIANA CRUZ/Primary Examiner, Art Unit 2681

Prosecution Timeline

Dec 27, 2023

Application Filed

May 06, 2026

Non-Final Rejection mailed — §102 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/475,810

Patent 12639979

BIOMETRIC IDENTIFICATION IN A VEHICLE ENVIRONMENT

2y 8m to grant Granted May 26, 2026

18/104,987

Patent 12629135

Cardiac imaging machinery based on ultrasound methods

3y 3m to grant Granted May 19, 2026

18/548,841

Patent 12633156

METHOD, COMPUTER PROGRAM, STORAGE MEDIUM, PERSON DETECTOR AND MONITORING ARRANGEMENT FOR PERSON DETECTION

2y 8m to grant Granted May 19, 2026

17/880,829

Patent 12625654

IMAGE FORMING APPARATUS, NON-TRANSITORY COMPUTER READABLE MEDIUM, AND IMAGE FORMING METHOD

3y 9m to grant Granted May 12, 2026

18/084,531

Patent 12622658

ACTUATION METHOD FOR X-RAY DEVICE AND X-RAY DEVICE

3y 4m to grant Granted May 12, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2

Expected OA Rounds

81%

Grant Probability

91%

With Interview (+9.5%)

2y 9m (~4m remaining)

Median Time to Grant

Low

PTA Risk

Based on 742 resolved cases by this examiner. Grant probability derived from career allowance rate.
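
The headline figures on this page can be reproduced from the counts it reports. The sketch below assumes that the "With Interview" projection is simply the career allowance rate plus the interview lift, and that the recommendation compares the lift against the 15.0% threshold; the page does not state either rule explicitly, and the variable names are illustrative.

```python
# Counts and percentages as reported on this page.
granted, resolved = 604, 742
interview_lift = 0.095        # +9.5% interview lift
lift_threshold = 0.15         # 15.0% threshold for recommending an interview

# Career allowance rate, used as the grant probability.
allowance_rate = granted / resolved

# Assumed additive model for the "With Interview" projection.
with_interview = allowance_rate + interview_lift

# An interview is recommended only if the lift reaches the threshold.
recommend_interview = interview_lift >= lift_threshold

print(f"Grant probability: {allowance_rate:.0%}")
print(f"With interview:    {with_interview:.0%}")
print(f"Recommend interview: {recommend_interview}")
```

Under these assumptions the rounded results match the 81% and 91% shown above, and the lift falls short of the threshold, which is consistent with the written-response recommendation.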