DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 1 is objected to because of the following informalities:
Claim 1, 6th limitation, recites “and wherein each text embedding set comprises a particular text embedding”; it should read “and wherein the each text embedding set comprises a particular text embedding” because “each text embedding set” is previously defined in the preceding limitation. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 17 – 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 17 recites the limitation “an image embedding model” in the 7th limitation. There is insufficient antecedent basis for this limitation in the claim because “an image embedding model” was previously introduced in the 5th limitation for generating an input image embedding. It is unclear to one of ordinary skill in the art whether the same image embedding model or a different image embedding model is being used to generate the input image embedding and the control image embedding. Appropriate correction is required.
Claims 18 – 20 are rejected for being dependent on rejected base claim 17.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1 – 21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The limitations, under their broadest reasonable interpretation, cover mental processes (concepts performed in the human mind, including observation, evaluation, judgment, and opinion) as well as mathematical concepts and calculations. The claims recite a system, a method, and computer-readable storage media configured to classify an image. This judicial exception is not integrated into a practical application because the steps do not add meaningful limitations that would show the judicial exception is specifically applied to a particular technological problem to be solved. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the steps of the claimed invention can be performed mentally, and no additional features in the claims would preclude them from being performed as such, except for generic computer elements recited at a high level of generality (i.e., processor, memory).
According to the USPTO guidelines, a claim is directed to non-statutory subject matter if:
• STEP 1: the claim does not fall within one of the four statutory categories of invention (process, machine, manufacture or composition of matter), or
• STEP 2: the claim recites a judicial exception, e.g. an abstract idea, without reciting additional elements that amount to significantly more than the judicial exception, as determined using the following analysis:
o STEP 2A (PRONG 1): Does the claim recite an abstract idea, law of nature, or natural phenomenon?
o STEP 2A (PRONG 2): Does the claim recite additional elements that integrate the judicial exception into a practical application?
o STEP 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
Using the two-step inquiry, it is clear that claims 1, 11, 17 and 21 are directed to an abstract idea, as shown below:
STEP 1: Do the claims fall within one of the four statutory categories? YES. Claims 1 and 21 are directed to a system, claim 11 to a method, and claim 17 to an article of manufacture (non-transitory computer-readable media).
STEP 2A (PRONG 1): Is the claim directed to a law of nature, a natural phenomenon or an abstract idea? YES, the claims recite steps that fall into the abstract idea category of mental processes.
With regard to STEP 2A (PRONG 1), the guidelines provide three groupings of subject matter that are considered abstract ideas:
• Mathematical concepts – mathematical relationships, mathematical formulas or equations, mathematical calculations;
• Certain methods of organizing human activity – fundamental economic principles or practices (including hedging, insurance, mitigating risk); commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; business relations); managing personal behavior or relationships or interactions between people (including social activities, teaching, and following rules or instructions); and
• Mental processes – concepts that are practically performed in the human mind (including an observation, evaluation, judgment, opinion).
The systems of claims 1 and 21, the method of claim 11, and the computer-readable media of claim 17 comprise a mental process that can be practically performed in the human mind (or by generic computers or components configured to perform the method) and, therefore, an abstract idea.
Regarding Claims 1, 11 and 21: A computing system, the system comprising:
processing the image with an image embedding model to generate an image embedding (mathematical concepts, mathematical relationships, mathematical formulas or equations, mathematical calculations; image embedding model is a generic computer model and generating an image embedding is a mathematical concept as image embeddings are numerical vectors (lists of numbers) generated by a generic computer program or components)
processing each of the plurality of candidate text labels with each of the plurality of prompts with a text embedding model to generate a plurality of text embedding sets, wherein each text embedding set is associated with a different prompt of the plurality of prompts, and wherein each text embedding set comprises a particular text embedding associated with a particular candidate text label of the plurality of candidate text labels (mathematical concepts, mathematical relationships, mathematical formulas or equations, mathematical calculations; text embedding model is a generic computer model and generating a text embedding is a mathematical concept as text embeddings are numerical vectors (lists of numbers) generated by a generic computer program or components; generating text embedding sets…);
determining a score for each respective prompt of the plurality of prompts (mental process including observation and evaluation, and can be done mentally in the human mind or a generic computer program or components configured to perform the method; determining scores…);
generating a plurality of weighted text representations based on the plurality of text embeddings sets and the plurality of respective scores, wherein each weighted text representation is associated with a respective prompt of the plurality of prompts and a respective candidate text label of the plurality of candidate text labels (mental process including observation and evaluation, and can be done mentally in the human mind or a generic computer program or components configured to perform the method; generating weighted text representations…); and
determining an image classification based on the plurality of weighted text representations and the image embedding, wherein the image classification comprises a selected candidate text label of the plurality of candidate text labels (mental process including observation and evaluation, and can be done mentally in the human mind or a generic computer program or components configured to perform the method; determining an image classification…).
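For illustration only, and without specifying any particular implementation (the claims recite no specific similarity measure or weighting scheme), the recited operations reduce to ordinary vector arithmetic of the kind characterized above as mathematical concepts. The following is a minimal sketch, assuming cosine similarity for the prompt scores and softmax weighting, both of which are assumptions rather than claim requirements:

import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(image_embedding, text_embedding_sets, candidate_labels):
    # text_embedding_sets: for each prompt, a list of text embeddings, one per candidate label
    # score each prompt by the mean similarity of its embeddings to the image (assumed measure)
    prompt_scores = np.array([
        np.mean([cosine(image_embedding, t) for t in embedding_set])
        for embedding_set in text_embedding_sets
    ])
    weights = np.exp(prompt_scores) / np.exp(prompt_scores).sum()  # assumed softmax weighting
    # weighted text representation for each candidate label, pooled across prompts
    pooled = [
        sum(w * embedding_set[i] for w, embedding_set in zip(weights, text_embedding_sets))
        for i in range(len(candidate_labels))
    ]
    # classification: the candidate label whose pooled representation best matches the image
    best = max(range(len(candidate_labels)), key=lambda i: cosine(image_embedding, pooled[i]))
    return candidate_labels[best]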
Regarding Claim 17: the claim recites the following additional limitations:
processing the control image with an image embedding model to generate a control image embedding (mathematical concepts, mathematical relationships, mathematical formulas or equations, mathematical calculations; image embedding model is a generic computer model and generating an image embedding is a mathematical concept as image embeddings are numerical vectors (lists of numbers) generated by a generic computer program or components)
determining a prompt score based on the input image embedding, the control image embedding, and the plurality of text embeddings (mental process including observation and evaluation, and can be done mentally in the human mind or a generic computer program or components configured to perform the method; determining prompt score…);
These limitations, as drafted, recite a simple process that, under its broadest reasonable interpretation, covers performance of the limitations in the mind or by a human. The Examiner notes that under MPEP 2106.04(a)(2)(III), the courts consider a mental process (thinking) that “can be performed in the human mind, or by a human using a pen and paper” to be an abstract idea. CyberSource Corp. v. Retail Decisions, Inc., 654 F.3d 1366, 1372, 99 USPQ2d 1690, 1695 (Fed. Cir. 2011). As the Federal Circuit explained, “methods which can be performed mentally, or which are the equivalent of human mental work, are unpatentable abstract ideas – the ‘basic tools of scientific and technological work’ that are open to all.” 654 F.3d at 1371, 99 USPQ2d at 1694 (citing Gottschalk v. Benson, 409 U.S. 63, 175 USPQ 673 (1972)). See also Mayo Collaborative Servs. v. Prometheus Labs. Inc., 566 U.S. 66, 71, 101 USPQ2d 1961, 1965 (“‘[M]ental processes[] and abstract intellectual concepts are not patentable, as they are the basic tools of scientific and technological work’” (quoting Benson, 409 U.S. at 67, 175 USPQ at 675)); Parker v. Flook, 437 U.S. 584, 589, 198 USPQ 193, 197 (1978) (same). As such, a person could mentally analyze an image and determine its classification, either mentally or using pen and paper. The mere nominal recitation that the various steps are executed by or in a device (e.g., a processing unit) does not take the limitations out of the mental process grouping.
The use of an algorithm or machine learning model to analyze input data and then determine and perform an action based on the outcome is a common pattern of data input, analysis, and output, which courts have consistently held to be abstract.
The claimed functions – scoring, weighting, and classifying – could be performed conceptually by a human using pen and paper, and thus fall under abstract mental steps.
Conclusion: Thus, the claims are directed to an abstract idea.
STEP 2A (PRONG 2): Does the claim recite additional elements that integrate the judicial exception into a practical application? NO, the claims do not recite additional elements that integrate the judicial exception into a practical application.
With regard to STEP 2A (prong 2), whether the claim recites additional elements that integrate the judicial exception into a practical application, the guidelines provide the following exemplary considerations that are indicative that an additional element (or combination of elements) may have integrated the judicial exception into a practical application:
an additional element reflects an improvement in the functioning of a computer, or an improvement to other technology or technical field;
an additional element that applies or uses a judicial exception to affect a particular treatment or prophylaxis for a disease or medical condition;
an additional element implements a judicial exception with, or uses a judicial exception in conjunction with, a particular machine or manufacture that is integral to the claim;
an additional element effects a transformation or reduction of a particular article to a different state or thing; and
an additional element applies or uses the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claim as a whole is more than a drafting effort designed to monopolize the exception.
While the guidelines further state that the exemplary considerations are not an exhaustive list and that there may be other examples of integrating the exception into a practical application, the guidelines also list examples in which a judicial exception has not been integrated into a practical application:
an additional element merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea;
an additional element adds insignificant extra-solution activity to the judicial exception; and
an additional element does no more than generally link the use of a judicial exception to a particular technological environment or field of use.
Claims 1, 11, 17 and 21 do not recite any of the exemplary considerations that are indicative of an abstract idea having been integrated into a practical application.
Claims 1, 11 and 21 recite the following further limitations:
obtaining an image and a plurality of candidate text labels, wherein the plurality of candidate text labels are associated with a particular task (insignificant extra-solution activity of data acquisition);
obtaining a plurality of prompts, wherein the plurality of prompts are associated with a phrase to provide with a classification output (insignificant extra-solution activity of data acquisition);
Claim 17 recites further elements:
obtaining a control image, wherein the control image differs from the one or more images of the input data (insignificant extra-solution activity of data acquisition);
Claims 17 and 21 recite the following further limitations:
One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations (instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea)
These limitations are recited at a high level of generality (i.e., as a general action or change taken based on the results of the acquiring step) and amount to mere data gathering and post-solution activity, which is a form of insignificant extra-solution activity. Further, the additional elements are claimed generically and operate in their ordinary capacity, such that they do not use the judicial exception in a manner that imposes a meaningful limit on the judicial exception.
• There is no indication that the method improves the functioning of a computer, the learning model, or classification itself.
Conclusion: Accordingly, even in combination, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
STEP 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? NO, the claims do not recite additional elements that amount to significantly more than the judicial exception.
With regard to STEP 2B, whether the claims recite additional elements that provide significantly more than the recited judicial exception, the guidelines specify that the pre-guideline procedure is still in effect. Specifically, that examiners should continue to consider whether an additional element or combination of elements:
adds a specific limitation or combination of limitations that are not well-understood, routine, conventional activity in the field, which is indicative that an inventive concept may be present; or
simply appends well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception, which is indicative that an inventive concept may not be present.
Claims 1, 11, 17 and 21 do not recite any additional elements that are not well-understood, routine, or conventional.
• The claims lack an inventive concept sufficient to transform the abstract idea into patent-eligible subject matter.
• The use of a model to generate embeddings from received data is routine and conventional in the field of machine learning.
• The claims are functionally generic with no details about architecture, training, dataset specifics, or a novel arrangement of components.
Conclusion: The claims do not add significantly more than the abstract idea.
Final Determination: INELIGIBLE under 35 U.S.C. 101. Claims 1, 11, 17 and 21: (a) are directed toward an abstract idea (a mental process and data manipulation) using conventional tools in a generic way, (b) do not recite additional elements that integrate the judicial exception into a practical application, and (c) do not recite additional elements that amount to significantly more than the judicial exception.
Regarding Claims 2 – 10, 12 – 16 and 18 – 20: the additional elements recited in the claims do not integrate the mental process into a practical application or add significantly more to the mental process. The mere recitation that the functions are performed “by a model” does not demonstrate a technological improvement. The additional limitations further recite calculations that are mathematical concepts and fall under abstract ideas. The claims are functionally generic with no details about architecture, training, dataset specifics, or a novel arrangement of components. Since the claims are directed toward an abstract idea (a mental process and data manipulation) using conventional tools in a generic way, without integration into a practical application or an inventive concept, they are ineligible under 35 U.S.C. 101.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
(g)(1) during the course of an interference conducted under section 135 or section 291, another inventor involved therein establishes, to the extent permitted in section 104, that before such person’s invention thereof the invention was made by such other inventor and not abandoned, suppressed, or concealed, or (2) before such person’s invention thereof, the invention was made in this country by another inventor who had not abandoned, suppressed, or concealed it. In determining priority of invention under this subsection, there shall be considered not only the respective dates of conception and reduction to practice of the invention, but also the reasonable diligence of one who was first to conceive and last to reduce to practice, from a time prior to conception by the other.
A rejection on this statutory basis (35 U.S.C. 102(g) as in force on March 15, 2013) is appropriate in an application or patent that is examined under the first to file provisions of the AIA if it also contains or contained at any time (1) a claim to an invention having an effective filing date as defined in 35 U.S.C. 100(i) that is before March 16, 2013 or (2) a specific reference under 35 U.S.C. 120, 121, or 365(c) to any patent or application that contains or contained at any time such a claim.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 11 – 12 and 14 – 16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhou et al. (Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337-2348; hereafter referred to as Zhou).
Regarding Claim 11, Zhou teaches:
A computer-implemented method, the method comprising:
obtaining, by a computing system comprising one or more processors (Zhou, page 2340, col. 1, “computer vision tasks”), input data, wherein the input data is descriptive of one or more images (Zhou, Fig. 2, image and labels such as airplane, butterfly…pizza are text labels as inputs);
obtaining, by the computing system, a plurality of candidate text labels and a prompt, wherein the plurality of candidate text labels are descriptive of a plurality of candidate classifications (Zhou, Fig. 1 plurality of prompts; page 2339, Fig. 2 plurality of candidate text labels airplane, butterfly, pizza);
generating, by the computing system, a plurality of text strings based on the plurality of candidate text labels and the prompt, wherein each of the plurality of text strings are generated by augmenting the prompt with a candidate text label of the plurality of candidate text labels (Zhou, page 2340, 3.3 Vision Language pre-training, zero-shot inference “image features with the classification weights synthesized by the text encoder, which takes as input textual descriptions specifying classes of interest. Formally, let f be image features extracted by the image encoder for an image x and {wi}K i=1 a set of weight vectors generated by the text encoder. K denotes the number of classes and each wi is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car.” The prediction probability is then computed”); and
processing, by the computing system, each text string of the plurality of text strings with a text embedding model to generate a plurality of text embeddings, wherein each text embedding of the plurality of text embeddings is associated with a respective text string (Zhou, Fig. 2, page 2340, 3.1 Vision-Language Pre-Training, text encoder “the text encoder is built on top of a Transformer and aims to generate text representations from natural language”; Fig. 1, prompt; section 3.1 vision-language pre-training “the prompt given to the text encoder g(·) is designed with the following form, t = [V]1[V]2 ... [V]M [CLASS], (2) where each [V]m (m ∈ {1,..., M}) is a vector with the same dimension as word embeddings (i.e., 512 for CLIP), and M is a hyperparameter specifying the number of context tokens”);
processing, by the computing system, the input data with an image embedding model to generate an image embedding (Zhou, Fig. 2, image encoder, 3.1 Vision-language pre-training, “The image encoder aims to map high-dimensional images into a low-dimensional embedding space”);
determining, by the computing system, a prompt score based on a similarity measure between the image embedding and the plurality of text embeddings (Zhou, Fig. 2, similarity scores; page 2341, 3.3 Discussion, “take both visual and textual data as input and produce alignment scores used for image recognition”);
generating, by the computing system, a plurality of weighted text embeddings based on the prompt score and the plurality of text embeddings (Zhou, 3.2 Context Optimization, “the prompt given to the text encoder g(·) is designed with the following form, t = [V]1[V]2 ... [V]M [CLASS], (2) where each [V]m (m ∈ {1,..., M}) is a vector with the same dimension as word embeddings (i.e., 512 for CLIP), and M is a hyperparameter specifying the number of context tokens. By forwarding a prompt t to the text encoder g(·), we can obtain a classification weight vector representing a visual concept (still from the [EOS]token position)”); and
determining, by the computing system, a classification output based at least in part on the plurality of weighted text embeddings (Zhou, 3.1 Vision language pre-training, Zero-Shot Inference Since CLIP is pre-trained to predict whether an image matches a textual description, it naturally fits zero-shot recognition. This is achieved by comparing image features with the classification weights synthesized by the text encoder, which takes as input textual descriptions specifying classes of interest”);
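For reference, the “prediction probability” referred to in the zero-shot inference passage quoted above takes, in CLIP-style models (Zhou, Section 3.1), the standard softmax-over-cosine-similarity form, where f is the image embedding, w_i is the text embedding derived from the prompt for class i, K is the number of classes, and τ is a learned temperature:

\[
p(y = i \mid x) = \frac{\exp\big(\cos(w_i, f)/\tau\big)}{\sum_{j=1}^{K} \exp\big(\cos(w_j, f)/\tau\big)}
\]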
Regarding Claim 12, Zhou teaches the method of claim 11, wherein determining the classification output comprises:
determining, by the computing system, a similarity measure associated with each weighted text embedding of the plurality of weighted text embeddings based on the image embedding and the plurality of weighted text embeddings (Zhou, Fig. 2, similarity scores, 3.1 Vision-Language Pre-Training, Training, “CLIP maximizes the cosine similarity for matched pairs while minimizes the cosine similarity for all other unmatched pairs”; Zero-Shot Inference, “let f be image features extracted by the image encoder for an image x and {wi}K i=1 a set of weight vectors generated by the text encoder. K denotes the number of classes and each wi is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car.” The prediction probability is then computed”).
Regarding Claim 14, Zhou teaches the method of claim 11, wherein the text embedding model comprises a text encoder, wherein the image embedding model comprises an image encoder, and wherein the text embedding model and the image embedding model were pre-trained on a training dataset (Zhou, page 2340, 3.1 Vision-Language Pre-Training, “Models CLIP consists of two encoders, one for images and the other for text. The image encoder aims to map high-dimensional images into a low-dimensional embedding space. The architecture of the image encoder can take the form of a CNN like ResNet-50 (He et al., 2016) or a ViT (Dosovitskiy et al., 2021). On the other hand, the text encoder is built on top of a Transformer (Vaswani et al., 2017) and aims to generate text representations from natural language”; Zero-Shot Inference, “CLIP is pre-trained to predict whether an image matches a textual description, it naturally fits zero-shot recognition”).
Regarding Claim 15, Zhou teaches the method of claim 14, wherein the training dataset comprises a plurality of text-image pairs, wherein each text-image pair comprises an image and a respective caption (Zhou, 3.1 Vision-Language Pre-Training, Training, “Given a batch of image-text pairs, CLIP maximizes the cosine similarity for matched pairs”).
Regarding Claim 16, Zhou teaches the method of claim 14, wherein the text embedding model and the image embedding model were trained based on a bi-directional contrastive loss (Zhou, 3.1 Vision-Language Pre-Training, Training “CLIP is trained to align the two embedding spaces learned for images and text respectively. Specifically, the learning objective is formulated as a contrastive loss”).
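For reference, the bi-directional contrastive objective described in the passage cited for claim 16 is commonly implemented as a symmetric cross-entropy over the batch image-text similarity matrix. The following is a minimal illustrative sketch of that well-known formulation, not code from the applicant or the cited reference:

import numpy as np

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # image_embs, text_embs: N x D arrays of matched image/text embeddings
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = image_embs @ text_embs.T / temperature   # N x N cosine-similarity logits
    labels = np.arange(len(logits))                   # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))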
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1 – 3, 5 – 10, 13 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337-2348; hereafter referred to as Zhou) in view of Radford et al. (Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR; hereafter referred to as Radford).
Regarding Claim 1, Zhou teaches:
A computing system, the system comprising:
one or more processors (Zhou, page 2340, col. 1, “computer vision tasks”); and
one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors (Zhou, page 2340, col. 1, “computer vision tasks”), cause the computing system to perform operations, the operations comprising:
obtaining an image and a plurality of candidate text labels, wherein the plurality of candidate text labels are associated with a particular task (Zhou, Fig. 2, image and labels such as airplane, butterfly…pizza are text labels as inputs);
obtaining a plurality of prompts, wherein the plurality of prompts are associated with a phrase to provide with a classification output (Zhou, Fig. 1 plurality of prompts; page 2340, section 3.1 vision-language pre-training, zero-shot inference “let f be image features extracted by the image encoder for an image x and {wi}K i=1 a set of weight vectors generated by the text encoder. K denotes the number of classes and each wi is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car”);
processing the image with an image embedding model to generate an image embedding (Zhou, Fig. 2, image encoder, 3.1 Vision-language pre-training, “The image encoder aims to map high-dimensional images into a low-dimensional embedding space”);
processing each of the plurality of candidate text labels with each of the plurality of prompts with a text embedding model to generate a plurality of text embedding sets, wherein each text embedding set is associated with a different prompt of the plurality of prompts, and wherein each text embedding set comprises a particular text embedding associated with a particular candidate text label of the plurality of candidate text labels (Zhou, Fig. 2, page 2340, 3.1 Vision-Language Pre-Training, text encoder “the text encoder is built on top of a Transformer and aims to generate text representations from natural language”; Fig. 1, prompt; section 3.1 vision-language pre-training “the prompt given to the text encoder g(·) is designed with the following form, t = [V]1[V]2 ... [V]M [CLASS], (2) where each [V]m (m ∈ {1,..., M}) is a vector with the same dimension as word embeddings (i.e., 512 for CLIP), and M is a hyperparameter specifying the number of context tokens”);
determining a score for each respective prompt of the plurality of prompts (Zhou, page 2341, 3.3 Discussion, “take both visual and textual data as input and produce alignment scores used for image recognition”);
generating a plurality of weighted text representations based on the plurality of text embeddings sets and the plurality of respective scores, wherein each weighted text representation is associated with a respective prompt of the plurality of prompts and a respective candidate text label of the plurality of candidate text labels (Zhou, page 2340, 3.3 Vision Language pre-training, zero-shot inference “image features with the classification weights synthesized by the text encoder, which takes as input textual descriptions specifying classes of interest. Formally, let f be image features extracted by the image encoder for an image x and {wi}K i=1 a set of weight vectors generated by the text encoder. K denotes the number of classes and each wi is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car.” The prediction probability is then computed”); and
determining an image classification based on the plurality of weighted text representations and the image embedding (Zhou, 3.1 Vision language pre-training, Zero-Shot Inference Since CLIP is pre-trained to predict whether an image matches a textual description, it naturally fits zero-shot recognition. This is achieved by comparing image features with the classification weights synthesized by the text encoder, which takes as input textual descriptions specifying classes of interest”).
However, Zhou doesn’t explicitly recite:
wherein the image classification comprises a selected candidate text label of the plurality of candidate text labels.
In the same field of endeavor, Radford teaches:
wherein the image classification comprises a selected candidate text label of the plurality of candidate text labels (Radford, Fig. 1, “CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes”; Radford, Fig. 1, (2) create dataset classifier from the label text, output “A photo of a dog”).
Zhou and Radford are considered analogous art, as they are reasonably pertinent to the same field of endeavor of image processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhou with the method of Radford to arrive at an invention that selects a candidate text label of the plurality of candidate text labels to classify the image; doing so can improve the performance and accuracy of classifying the image (Radford, Introduction); thus, one of ordinary skill in the art would have been motivated to combine the references.
Regarding Claim 2, Zhou in view of Radford teaches the system of claim 1, wherein determining the score for each respective prompt of the plurality of prompts comprises:
determining a similarity measure between a text embedding set of a respective prompt and the image embedding (Zhou, 3.1 Vision-Language Pre-Training, Training CLIP is trained to align the two embedding spaces learned for images and text respectively. Specifically, the learning objective is formulated as a contrastive loss. Given a batch of image-text pairs, CLIP maximizes the cosine similarity for matched pairs while minimizes the cosine similarity for all other unmatched pairs”) (Radford, page 3, col.1, “CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N2 − N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity score”).
Regarding Claim 3, Zhou in view of Radford teaches the system of claim 2, wherein the similarity measure comprises an average embedding similarity between the text embeddings of the text embedding set and the image embedding (Radford, page 3, col. 1, “CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N2 − N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores”).
Regarding Claim 5, Zhou in view of Radford teaches the system of claim 1, wherein the operations further comprise:
obtaining a pre-trained image-text model, wherein the pre-trained image-text model comprises a foundation model pre-trained on a training dataset without a specific downstream task, and wherein the pre-trained image-text model comprises the text embedding model and the image embedding model (Zhou, Introduction, “pre-training at a large scale, models can learn diverse visual concepts and can readily be transferred to any downstream task through prompting (Radford et al., 2021) Fig. 2, text encoder and image encoder; 3.1 Vision-Language Pre-Training, “Models CLIP consists of two encoders, one for images and the other for text. The image encoder aims to map high-dimensional images into a low-dimensional embedding space. The architecture of the image encoder can take the form of a CNN like ResNet-50 (He et al., 2016) or a ViT (Dosovitskiy et al., 2021). On the other hand, the text encoder is built on top of a Transformer (Vaswani et al., 2017) and aims to generate text representations from natural language”).
Regarding Claim 6, Zhou in view of Radford teaches the system of claim 5, wherein the training dataset comprises a plurality of image-caption training examples (Radford, 2.2. Selecting an Efficient Pre-Training Method “Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image”).
Regarding Claim 7, Zhou in view of Radford teaches the system of claim 1, wherein the operations further comprise:
providing the image classification as an output (Radford, Fig. 1, (2) create dataset classifier from the label text, output “A photo of a dog”).
Regarding Claim 8, Zhou in view of Radford teaches the system of claim 1, wherein the plurality of prompts comprise a plurality of caption templates (Zhou, Fig. 1 plurality of prompt with caption templates; Radford, page 4, Using CLIP, “providing CLIP with text prompts to help specify the task as well as ensembling multiple of these templates in order to boost performance”).
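For illustration, caption templates of the kind referenced above are natural-language strings with a slot for a class label; the specific strings below are illustrative examples of this pattern, not quotations from the claims or the references:

caption_templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of the small {}.",
    "a photo of the large {}.",
]
# augmenting each template with a candidate text label yields the text strings to embed
prompts = [template.format("dog") for template in caption_templates]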
Regarding Claim 9, Zhou in view of Radford teaches the system of claim 8, wherein the plurality of caption templates are configured to be augmented to comprise a classification label and be descriptive of an example caption for an input image (Radford, Fig. 1, (2) Create dataset classifier from label text, “CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes”).
Regarding Claim 10, Zhou in view of Radford teaches the system of claim 1, wherein the plurality of candidate text labels are descriptive of a plurality of candidate object classification (Radford, Fig. 1, (2) Create dataset classifier from label text, text labels plane, car, dog, …bird).
Regarding Claim 13, Zhou teaches the method of claim 11, further comprising:
obtaining, by the computing system, a second prompt, wherein the second prompt differs from the prompt (Zhou, Fig. 1 different prompts, a photo of a …, a flower photo of a … etc.);
generating, by the computing system, a plurality of second weighted text embeddings based on the second prompt and the plurality of text embeddings (Zhou, 3.1 Vision-Language Pre-Training, Zero-Shot Inference, “comparing image features with the classification weights synthesized by the text encoder, which takes as input textual descriptions specifying classes of interest. Formally, let f be image features extracted by the image encoder for an image x and {wi}K i=1 a set of weight vectors generated by the text encoder. K denotes the number of classes and each wi is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car.”);
determining, by the computing system, an adjusted text embedding for a particular candidate text label of the plurality of candidate text labels based on a respective weighted text embedding of the plurality of weighted text embeddings and a respective second weighted text embedding of the plurality of second weighted text embeddings (Zhou, 3.1 Vision-Language Pre-Training, Zero-Shot Inference, “et f be image features extracted by the image encoder for an image x and {wi}K i=1 a set of weight vectors generated by the text encoder. K denotes the number of classes and each wi is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car.”); and
However, Zhou fails to explicitly recite:
wherein the classification output is determined based on a similarity measure associated with the adjusted text embedding and the image embedding.
In the same field of endeavor, Radford teaches:
wherein the classification output is determined based on a similarity measure associated with the adjusted text embedding and the image embedding (Radford, page 3, col. 1, “CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings”).
Zhou and Radford are considered analogous art, as they are reasonably pertinent to the same field of endeavor of image processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhou with the method of Radford to arrive at an invention that determines the classification output based on a similarity measure associated with the adjusted text embedding and the image embedding; doing so can improve the performance and accuracy of classifying the image (Radford, Introduction); thus, one of ordinary skill in the art would have been motivated to combine the references.
Regarding Claim 21, Zhou teaches:
A computing system, the system comprising:
one or more processors (page 2340, col. 1, “computer vision tasks”); and
one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors (page 2340, col. 1, “computer vision tasks”), cause the computing system to perform operations, the operations comprising:
obtaining an image (Fig. 2, image and labels such as airplane, butterfly…pizza are text labels as inputs);
processing the image with an image embedding model to generate an image embedding (Fig. 2, image encoder, 3.1 Vision-language pre-training, “The image encoder aims to map high-dimensional images into a low-dimensional embedding space”);
obtaining a plurality of text embedding sets, wherein the plurality of text embedding sets were generated based on processing a plurality of candidate text labels with a plurality of prompt templates with a text embedding model, wherein each text embedding set is associated with a different prompt template of the plurality of prompt templates (Fig. 1 plurality of prompts; page 2340, section 3.1 vision-language pre-training, zero-shot inference “let f be image features extracted by the image encoder for an image x and {wi}K i=1 a set of weight vectors generated by the text encoder. K denotes the number of classes and each wi is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car”);
processing each of the plurality of candidate text labels with each of the plurality of prompts with a text embedding model to generate a plurality of text embedding sets, wherein each text embedding set is associated with a different prompt of the plurality of prompts, and wherein each text embedding set comprises a particular text embedding associated with a particular candidate text label of the plurality of candidate text labels (Fig. 2, page 2340, 3.1 Vision-Language Pre-Training, text encoder “the text encoder is built on top of a Transformer and aims to generate text representations from natural language”; Fig. 1, prompt; section 3.1 vision-language pre-training “the prompt given to the text encoder g(·) is designed with the following form, t = [V]1[V]2 ... [V]M [CLASS], (2) where each [V]m (m ∈ {1,..., M}) is a vector with the same dimension as word embeddings (i.e., 512 for CLIP), and M is a hyperparameter specifying the number of context tokens”);
determining a score for each respective prompt template of the plurality of prompt templates based on a respective text embedding set for the respective prompt template (page 2341, 3.3 Discussion, “take both visual and textual data as input and produce alignment scores used for image recognition”);
generating a plurality of weighted text representations based on the plurality of text embeddings sets and the plurality of respective scores, wherein each weighted text representation is associated with a respective prompt of the plurality of prompts and a respective candidate text label of the plurality of candidate text labels (page 2340, 3.3 Vision Language pre-training, zero-shot inference “image features with the classification weights synthesized by the text encoder, which takes as input textual descriptions specifying classes of interest. Formally, let f be image features extracted by the image encoder for an image x and {wi}K i=1 a set of weight vectors generated by the text encoder. K denotes the number of classes and each wi is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car.” The prediction probability is then computed”); and
determining an image classification based on the plurality of weighted text representations and the image embedding (3.1 Vision language pre-training, Zero-Shot Inference Since CLIP is pre-trained to predict whether an image matches a textual description, it naturally fits zero-shot recognition. This is achieved by comparing image features with the classification weights synthesized by the text encoder, which takes as input textual descriptions specifying classes of interest”).
However, Zhou doesn’t explicitly recite:
wherein the image classification comprises a selected candidate text label of the plurality of candidate text labels.
In the same field of endeavor, Radford teaches:
wherein the image classification comprises a selected candidate text label of the plurality of candidate text labels (Radford, Fig. 1, “CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes”; Radford, Fig. 1, (2) create dataset classifier from the label text, output “A photo of a dog”).
Zhou and Radford are considered analogous art, as they are reasonably pertinent to the same field of endeavor of image processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhou with the method of Radford to arrive at an invention that selects a candidate text label of the plurality of candidate text labels to classify the image; doing so can improve the performance and accuracy of classifying the image (Radford, Introduction); thus, one of ordinary skill in the art would have been motivated to combine the references.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337-2348; hereafter referred to as Zhou) in view of Radford et al. (Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR; hereafter referred to as Radford) further in view of Geng et al. (US 20240212327 A1; hereafter referred to as Geng).
Regarding Claim 4, Zhou in view of Radford teaches the system of claim 1, but fails to explicitly teach:
obtaining a control image, wherein the control image differs from the image;
processing the control image with the image embedding model to generate a control image embedding; and
wherein the score is generated based on the image embedding, the control image embedding, and a respective text embedding set for the respective prompt
In the same field of endeavor, Geng teaches:
obtaining a control image, wherein the control image differs from the image (Geng, [0017] “he zero-shot joint text-image encoder models can perform effectively even on out-of-distribution tasks. In-distribution generalization refers to generalizing to examples that are new but still drawn from a same distribution as a data set used for training. An example result of in-distribution generalization is an ability to classify a new breed of dog as a dog, when the data set used for training only depicts other breeds of dogs. Out-of-distribution generalization refers to generalizing to inputs drawn from a different distribution than a data set used for training. An example result of out-of-distribution generalization is an ability to classify a cow as a class representing a cow, when the data set used for training depicts different breeds of dogs”);
processing the control image with the image embedding model to generate a control image embedding (Geng, [0034] “CLIP trains a model using pairs as training data, where each pair includes an image and text corresponding to the image; the text is also referred to as a caption for the image, according to one embodiment. Given a set of image-caption pairs, CLIP trains an image encoder and a text encoder over a set of input data such that a cosine similarity between features extracted by the image encoder and features extracted by the text encoder is maximized with respect to each pair, where the features are captured and represented as embeddings”); and
wherein the score is generated based on the image embedding, the control image embedding, and a respective text embedding set for the respective prompt (Geng, [0042] “the image reprogrammer also operates as an input-perturbation function that increases in-distribution softmax scores”; Geng, [0034] “CLIP trains an image encoder and a text encoder over a set of input data such that a cosine similarity between features extracted by the image encoder and features extracted by the text encoder is maximized with respect to each pair, where the features are captured and represented as embeddings”; [0035] “CLIP jointly trains the image encoder and the text encoder to predict correct pairings of the pairs in the training data, at least in some embodiments. More specifically, at least in some embodiments, CLIP learns an embedding space by maximizing a cosine similarity of embeddings of the image and the caption of correct pairings of the training data while also minimizing a cosine similarity of embeddings of the image and the caption of incorrect pairings of the training data. The embedding space is also referred to as a multimodal embedding space, given the training data including both images and text”)
Zhou, Radford, and Geng are considered analogous art, as they are reasonably pertinent to the same field of endeavor of image processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhou in view of Radford with the invention of Geng to arrive at an invention that obtains a new image as a control image, processes the control image with the image embedding model, and determines a prompt score; doing so can improve performance and accurately identify and classify the image (Geng, [0006]); thus, one of ordinary skill in the art would have been motivated to combine the references.
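Purely for illustration, one hypothetical way a prompt score could be computed from an input image embedding, a control image embedding, and a text embedding set is to contrast the two images' similarities to the same text embeddings; neither the claim nor the cited references specify this exact computation, so the sketch below is an assumption, not a characterization of the claimed method or of Geng:

import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def prompt_score(input_image_emb, control_image_emb, text_embeddings):
    # hypothetical formulation: how much better the prompt's text embeddings
    # match the input image than they match the control image
    input_sims = [cosine(input_image_emb, t) for t in text_embeddings]
    control_sims = [cosine(control_image_emb, t) for t in text_embeddings]
    return float(np.mean(input_sims) - np.mean(control_sims))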
Claims 17 - 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337-2348; hereafter referred to as Zhou) in view of Geng et al. (US 20240212327 A1; hereafter referred to as Geng).
Regarding Claim 17, Zhou teaches:
One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices (Zhou, page 2340, col. 1, “computer vision tasks”), cause the one or more computing devices to perform operations, the operations comprising:
obtaining input data, wherein the input data is descriptive of one or more images;
obtaining a plurality of candidate text labels and a prompt, wherein the plurality of candidate text labels are descriptive of a plurality of candidate classifications (Zhou, Fig. 2, image and labels such as airplane, butterfly…pizza are text labels as inputs);
generating a plurality of text strings based on the plurality of candidate text labels and the prompt, wherein each of the plurality of text strings are generated by augmenting the prompt with a candidate text label of the plurality of candidate text labels (Zhou, page 2340, 3.3 Vision Language pre-training, zero-shot inference “image features with the classification weights synthesized by the text encoder, which takes as input textual descriptions specifying classes of interest. Formally, let f be image features extracted by the image encoder for an image x and {wi}K i=1 a set of weight vectors generated by the text encoder. K denotes the number of classes and each wi is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car.” The prediction probability is then computed”); and
processing each text string of the plurality of text strings with a text embedding model to generate a plurality of text embeddings, wherein each text embedding of the plurality of text embeddings is associated with a respective text string (Zhou, Fig. 2, page 2340, 3.1 Vision-Language Pre-Training, text encoder “the text encoder is built on top of a Transformer and aims to generate text representations from natural language”; Fig. 1, prompt; section 3.1 vision-language pre-training “the prompt given to the text encoder g(·) is designed with the following form, t = [V]1[V]2 ... [V]M [CLASS], (2) where each [V]m (m ∈ {1,..., M}) is a vector with the same dimension as word embeddings (i.e., 512 for CLIP), and M is a hyperparameter specifying the number of context tokens”);
processing the input data with an image embedding model to generate an image embedding (Zhou, Fig. 2, image encoder, 3.1 Vision-language pre-training, “The image encoder aims to map high-dimensional images into a low-dimensional embedding space”);
generating a plurality of weighted text embeddings based on the prompt score and the plurality of text embeddings (Zhou, 3.2 Context Optimization, “the prompt given to the text encoder g(·) is designed with the following form, t = [V]1[V]2 ... [V]M [CLASS], (2) where each [V]m (m ∈ {1,..., M}) is a vector with the same dimension as word embeddings (i.e., 512 for CLIP), and M is a hyperparameter specifying the number of context tokens. By forwarding a prompt t to the text encoder g(·), we can obtain a classification weight vector representing a visual concept (still from the [EOS]token position)”); and
determining a classification output based at least in part on the plurality of weighted text embeddings (Zhou, 3.1 Vision-Language Pre-Training, Zero-Shot Inference, “Since CLIP is pre-trained to predict whether an image matches a textual description, it naturally fits zero-shot recognition. This is achieved by comparing image features with the classification weights synthesized by the text encoder, which takes as input textual descriptions specifying classes of interest”);
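For illustration only, and not as a characterization of Zhou or of the claims, the prompt-augmentation and zero-shot classification flow mapped in the preceding limitations can be sketched in Python; all function and variable names (zero_shot_classify, text_encoder, etc.) are hypothetical, and the softmax over unscaled similarities is an assumed simplification:

import numpy as np

def normalize(v):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(image_embedding, candidate_labels, prompt, text_encoder):
    # Augment the prompt with each candidate text label; assumes the prompt is a
    # template such as "a photo of a {}." so that it yields, e.g., "a photo of a cat."
    text_strings = [prompt.format(label) for label in candidate_labels]
    # Process each text string with the text embedding model.
    text_embeddings = normalize(np.stack([text_encoder(s) for s in text_strings]))
    image_embedding = normalize(np.asarray(image_embedding))
    # Compare the image embedding against each text embedding (cosine similarity).
    similarities = text_embeddings @ image_embedding
    # A softmax over the similarities yields one prediction probability per label.
    probabilities = np.exp(similarities) / np.exp(similarities).sum()
    return candidate_labels[int(np.argmax(probabilities))], probabilities

On this sketch, the highest-probability candidate text label is returned as the classification output.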
However, Zhou fails to explicitly teach:
obtaining a control image, wherein the control image differs from the one or more images of the input data;
processing the control image with an image embedding model to generate a control image embedding;
determining a prompt score based on the input image embedding, the control image embedding, and the plurality of text embeddings;
In the same field of endeavor, Geng teaches:
obtaining a control image, wherein the control image differs from the one or more images of the input data (Geng, [0017] “The zero-shot joint text-image encoder models can perform effectively even on out-of-distribution tasks. In-distribution generalization refers to generalizing to examples that are new but still drawn from a same distribution as a data set used for training. An example result of in-distribution generalization is an ability to classify a new breed of dog as a dog, when the data set used for training only depicts other breeds of dogs. Out-of-distribution generalization refers to generalizing to inputs drawn from a different distribution than a data set used for training. An example result of out-of-distribution generalization is an ability to classify a cow as a class representing a cow, when the data set used for training depicts different breeds of dogs”);
processing the control image with an image embedding model to generate a control image embedding (Geng, [0034] “CLIP trains a model using pairs as training data, where each pair includes an image and text corresponding to the image; the text is also referred to as a caption for the image, according to one embodiment. Given a set of image-caption pairs, CLIP trains an image encoder and a text encoder over a set of input data such that a cosine similarity between features extracted by the image encoder and features extracted by the text encoder is maximized with respect to each pair, where the features are captured and represented as embeddings”);
determining a prompt score based on the input image embedding, the control image embedding, and the plurality of text embeddings (Geng, [0042] “the image reprogrammer also operates as an input-perturbation function that increases in-distribution softmax scores”; Geng, [0034] “CLIP trains an image encoder and a text encoder over a set of input data such that a cosine similarity between features extracted by the image encoder and features extracted by the text encoder is maximized with respect to each pair, where the features are captured and represented as embeddings”; [0035] “CLIP jointly trains the image encoder and the text encoder to predict correct pairings of the pairs in the training data, at least in some embodiments. More specifically, at least in some embodiments, CLIP learns an embedding space by maximizing a cosine similarity of embeddings of the image and the caption of correct pairings of the training data while also minimizing a cosine similarity of embeddings of the image and the caption of incorrect pairings of the training data. The embedding space is also referred to as a multimodal embedding space, given the training data including both images and text”);
Zhou and Geng are considered analogous art as they are reasonably pertinent to the same field of endeavor of image processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Zhou with the invention of Geng to arrive at an invention that obtains a new image as a control image, processes the control image with the image embedding model, and determines a prompt score; doing so can improve performance and allow the image to be accurately and correctly identified and classified (Geng, [0006]); thus, one of ordinary skill in the art would have been motivated to combine the references.
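For illustration only, one possible implementation of the control-image limitations discussed above is sketched below; it is a hypothetical reading offered for readability, not a characterization of Geng's disclosure, and it assumes for the purpose of the sketch that the same image embedding model encodes both the input image and the control image. All names (control_image_prompt_score, image_encoder, etc.) are hypothetical:

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def control_image_prompt_score(input_image, control_image, text_embeddings, image_encoder):
    # The control image differs from the input image(s); both are mapped into the
    # same embedding space by the image embedding model (assumed shared here).
    input_image_embedding = normalize(image_encoder(input_image))
    control_image_embedding = normalize(image_encoder(control_image))
    text_embeddings = normalize(np.asarray(text_embeddings))          # shape (K, D)
    # Cosine similarities of each image embedding to the candidate text embeddings.
    input_similarities = text_embeddings @ input_image_embedding      # shape (K,)
    control_similarities = text_embeddings @ control_image_embedding  # shape (K,)
    # A prompt score computed from all three quantities; one illustrative choice is
    # the mean margin between the two similarity profiles (cf. claim 18 below).
    return float(input_similarities.mean() - control_similarities.mean())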
Regarding Claim 18, Zhou in view of Geng teaches one or more non-transitory computer-readable media of claim 17, wherein determining the prompt score comprises:
determining a first similarity measure based on the input image embedding and the plurality of text embeddings (Zhou, Fig. 2, similarity score, “class-specific context, which learns for each class a specific set of context vectors”; Zhou, page 2340, 3.1 Vision-Language Pre-Training, Training, “CLIP maximizes the cosine similarity for matched pairs while minimizes the cosine similarity for all other unmatched pairs”);
determining a second similarity measure based on the control image embedding and the plurality of text embeddings (Geng, [0046] “FIG. 3 is a block diagram 300 depicting a loss function used in fine-tuning joint text-image encoders, according to one embodiment. Although the loss function is shown and described as a cosine-similarity loss function 302, the type of loss function used can vary and be tailored to suit the needs of a particular case. The cosine-similarity loss function computes a loss based on the respective outputs of each of the image encoder 118 and the text encoder 120. The output of the image encoder 118 includes features extracted by the image encoder 118 from a reprogrammed image. The output of the text encoder includes features extracted by the text encoder 120 from reprogrammed text”); and
determining the prompt score based on a difference between the first similarity measure and the second similarity measure (Geng, [0049] “the loss function is a symmetric, cross-entropy loss function. Cross entropy loss measures a difference between a discovered probability distribution of a model and a predicted distribution of the model”; Geng, [0042] “the image reprogrammer also operates as an input-perturbation function that increases in-distribution softmax scores”; Geng, [0069] “The application 150 then further trains each of the image and text encoders of the joint text-image encoder 116 using a respective one of the reprogrammed images and captions”).
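As a purely hypothetical numerical illustration of this limitation (the values are assumed, not taken from Zhou or Geng): if the first similarity measure between the input image embedding and the plurality of text embeddings were 0.62, and the second similarity measure between the control image embedding and the plurality of text embeddings were 0.45, then a prompt score based on the difference between the two measures would be 0.62 − 0.45 = 0.17; on this reading, a larger score would suggest that the prompt distinguishes the input image from the control image more strongly.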
Regarding Claim 19, Zhou in view of Geng teaches one or more non-transitory computer-readable media of claim 17, wherein the plurality of weighted text embeddings are generated based on softmax weighting across a plurality of prompt text embedding sets (Geng, [0042] “the image reprogrammer also operates as an input-perturbation function that increases in-distribution softmax scores, according to one embodiment. As such, at least in some cases, the image reprogrammer can consistently improve separability between in-distribution and out-of-distribution samples, thereby improving performance at least in terms of accuracy in out-of-distribution detection tasks”; Geng, [0045] “the text reprogramming function Φ_θ is defined to be a look-up embedding on the tokens {s_i}, where Φ_θ can be parameterized by the learnable embedding tensor θ”).
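For illustration only, softmax weighting across a plurality of prompt text embedding sets, as recited in claim 19, might be sketched as follows; the array shapes and names are assumptions for readability, not drawn from Geng or from the claim:

import numpy as np

def weight_text_embeddings(prompt_text_embedding_sets, prompt_scores):
    # prompt_text_embedding_sets: shape (P, K, D), i.e., P prompt text embedding sets,
    # each with one text embedding per candidate label (K labels, D dimensions).
    # prompt_scores: shape (P,), one prompt score per set.
    prompt_scores = np.asarray(prompt_scores, dtype=float)
    exp_scores = np.exp(prompt_scores - prompt_scores.max())   # numerically stable softmax
    weights = exp_scores / exp_scores.sum()                    # softmax weighting across the sets
    # Weighted text embeddings: per-label combination across the prompt sets, shape (K, D).
    return np.tensordot(weights, np.asarray(prompt_text_embedding_sets), axes=(0, 0))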
Regarding Claim 20, Zhou in view of Geng teaches one or more non-transitory computer-readable media of claim 17, wherein the operations further comprise:
generating a plurality of probability predictions for the plurality of candidate text labels based on the plurality of weighted text embeddings and the image embedding (Zhou, 3.1 Vision-Language Pre-Training, Zero-Shot Inference, “let f be image features extracted by the image encoder for an image x and {w_i}, i = 1, …, K, a set of weight vectors generated by the text encoder. K denotes the number of classes and each w_i is derived from a prompt that could have the form of “a photo of a [CLASS].” where the class token is replaced by the specific class name, such as “cat,” “dog” or “car.” The prediction probability is then computed as p(y = i|x)”); and
wherein the classification output is determined based on the plurality of probability predictions (Zhou, 3.2 Context Optimization, Unified Context, “The prediction probability is computed as p(y = i|x) … where the class token within each prompt t_i is replaced by the corresponding word embedding vector(s) of the i-th class name”).
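For illustration only, the prediction step recited in claim 20 could be sketched as below; the names are hypothetical, and a softmax over per-label similarities is assumed, mirroring the prediction probability p(y = i|x) quoted from Zhou:

import numpy as np

def predict(weighted_text_embeddings, image_embedding, candidate_labels):
    # weighted_text_embeddings: shape (K, D); image_embedding: shape (D,).
    sims = np.asarray(weighted_text_embeddings) @ np.asarray(image_embedding)  # one similarity per label
    # Plurality of probability predictions for the plurality of candidate text labels.
    probabilities = np.exp(sims - sims.max()) / np.exp(sims - sims.max()).sum()
    # Classification output determined based on the plurality of probability predictions.
    return candidate_labels[int(np.argmax(probabilities))], probabilities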
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 20230351753 A1 TEXT-CONDITIONED VIDEO REPRESENTATION
US 20230334245 A1 SYSTEMS AND METHODS FOR ZERO-SHOT TEXT CLASSIFICATION WITH A CONFORMAL PREDICTOR
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VAISALI RAO KOPPOLU whose telephone number is (571)270-0273. The examiner can normally be reached Monday - Friday 8:30 - 5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at (571) 272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
VAISALI RAO KOPPOLU
Examiner
Art Unit 2664
/JENNIFER MEHMOOD/Supervisory Patent Examiner, Art Unit 2664