Prosecution Insights
Last updated: April 19, 2026
Application No. 18/724,632

SMALL SAMPLE FINE-TUNING METHOD AND SYSTEM AND RELATED APPARATUS

Non-Final OA: §101, §103, §112
Filed: Jun 27, 2024
Examiner: VOGT, JACOB BUI
Art Unit: 2653
Tech Center: 2600 (Communications)
Assignee: Suzhou MetaBrain Intelligent Technology Co., Ltd.
OA Round: 1 (Non-Final)
Grant Probability: 57% (Moderate)
Projected OA Rounds: 1-2
Time to Grant: 2y 10m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 57% (grants 57% of resolved cases; 4 granted / 7 resolved; -4.9% vs TC avg)
Interview Lift: +100.0% in resolved cases with interview
Typical Timeline: 2y 10m average prosecution; 33 applications currently pending
Career History: 40 total applications across all art units

Statute-Specific Performance

§101: 35.1% (-4.9% vs TC avg)
§103: 43.8% (+3.8% vs TC avg)
§102: 8.7% (-31.3% vs TC avg)
§112: 10.6% (-29.4% vs TC avg)
Deltas are measured against the Tech Center average estimate; based on career data from 7 resolved cases.

Office Action

§101, §103, §112
DETAILED ACTION

This communication is in response to the Application filed on 06/27/2024. Claims 1-16 and 19-22 are pending and have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Receipt is acknowledged that this application claims priority to foreign application CN 202210392419.0 dated 04/15/2022. Copies of certified papers required by 37 CFR 1.55 have been received. Priority is acknowledged under 35 USC 119(e) and 37 CFR 1.78.

Information Disclosure Statement

The IDS dated 06/27/2024 has been considered and placed in the application file.

Claim Objections

Claims 1, 3, 5, 6, 9, 10, 12, 13, 15, 16, and 22 are objected to because of the following informalities:
- Claim 1, line 8, should be "corresponding to the prompt template."
- Claim 3, line 3, should be "forming
- Claim 5, line 2, should be "initializing a prompt template format; and"
- Claim 6, line 4, should be "encoding the input content using an SBERT method"
- Claim 9, line 4, should be "vectorizing each word in the vocabulary using a word2vec method, and determining a near-synonym set corresponding to each tag via a cosine similarity"
- Claim 9, lines 5-6, should be "selecting, for each category in the training set, a word in the vocabulary that maximizes a conditional probability"
- Claim 9, line 8, should be "model that is not fine-tuned [[:]] ;"
- Claim 9, lines 11-12, should be "determining an assignment mode which maximizes an accuracy rate of the training set as the optimal candidate tag word."
- Claim 10, line 2, should be "determining the conditional probability set through a formula:"
- Claim 10, lines 4-7, should be "wherein Topk is a word with a maximum conditional probability … P_L represents an output probability distribution based on the model ℒ"
- Claim 12, line 7, should be "to obtain a search space list; and"
- Claim 13, line 4, should be "obtaining the search space list [[,]] to determine"
- Claim 15, line 3, should be "inputting
- Claim 15, line , should be "calculating a loss of the output result and the optimal tag word."
- Claim 15, lines 7-8, should be "determining, by the agent, selection directions of subsequent templates and tag words according to the reward until the optimal tag word and the prompt template are determined."
- Claim 16, lines 3-5, should be "when an input is textless, averaging an output tag word corresponding probability … and calculating a correction matrix according to a formula"
- Claim 22 is objected to for similar reasons as described above with respect to claim 9.

Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. § 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 3, 7, 9-13, and 22 are rejected under 35 U.S.C. § 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.

Claim 3 recites "The small sample fine-tuning method according to claim, further comprising:". It is unclear upon which claim claim 3 depends. For the purpose of prior art analysis, Examiner assumes claim 3 depends on claim 1.

Claim 9 recites "vectorizing all the words in the vocabulary using a word2vec method, and determining a near-synonym set corresponding to each tag via the cosine similarity." It is unclear what tag the limitation "each tag" refers to. Interpreting the claim language under the broadest reasonable interpretation, "each tag" could refer to the automatically selected optimal candidate tag word, the initial set of candidate tags, the set of vectorized vocabulary words, or a tag within a separate training, validation, or testing set.
For the purpose of prior art analysis, Examiner assumes that "each tag" refers to tags within a training set.

Claim 9 further recites "determining a candidate tag word under each category as a maximum value of a geometric intersection of the near-synonym set and the conditional probability." It is unclear what element the limitation "the conditional probability" refers to. As currently claimed, a geometric intersection between a set of words and a single numerical probability value is impossible, since a single numerical value lacks the geometric properties necessary to form the claimed intersection. It is noted that the previous limitation (i.e., "selecting…") recites a "conditional probability set," and that ¶ [0094] of the specification specifies that the intersection for "determining the candidate tag word under each category" occurs between "the near-synonym set" and "the conditional probability set." For the purpose of compact prosecution, Examiner assumes that "the conditional probability" refers to the "conditional probability set" of the previous limitation (i.e., "selecting…"), and that the "conditional probability set" refers to the "Topk word set V_c that maximizes the conditional probability" of ¶ [0091].

Claim 7 is rejected due to its dependence upon claim 3. Claims 10-13 are rejected due to their dependence upon claim 9. Claim 22 is rejected as analogous to claim 9.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-16 and 19-22 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Under Step 1, all of the claims fall within a statutory category as method claims (1-16) or apparatus/manufacture claims (19-22); however, under Step 2A all of these claims recite abstract ideas and specifically mental processes. These mental processes are recited in claims 1, 19, and 20 as: forming an input sample according to a fixed template…; constructing a candidate tag word set and a candidate prompt template set…; searching an optimal tag word corresponding to the input sample from the candidate tag word set, and a prompt template…; and outputting a mapping relationship of the optimal tag word and an optimal prompt template format corresponding to the prompt template….

Under Step 2A Prong One, claims 1, 19, and 20 are directed to an abstract idea and specifically a mental process. As detailed above, the steps of forming, constructing, searching, outputting, etc. may be practically performed in the human mind with the use of a physical aid such as pen and paper. For example, a human researcher could receive a training dataset with sentence/label tuples, format each input using a fixed template, randomly sample the formatted tuples to obtain a set of candidate labels, create a set of candidate templates for each input by formatting the input sentence according to various placeholder templates, and task a second human with finding a mapping that indicates an optimal label and optimal template from the sets of candidate labels and candidate templates, respectively, based on their personal experience.

Under Step 2A Prong Two, this judicial exception is not integrated into a practical application because claims 2-16, 21, and 22 do not recite additional elements that integrate the exception into a practical application. In particular, claims 1, 19, and 20 recite the additional elements of a non-volatile readable storage medium (¶ [00143]) and a processor (¶ [00144]).
These additional elements are recited at a high level of generality and merely equate to "apply it," or otherwise merely use a generic computer as a tool to perform an abstract idea, which is not indicative of integration into a practical application per MPEP 2106.05(f). Further, claims 1, 19, and 20 recite the additional element of "inputting…", which amounts to insignificant extra-solution activity that is not indicative of integration into a practical application per MPEP 2106.05(g). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Under Step 2B, the claims do not recite additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements amount to use of a generic computer {non-volatile readable storage medium (¶ [00143]); processor (¶ [00144])}. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Further, the additional limitations in the claim noted above are directed towards insignificant extra-solution activities. The claim is not patent eligible.

With respect to claim 2, the claim relates to splitting a dataset into validation, training, and test datasets. This relates to a human researcher splitting a dataset manually by randomly sampling a fixed number of tuples from a larger set to form each individual set. No additional limitations are present. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
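For context, the dataset split at issue in claim 2 (randomly sampling a fixed number of sentence/label tuples to form each set) can be sketched in a few lines of Python. The function name, split sizes, and seeding below are illustrative assumptions, not the Applicant's or the references' implementation:

```python
import random

def split_dataset(dataset, k_train, k_val, seed=0):
    """Randomly split (sentence, label) tuples into a few-shot training set,
    a validation set, and a held-out test set of the remaining tuples.
    Illustrative sketch only; sizes and seeding are assumptions."""
    rng = random.Random(seed)
    shuffled = dataset[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    train = shuffled[:k_train]
    val = shuffled[k_train:k_train + k_val]
    test = shuffled[k_train + k_val:]
    return train, val, test
```

Each tuple lands in exactly one of the three sets, mirroring the claimed training/validation/test division.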
With respect to claim 3, the claim relates to data in a dataset having ID, sentence, and label attributes. This relates to a human researcher receiving a dataset including a dataset name, sentence type information, and label information. No additional limitations are present. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

With respect to claim 4, the claim relates to forming the input sample by measuring cosine distance and random sampling. This relates to a human researcher picking tuples for the training or validation set by defining a test set comprising a select number of tuples, measuring a cosine similarity between each remaining tuple in the dataset and the tuples included in the testing set, and then randomly sampling the remaining tuples in order to form the training or validation sets. No additional limitations are present. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

With respect to claims 5 and 7, the claims relate to initializing and converting prompt templates. This relates to a human researcher formatting input data using placeholder templates. No additional limitations are present. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

With respect to claim 6, the claim relates to encoding and calculating cosine similarities. This relates to a human researcher following the step-wise procedure of Sentence-BERT to create sentence embeddings for each input tuple, and then using the sentence embeddings to calculate a cosine similarity between all sentences in a dataset. No additional limitations are present. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
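The claim 6 step (encode each input, then rank training samples by cosine similarity to a validation query) can be sketched as follows. The `encode` callable stands in for a real Sentence-BERT model, and the function names are assumptions for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_training_samples(encode, query, train_sentences):
    """Compute the cosine similarity of one query to every training
    sample and return the samples ranked from most to least similar.
    `encode` is a stand-in for an SBERT-style sentence encoder."""
    q = encode(query)
    scored = [(s, cosine_similarity(q, encode(s))) for s in train_sentences]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Any encoder that maps a sentence to a fixed-length vector can be plugged in; an identical sentence always ranks first with similarity 1.0.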
With respect to claims 8 and 21, the claims relate to automatically selecting tag words and templates. This relates to the second human understanding their task to predict labels and templates from a search space list, and then performing the prompt tuning immediately after receiving the dataset from the human researcher, without requiring initial instructions. No additional limitations are present. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

With respect to claims 9 and 22, the claims relate to further defining the process of automatically selecting candidate tag words. This relates to a human researcher converting labels to vectors following the word2vec technique, determining a near-synonym set of labels for a training set, calculating a conditional probability P(y | x) for each label embedding in the training set, wherein y is a predicted label and x is a label embedding in the training set, generating a conditional probability set comprising all calculated conditional probabilities, selecting the candidate label with the highest cosine similarity between the conditional probability set and the near-synonym set, and determining a candidate label that maximizes accuracy when compared to the training set. No additional limitations are present. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

With respect to claim 10, the claim relates to using an equation to determine conditional probability. This limitation is directed towards an abstract idea and, more specifically, a mathematical concept, which cannot constitute patentable subject matter per MPEP 2106.04(a)(2) Section I. No additional limitations are present. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
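Under the Examiner's reading of claims 9 and 10 (a Topk conditional-probability word set intersected with a near-synonym set), the candidate-tag-word selection could be sketched as below. The dict-based probabilities and function names are illustrative assumptions, not the claimed implementation:

```python
def topk_candidate_words(cond_prob, k):
    """Top-k vocabulary words by conditional probability, i.e. the
    claimed 'conditional probability set'. cond_prob maps each
    vocabulary word to P(word | class) from some scoring model."""
    return set(sorted(cond_prob, key=cond_prob.get, reverse=True)[:k])

def candidate_tag_words(cond_prob, near_synonyms, k):
    """Candidate tag words for one class: the intersection of the
    top-k conditional-probability set with the class's near-synonym
    set, mirroring the Examiner's reading of ¶ [0094]."""
    return topk_candidate_words(cond_prob, k) & set(near_synonyms)
```

The final assignment step of claim 9 would then score each surviving candidate against the training set and keep the one maximizing accuracy.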
With respect to claim 11, the claim relates to further defining the process of automatically selecting candidate templates. This relates to a human researcher using the above mental process of claim 9 to determine an optimal candidate tag word, converting training sentences into input sequences using various placeholder templates, and then performing beam search to obtain a candidate prompt template among the various input sequences. No additional limitations are present. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

With respect to claims 12 and 13, the claims relate to further defining the process of searching. This relates to a human researcher randomly sampling three candidate labels for each category and then concatenating the candidate label set with a template set in order to obtain a search space list with a masked token. The human researcher could then hand this search list to the second human, who could then rely on their experience with language in order to replace the masked token with the most appropriate label. The second human could also indicate which template of the template set they were most confident about predicting. No additional limitations are present. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

With respect to claims 14 and 15, the claims relate to using reinforcement learning to perform learning. This relates to a second human receiving a search space list from the human researcher and predicting a label for a given input. The human researcher could then calculate a loss value by measuring the distance between the predicted label and the test label. The human researcher could then inform the second human of this loss value, and the second human could re-align their mental processes to better account for that loss.
The second human's re-aligned mental processes may select a different or better selection direction than the previous prediction process based on the loss. The additional element of a "language model" is recited at a high level of generality (¶ [0096]) and merely equates to "apply it," or otherwise merely uses a generic computer as a tool to perform an abstract idea, which is not indicative of integration into a practical application per MPEP 2106.05(f). No additional limitations are present. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

With respect to claim 16, the claim relates to averaging and normalizing conditional probabilities in order to calculate a correction matrix. This limitation is directed towards an abstract idea and, more specifically, a mathematical concept, which cannot constitute patentable subject matter per MPEP 2106.04(a)(2) Section I. No additional limitations are present. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Claim 19 is drawn to a "signal" per se as recited in the preamble and as such is non-statutory subject matter. In paragraph [00143] of the as-filed Specification, the term "non-volatile readable storage medium" is not defined as to what the scope of the term is meant to encompass. Hence, one of ordinary skill in the art can interpret such a term to include transitory signals and non-transitory signals. It does not appear that a claim reciting a signal encoded with functional descriptive material falls within any of the categories of patentable subject matter set forth in § 101. First, a claimed signal is clearly not a "process" under § 101 because it is not a series of steps.
The other three § 101 classes of machines, compositions of matter, and manufactures "relate to structural entities and can be grouped as 'product' claims in order to contrast them with process claims." 1 D. Chisum, Patents § 1.02 (1994). The Applicant's Specification presents a broad definition of what the "non-volatile readable storage medium" covers, one made to include transitory and non-transitory signals. The Applicant's as-filed Specification, in paragraph [00143], refers to the "storage medium." Hence, the claims appear to be drawn to transitory signals, which are not subject matter eligible. In order to overcome the present rejection, the Applicant is advised to amend the claims by using the following terminology: "non-transitory machine readable storage medium." Such example terminology has also been found in the Official Gazette 1351 OG 212. For all of the above reasons, taken alone or in combination, claims 1-16 and 19-22 recite a non-statutory mental process.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8, 14, 15 and 19-21 are rejected under 35 U.S.C. 103 as obvious over "Making Pre-trained Language Models Better Few-shot Learners" (Gao et al.) in view of US Patent Publication 20190370219 A1 (Efstathiou et al.).

Claim 1

Regarding claim 1, Gao et al. 
disclose a small sample fine-tuning method, comprising: inputting a data set (Gao et al. pg. 3818, Section 3, Paragraph 2, "We conduct a systematic study across 8 single-sentence and 7 sentence-pair English tasks, including 8 tasks from the GLUE benchmark (Wang et al., 2019), SNLI (Bowman et al., 2015), and 6 other popular sentence classification tasks (SST-5, MR, CR, MPQA, Subj, TREC). All of the dataset details are provided in Appendix B."), and forming an input sample according to a fixed template (Gao et al. pg. 3819, Section 4, Paragraph 2, "we can formulate a binary sentiment classification task using a prompt with input x1 (e.g., "No reason to watch it .") as: x_prompt = [CLS] x1 It was [MASK] . [SEP]"); constructing a candidate tag word set (Gao et al. pg. 3820, Section 5.1, Paragraph 1, "for each class c ∈ Y, we construct a pruned set V_c ⊂ V of the top k vocabulary words based on their conditional likelihood using the initial ℒ.") and a candidate prompt template set (Gao et al. pg. 3820, Section 5.2, Paragraph 1, "we study how to generate a diverse set of templates {𝒯} automatically from a fixed set of label words ℳ(𝒴)."); [by means of reinforcement learning,] searching an optimal tag word corresponding to the input sample from the candidate tag word set (Gao et al. pg. 3820, Section 5.1, Paragraph 1, "To further narrow down the search space, we find the top n assignments over the pruned space that maximize zero-shot accuracy on D_train (both n and k are hyper-parameters, see Appendix C.2). Then we fine-tune all top n assignments, and re-rank to find the best one using D_dev." Finding the best label assignment is considered analogous to searching an optimal tag word), and a prompt template corresponding to the input sample from the candidate prompt template set (Gao et al. pg. 3821, Section 5.2, Paragraph 4, "we use a wide beam width (e.g., 100) to cheaply obtain a large set of diverse templates."); and outputting a mapping relationship of the optimal tag word (Gao et al. pg. 3820, Section 5.1, Paragraph 1, "We first study how to construct a label word mapping ℳ that maximizes accuracy on D_dev after fine-tuning, given a fixed template 𝒯.") and an optimal prompt template format corresponding to the prompt template (Gao et al. pg. 3821, Section 5.2, Paragraph 4, "We then fine-tune each generated template on D_train and use D_dev to either pick the single template with the best performance (Table 3), or the top k templates to use as an ensemble (Table 4).").

Gao et al. do not explicitly disclose reinforcement learning. However, Efstathiou et al. disclose inputting a data set (Efstathiou et al. ¶ [0117], "FIG. 3 shows a method of labeling unlabeled data according to an embodiment. The method starts 30 with the retrieval of a set of labeled data and a set of unlabeled data."), [and forming an input sample according to a fixed template]; constructing a candidate tag word set (Efstathiou et al. ¶ [0118], "Labels are then assigned to the unlabeled data 34 based on the confidence score output by the classifier. The classifier may be a multi-output classifier (a classifier that classifies data into one of a plurality of classes)." A set of labels/classes is considered analogous to a candidate tag word set) [and a candidate prompt template set]; by means of reinforcement learning, searching an optimal tag word corresponding to the input sample from the candidate tag word set (Efstathiou et al. ¶ [0153], "The training system is therefore able to learn, via reinforcement learning, the best actions for retraining the classifier. Each action may include generating new instances of labeled data and retraining the classifier on these new instances. 
The finally trained policy can then be used to train a classifier to improve its performance"), [and a prompt template corresponding to the input sample from the candidate prompt template set]; and outputting a mapping relationship of the optimal tag word (Efstathiou et al. ¶ [0153], "Each action may include generating new instances of labeled data and retraining the classifier on these new instances.") [and an optimal prompt template format corresponding to the prompt template].

It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention of the instant application to modify Gao et al.'s small sample fine-tuning method to incorporate Efstathiou et al.'s reinforcement learning. The suggestion/motivation for doing so would have been that, "By utilizing the reinforcement learning methods described herein, a system can be trained to improve the classification performance of a classifier without requiring additional manually labeled data," as noted by the Efstathiou et al. disclosure in paragraph [0062].

Claim 2

Regarding claim 2, the rejection of claim 1 is incorporated. Gao et al. further disclose dividing the data set into a training set, a validation set, and a test set (Gao et al. pg. 3818, Section 3, Paragraph 1, "we only assume K training examples per class for the task's training set D_train, such that the total number of examples is K_tot = K × |Y|, and D_train = {(x_in^i, y^i)}_{i=1}^{K_tot}. Our goal is then to develop task-agnostic learning strategies that generalize well to an unseen test set (x_in^test, y^test) ~ D_test. For model selection and hyper-parameter tuning, we assume a development set D_dev, of the same size as the few-shot training set." D_train is considered analogous to a training set. D_test is considered analogous to a testing set. D_dev is considered analogous to a validation set.); wherein the training set is configured to random sample to form the input sample (Gao et al. pg. 3818, Section 3, Paragraph 1, "we measure average performance across 5 different randomly sampled D_train and D_dev splits."); and the validation set is configured to calculate a cosine similarity (Gao et al. pg. 3820, Section 5.1, Paragraph 1, "we fine-tune all top n assignments, and re-rank to find the best one using D_dev." pg. 3822, Section 6.2, Paragraph 1, "For each query x_in and each label c ∈ Y, we sort all training instances with the label x ∈ D_train^c by their similarity score to the query cos(e(x_in), e(x))").

Claim 3

Regarding claim 3, the rejection of claim 1 is incorporated. Gao et al. further disclose forming the data in the data set according to ID attributes, sentence attributes, and label attributes, wherein the ID attributes are configured to represent IDs of the data, the sentence attributes are configured to represent contents of the data, and the label attributes are configured to represent tag words of the data (Gao et al. pg. 3829, Table B.1 describes each of the datasets used for training and evaluation. Column "Dataset" is considered analogous to ID attributes. Column "Category" is considered analogous to sentence attributes. Column "Labels (classification tasks)" is considered analogous to label attributes.).

Claim 4

Regarding claim 4, the rejection of claim 1 is incorporated. Gao et al. further disclose acquiring input content (Gao et al. pg. 3819, Section 4, Paragraph 2, "we can formulate a binary sentiment classification task using a prompt with input x1 (e.g., "No reason to watch it .")"); representing the input content in the fixed template (Gao et al. pg. 
3819, Section 4, Paragraph 2, "we can formulate a binary sentiment classification task using a prompt with input x1 (e.g., "No reason to watch it .") as: x_prompt = [CLS] x1 It was [MASK] . [SEP]"); calculating a cosine similarity between the input content and all samples in a training set (Gao et al. pg. 3822, Section 6.2, Paragraph 1, "For each query x_in and each label c ∈ Y, we sort all training instances with the label x ∈ D_train^c by their similarity score to the query cos(e(x_in), e(x)), and only sample from the top r = 50% instances for each class to use as demonstrations."); and random sampling from a preset percentage of training set samples to obtain the input sample (Gao et al. pg. 3822, Section 6.2, Paragraph 1, "For each query x_in and each label c ∈ Y, we sort all training instances with the label x ∈ D_train^c by their similarity score to the query cos(e(x_in), e(x)), and only sample from the top r = 50% instances for each class to use as demonstrations." pg. 3821, Section 6.1, Paragraph 1, "at each training step, we randomly sample one example (x_in^c, y^c) ∈ D_train from each class").

Claim 5

Regarding claim 5, the rejection of claim 4 is incorporated. Gao et al. further disclose initializing a prompt template format (Gao et al. pg. 3819, Section 4, Paragraph 2, "we can formulate a binary sentiment classification task using a prompt with input x1 (e.g., "No reason to watch it .") as: x_prompt = [CLS] x1 It was [MASK] . [SEP]"); and representing the input content in the initialized prompt template format (Gao et al. pg. 3819, Section 4, Paragraph 2, "we can formulate a binary sentiment classification task using a prompt with input x1 (e.g., "No reason to watch it .") as: x_prompt = [CLS] x1 It was [MASK] . [SEP]").

Claim 6

Regarding claim 6, the rejection of claim 4 is incorporated. Gao et al. further disclose encoding the input content using the SBERT method (Gao et al. pg. 3822, Section 6.2, Paragraph 1, "we use a pre-trained SBERT (Reimers and Gurevych, 2019) model to obtain embeddings for all input sentences"); and calculating, for each input content in a validation set, the cosine similarity to all samples in the training set respectively (Gao et al. pg. 3822, Section 6.2, Paragraph 1, "For each query x_in and each label c ∈ Y, we sort all training instances with the label x ∈ D_train^c by their similarity score to the query cos(e(x_in), e(x)), and only sample from the top r = 50% instances for each class to use as demonstrations.").

Claim 7

Regarding claim 7, the rejection of claim 3 is incorporated. Gao et al. further disclose converting the input sample to a prompts input (Gao et al. pg. 3819, Section 4, Paragraph 2, "we can formulate a binary sentiment classification task using a prompt with input x1 (e.g., "No reason to watch it .") as: x_prompt = [CLS] x1 It was [MASK] . [SEP]").

Claim 8

Regarding claim 8, the rejection of claim 1 is incorporated. Gao et al. further disclose automatically selecting the optimal candidate tag word (Gao et al. pg. 3820, Section 5, Paragraph 1, "We now explore principled ways of automating the search process for label words (§5.1) and templates (§5.2)."); and automatically selecting a candidate prompt template (Gao et al. pg. 3820, Section 5, Paragraph 1, "We now explore principled ways of automating the search process for label words (§5.1) and templates (§5.2).").

Claim 14

Regarding claim 14, the rejection of claim 1 is incorporated. Gao et al. further disclose determining the optimal tag word and the prompt template (Gao et al. pg. 
3820, Section 5, Paragraph 1, "We now explore principled ways of automating the search process for label words (§5.1) and templates (§5.2).") [by key factors in reinforcement learning, wherein the key factors comprise agent, environment, action, status, and reward]. Efstathiou et al. further disclose key factors in reinforcement learning, wherein the key factors comprise agent, environment, action, status, and reward (Efstathiou et al. ¶ [0063], "FIG. 1 shows an example of a reinforcement learning process. This shows a single episode of reinforcement learning by a single agent. The agent is on an observed state s_t at a particular time point t. The environment in reinforcement learning is typically considered to be a Markov Decision Process (MDP)" ¶ [0031], "Training the agent may comprise selecting and storing in the policy the actions that provide the highest value. The value of each action may be based on a reward for that action.").

Claim 15

Regarding claim 15, the rejection of claim 14 is incorporated. Gao et al. further disclose inputting the text into the model (Gao et al. pg. 3819, Section 4, Paragraph 2, "we can formulate a binary sentiment classification task using a prompt with input x1 (e.g., "No reason to watch it .") as: x_prompt = [CLS] x1 It was [MASK] . [SEP]") to obtain an output result (Gao et al. pg. 3819, Section 4.1, Paragraph 1, "we can treat our task as an MLM, and model the probability of predicting class y ∈ Y as: p(y | x_in) = p([MASK] = M(y) | x_in) = exp(w_M(y) · h_[MASK]) / Σ_{y' ∈ Y} exp(w_M(y') · h_[MASK]), where h_[MASK] is the hidden vector of [MASK] and w_M(y) denotes the pre-softmax vector corresponding to v ∈ V."); the model comprising a language model environment (Gao et al. pg. 3818, Section 3, Paragraph 1, "In this work, we assume access to a pre-trained language model ℒ that we wish to fine-tune on a task D with a label space 𝒴."); calculating a loss of the output result and the tag word (Gao et al. pg. 3819, Section 4.1, Paragraph 1, "When supervised examples (x_in, y) are available, ℒ can be fine-tuned to minimize the cross-entropy loss."); [feeding back the loss as the reward to the agent;] and determining, [by the agent,] selection directions of the template and the tag word [according to the reward] until the optimal tag word and the prompt template are determined (Gao et al. pg. 3820, Section 5, Paragraph 1, "We now explore principled ways of automating the search process for label words (§5.1) and templates (§5.2). Our goals are to ... find more optimal settings than those that we manually choose.").

Efstathiou et al. further disclose inputting the text into the model (Efstathiou et al. ¶ [0113]-[0114], "AlSynth is a classification method that aims to assign labels to unlabeled data based on an initial labeled training set of data.") to obtain an output result (Efstathiou et al. ¶ [0063]-[0064], "The agent is on an observed state s_t at a particular time point t. ... The agent determines an action a_t to be performed in response to the state s_t based on a policy for the agent."); the model comprising a language model environment (Efstathiou et al. ¶ [0126]-[0127], "ChopSynth operates in the same scenario as AlSynth with the same goal—to synthesize labeled data from unlabeled data based on labeled train data. ... ChopSynth is described with reference to the classification of words from text data."); calculating a loss of the output result and the tag word (Efstathiou et al. ¶ [0065], "By applying the action a_t, the agent traverses to a new state s_{t+1} in the next time point t+1. 
It then receives a reward r t from the environment at the state s t + 1 ."); feeding back the loss as the reward to the agent (Efstathiou et al. ¶ [0065], "By applying the action a t , the agent traverses to a new state s t + 1 in the next time point t + 1 . It then receives a reward r t from the environment at the state s t + 1 ."); and determining, by the agent, selection directions of [the template and] the tag word according to the reward (Efstathiou et al. ¶ [0068]-[0069], "The most appropriate action for a given state may be determined using a gradient ascent method based on the reward values. This involves locating values, connected to particular actions, that maximise the reward function's result. In this way, after the end of the learning process, the agent has successfully learnt how to traverse to the most desirable state through a series of other states, by selecting the highest in value actions (i.e. those of the optimal policy).") until the optimal tag word [and the prompt template] are determined (Efstathiou et al. ¶ [0104], "Each reinforcement learning agent works on an individual class within a multi-class classification problem." Classifying data into multiple classes using an optimal policy trained overtime is considered analogous to selecting directions of tag word classification using a reward). Claim 19 Regarding claim 19, Gao et al. disclose a non-volatile readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the method according to claim 1 (Gao et al. pg. 3816, Section 1, Paragraph 2, "In this work, we study a more practical scenario in which we only assume access to a moderately sized language model such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019).... This setting is appealing as (1) such models can be trained on typical research hardware"). 
The remaining limitations of claim 19 are identical to those of claim 1 and therefore are rejected for similar reasons as described above.

Claim 20
Regarding claim 20, Gao et al. disclose an electronic device, comprising a memory having stored thereon a computer program, and a processor that implements the steps of the method according to claim 1 when calling the computer program in the memory (Gao et al. pg. 3816, Section 1, Paragraph 2, "In this work, we study a more practical scenario in which we only assume access to a moderately sized language model such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019).... This setting is appealing as (1) such models can be trained on typical research hardware"). The remaining limitations of claim 20 are identical to those of claim 1 and therefore are rejected for similar reasons as described above.

Claim 21
Regarding claim 21, the rejection of claim 20 is incorporated. The limitations of claim 21 are similar in scope to those of claim 8 and therefore are rejected for similar reasons as described above.

Claims 9-13 and 22 are rejected under 35 U.S.C. 103 as obvious over Gao et al. in view of Efstathiou et al. as applied to claims 8 and 21 above, and further in view of US Patent Publication 2020/0409948 A1 (Corvinelli et al.).

Claim 9
Regarding claim 9, the rejection of claim 8 is incorporated. Gao et al. in view of Efstathiou et al. disclose all the elements of the claimed invention as stated above. Gao et al. further disclose initializing a vocabulary (Gao et al. pg. 3819, Section 4.1, Paragraph 1, "Let ℳ be a mapping from the task label space to individual words in the vocabulary 𝒱 of ℒ."); vectorizing all the words in the vocabulary using a [word2vec] method (Gao et al. pg. 3822, Section 6.2, Paragraph 1, "we use a pre-trained SBERT (Reimers and Gurevych, 2019) model to obtain embeddings for all input sentences (for sentence-pair tasks, we use the concatenation of the two sentences)."), and determining a near-synonym set corresponding to each tag via the cosine similarity (Gao et al. pg. 3822, Section 6.2, Paragraph 1, "For each query x_in and each label c ∈ 𝒴, we sort all training instances with the label x ∈ D_train^c by their similarity score to the query cos(e(x_in), e(x)), and only sample from the top r = 50% instances for each class to use as demonstrations."); selecting, for each category in the training set, a word in the vocabulary that maximizes a conditional probability, and a conditional probability set comprising the word (Gao et al. pg. 3820, Section 5.1, Paragraph 1, "let D_train^c ⊂ D_train be the subset of all examples of class c. We take 𝒱^c as Top-k_{v ∈ 𝒱} Σ_{x_in ∈ D_train^c} log P_ℒ([MASK] = v | 𝒯(x_in)), where P_ℒ denotes the output probability distribution of ℒ". P_ℒ is considered analogous to a conditional probability. Thus, 𝒱^c is considered analogous to selecting a word in the vocabulary for each category that maximizes a conditional probability. The Top-k computation of the above equation is considered analogous to a conditional probability set), by a pretraining model that is not fine-tuned (Gao et al. pg. 3818, Section 3, Paragraph 1, "In this work, we assume access to a pre-trained language model ℒ that we wish to fine-tune on a task D with a label space 𝒴."; pg. 3820, Section 5.1, Paragraph 1, "for each class c ∈ 𝒴, we construct a pruned set 𝒱^c ⊂ 𝒱 of the top k vocabulary words based on their conditional likelihood using the initial ℒ."); determining a candidate tag word under each category as a maximum value of a geometric intersection (Gao et al. pg. 3822, Section 6.2, Paragraph 1, "we devise a simple strategy in which we only sample examples that are semantically close to x_in. ... For each query x_in and each label c ∈ 𝒴, we sort all training instances with the label x ∈ D_train^c by their similarity score to the query cos(e(x_in), e(x))". The highest-ranked training instance according to cosine similarity is considered analogous to a candidate tag word with a maximum value of geometric intersection.) of the near-synonym set and the conditional probability (Gao et al. pg. 3828, Section C.2, Paragraph 1, "For TREC, we observe that filtering 𝒱^c using conditional likelihood alone is still noisy, thus we set k = 1000, and then re-rank 𝒱^c by the nearest neighbors of the original manual label words". A filtered 𝒱^c using conditional likelihood is considered analogous to a conditional probability set (see section Claim Rejections - 35 USC § 112). The set of original manual label words is considered analogous to a near-synonym set, since the set of original manual label words is sampled using cosine similarity as described in Section 6.2); and integrating candidate tag words under various categories, and determining an assignment mode which maximizes the accuracy rate of the training set as the optimal candidate tag word (Gao et al. pg. 3820, Section 5.1, Paragraph 1, "To further narrow down the search space, we find the top n assignments over the pruned space that maximize zero-shot accuracy on D_train (both n and k are hyper-parameters, see Appendix C.2). Then we fine-tune all top n assignments, and re-rank to find the best one using D_dev." Finding the top n assignments is considered analogous to determining an assignment mode). Gao et al. in view of Efstathiou et al. do not explicitly disclose a word2vec method. However, Corvinelli et al. disclose a word2vec method (Corvinelli et al. ¶ [0019], "FIG. 2 depicts a SQL embedding layer 200 in accordance with at least one embodiment of the present invention. ... In at least one embodiment, SQL embedding layer is configured to utilize a Word2vec model to create word embeddings."). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention of the instant application to modify Gao et al.'s small sample fine-tuning method to include Corvinelli et al.'s word2vec embedding method, because such a modification is the result of the simple substitution of one known element for another to produce a predictable result. More specifically, Gao et al.'s SBERT encoding and Corvinelli et al.'s word2vec encoding perform the same general and predictable function, namely generating vectors from datasets. Since each individual element and its function are shown in the prior art, albeit in separate references, the difference between the claimed subject matter and the prior art rests not on any individual element or function but in the very combination itself, that is, in the substitution of Gao et al.'s SBERT encoding with Corvinelli et al.'s word2vec encoding. Thus, the simple substitution of one known element for another producing a predictable result renders the claim obvious.

Claim 10
Regarding claim 10, the rejection of claim 9 is incorporated. Gao et al. further disclose determining the conditional probability set through the formula: Top-k_{v ∈ 𝒱} Σ_{x_in ∈ D_train^c} log P_ℒ([MASK] = v | 𝒯(x_in)), wherein Topk is a word with the maximum conditional probability; 𝒱 is an initialization vocabulary; ℒ is a pre-trained model that is not fine-tuned; c is each category in the training set; P_ℒ represents the output probability distribution based on the model ℒ; and 𝒯(x_in) is an input sample (Gao et al. pg. 3820, Section 5.1, Paragraph 1, "for each class c ∈ 𝒴, we construct a pruned set 𝒱^c ⊂ 𝒱 of the top k vocabulary words based on their conditional likelihood using the initial ℒ. That is, let D_train^c ⊂ D_train be the subset of all examples of class c. We take 𝒱^c as Top-k_{v ∈ 𝒱} Σ_{x_in ∈ D_train^c} log P_ℒ([MASK] = v | 𝒯(x_in)), where P_ℒ denotes the output probability distribution of ℒ").

Claim 11
Regarding claim 11, the rejection of claim 9 is incorporated. Gao et al. further disclose determining the optimal candidate tag word (Gao et al. pg. 3820, Section 5.1, Paragraph 1, "We first study how to construct a label word mapping ℳ that maximizes accuracy on D_dev after fine-tuning, given a fixed template 𝒯."); generating an initial prompt template by filling a placeholder (Gao et al. pg. 3821, Section 5.2, Paragraphs 2-3, "Given an input example (x_in, y) ∈ D_train, we consider the following simple conversions, denoted as 𝒯_g(x_in, y), for formulating the T5 model inputs: [see mappings following paragraph]. As shown in Figure 2, we rely on the T5 model to fill in the placeholders."); wherein the initial prompt template is configured to maximize an output probability in the training set (Gao et al. pg. 3821, Section 5.2, Paragraph 3, "When decoding, our goal here is to find an output that can work well for all examples in D_train, i.e., the output template 𝒯 that maximizes Σ_{(x_in, y) ∈ D_train} log P_T5(𝒯 | 𝒯_g(x_in, y)), where P_T5 denotes the output probability distribution of T5."); and decoding the initial prompt template using a bundle search algorithm to obtain the candidate prompt template (Gao et al. pg. 3821, Section 5.2, Paragraph 4, "We use beam search to decode multiple template candidates. Concretely, we use a wide beam width (e.g., 100) to cheaply obtain a large set of diverse templates. We then fine-tune each generated template on D_train and use D_dev to either pick the single template with the best performance (Table 3), or the top k templates to use as an ensemble (Table 4)." Beam search is considered analogous to a bundle search algorithm).

Claim 12
Regarding claim 12, the rejection of claim 11 is incorporated. Gao et al. further disclose determining a preset number of candidate tag word set for each category (Gao et al. pg. 3821, Section 6.1, Paragraph 1, "at each training step, we randomly sample one⁹ example (x_in^c, y^c) ∈ D_train from each class"; pg. 3821, Column 2, Footnote 9, "We also explored sampling multiple examples per class, but did not observe any improvements." Experimenting with different sampling numbers and settling on a standard value (e.g., "one") is considered analogous to determining a preset number); combining the candidate tag word set with a template set corresponding to the candidate prompt template to obtain a search space list (Gao et al. pg. 3821, Section 6.1, Paragraph 1, "at each training step, we randomly sample one⁹ example (x_in^c, y^c) ∈ D_train from each class, convert it into 𝒯(x_in^c) with [MASK] replaced by ℳ(y^c)—we denote this as 𝒯̃(x_in^c, y^c)—and then concatenate them with x_in:" See Figure 1(c), which illustrates the combined search space list); by means of the search space list, determining an optimal tag word corresponding to the input sample from the candidate tag word set, and a prompt template corresponding to the input sample from the candidate prompt template set (Gao et al. pg. 3820, Section 5, Paragraph 1, "We now explore principled ways of automating the search process for label words (§5.1) and templates (§5.2). Our goals are to ... find more optimal settings than those that we manually choose.").

Claim 13
Regarding claim 13, the rejection of claim 12 is incorporated. Gao et al. further disclose, by combining the candidate tag word set with a template set corresponding to the candidate prompt template, obtaining the search space list (Gao et al. pg. 3821, Section 6.1, Paragraph 1, "at each training step, we randomly sample one⁹ example (x_in^c, y^c) ∈ D_train from each class, convert it into 𝒯(x_in^c) with [MASK] replaced by ℳ(y^c)—we denote this as 𝒯̃(x_in^c, y^c)—and then concatenate them with x_in:" See Figure 1(c), which illustrates the combined search space list) to determine the optimal assignment mode of the candidate tag word and the candidate prompt template in the fine-tuning process (Gao et al. pg. 3821, Section 5.2, Paragraph 4, "We then fine-tune each generated template on D_train and use D_dev to either pick the single template with the best performance (Table 3), or the top k templates to use as an ensemble (Table 4).").

Claim 22
Regarding claim 22, the rejection of claim 21 is incorporated. The limitations of claim 22 are similar in scope to those of claim 9 and therefore are rejected for similar reasons as described above.

Claim 16 is rejected under 35 U.S.C. 103 as obvious over Gao et al. in view of Efstathiou et al. as applied to claim 1, and further in view of "Calibrate Before Use: Improving Few-Shot Performance of Language Models" (Zhao et al.).

Claim 16
Regarding claim 16, the rejection of claim 1 is incorporated. Gao et al. in view of Efstathiou et al. disclose all the elements of the claimed invention as stated above. Gao et al. further disclose [when the input is textless,] averaging the output tag word corresponding probability (Gao et al. pg. 3828, Appendix C.3, Paragraph 1, "When using demonstrations, we sample 16 different sets of demonstrations for each input and average the predicted log probability for each class during inference.") [and then normalizing to obtain a normalized probability p_cf; and calculating a correction matrix according to the formula].
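As context for the claim 16 limitation recited above (averaging output probabilities, normalizing to obtain p_cf, and forming a correction matrix [diag(p_cf)]⁻¹), the calibration step can be sketched as follows. This is a minimal illustration only; all numeric values are hypothetical and are not taken from the application or the cited references.

```python
import numpy as np

# Hypothetical label probabilities the model assigns to three content-free
# inputs (e.g., "N/A", "[MASK]", ""): one row per input, one column per class.
p_content_free = np.array([
    [0.70, 0.30],
    [0.60, 0.40],
    [0.65, 0.35],
])

# Average over the content-free inputs, then renormalize to one -> p_cf.
p_cf = p_content_free.mean(axis=0)
p_cf = p_cf / p_cf.sum()

# Correction matrix W = [diag(p_cf)]^-1, as recited in the claim limitation.
W = np.linalg.inv(np.diag(p_cf))

# Applying W to a (hypothetical) uncalibrated prediction and renormalizing
# yields the calibrated class probabilities.
p = np.array([0.68, 0.32])
q = W @ p
q = q / q.sum()
```

In this sketch the correction down-weights the class favored by the content-free inputs, which is the bias-removal effect the Zhao et al. quotation describes.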
Efstathiou et al. further disclose textless input (Efstathiou et al. ¶ [0127], "the ChopSynth method uses frequent (but not the overly common and therefore, meaningless) words to identify sequences of words in the unlabeled instances. If applied to other data types such as image data, common features in the labeled set are used to identify equivalent features in the unlabeled set to generate new samples."). Gao et al. in view of Efstathiou et al. do not explicitly disclose a normalization or the calculation of a correction matrix. However, Zhao et al. disclose averaging the output tag word corresponding probability (Zhao et al. pg. 5, Section 5, Paragraph 4, "In all our experiments, we average the probabilities from three content-free inputs: “N/A”, “[MASK]”, and the empty string.") and then normalizing to obtain a normalized probability p_cf (Zhao et al. pg. 5, Section 5, Paragraph 1, "For classification tasks, p̂ is the set of probabilities that are associated with each label name, renormalized to one."); and calculating a correction matrix according to the formula [diag(p_cf)]⁻¹ (Zhao et al. pg. 5, Section 5, Paragraph 3, "We first obtain p̂ for the content-free input, denoted p̂_cf. We then set W = diag(p̂_cf)⁻¹"). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention of the instant application to modify Gao et al. in view of Efstathiou et al. to incorporate Zhao et al.'s normalization and correction. The suggestion/motivation for doing so would have been that "LMs are biased towards outputting answers that are (1) frequent in the prompt (majority label bias), (2) towards the end of the prompt (recency bias), and (3) common in the pre-training data (common token bias) … we look to correct this [bias] by 'calibrating' the model's output probabilities. A common technique for adjusting output probabilities is to apply an affine transformation … where a weight matrix W and a bias vector b are applied to the original probabilities p̂ to get the new probabilities," as noted by the Zhao et al. disclosure at pg. 4, Section 4, Paragraph 1, and pg. 5, Section 5, Paragraph 1.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACOB B VOGT, whose telephone number is (571) 272-7028. The examiner can normally be reached Monday - Friday, 9:30 am - 7:00 pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Paras D. Shah, can be reached at (571) 270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JACOB B VOGT/
Examiner, Art Unit 2653
/Paras D Shah/
Supervisory Patent Examiner, Art Unit 2653
03/06/2026
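For orientation on the label-word search formula quoted from Gao et al. Section 5.1 throughout the claim 9 and claim 10 analysis, the Top-k pruning can be sketched as follows. The vocabulary, probabilities, and helper names here are hypothetical illustrations and are not taken from the record.

```python
import math

def top_k_label_words(vocab, train_inputs, mask_prob, k):
    """Top-k pruning per the quoted formula: for a class c, score each
    vocabulary word v by the sum over prompted training inputs T(x_in) of
    log P_L([MASK] = v | T(x_in)), then keep the k highest-scoring words.
    mask_prob(x, v) stands in for the un-fine-tuned model's output
    distribution P_L (hypothetical callable)."""
    scores = {
        v: sum(math.log(mask_prob(x, v)) for x in train_inputs)
        for v in vocab
    }
    return sorted(vocab, key=lambda v: scores[v], reverse=True)[:k]

# Toy stand-in for the model's [MASK] distribution on one prompted input.
toy_probs = {
    ("It was [MASK].", "great"): 0.6,
    ("It was [MASK].", "terrible"): 0.1,
    ("It was [MASK].", "okay"): 0.3,
}

def prob(x, v):
    return toy_probs[(x, v)]

candidates = top_k_label_words(
    ["great", "terrible", "okay"], ["It was [MASK]."], prob, k=2
)
# candidates -> ["great", "okay"]
```

The pruned set corresponds to 𝒱^c in the quoted passage; the subsequent re-ranking by accuracy on D_train and D_dev is a separate step not shown here.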

Prosecution Timeline

Jun 27, 2024
Application Filed
Mar 05, 2026
Non-Final Rejection — §101, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12505279
METHOD AND SYSTEM FOR DOMAIN ADAPTATION OF SOCIAL MEDIA TEXT USING LEXICAL DATA TRANSFORMATIONS
2y 5m to grant Granted Dec 23, 2025
Study what changed to get past this examiner. Based on the most recent grant.


Prosecution Projections

1-2
Expected OA Rounds
57%
Grant Probability
99%
With Interview (+100.0%)
2y 10m
Median Time to Grant
Low
PTA Risk
Based on 7 resolved cases by this examiner. Grant probability derived from career allow rate.
