Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Arguments and amendments filed 2/12/2026 have been examined.
Claims 1 and 19-20 have been amended.
Claims 1-20 are currently pending.
Response to Arguments
Applicant’s arguments with respect to the claims and the previous prior art rejection have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 11, 13-14, 16, and 19-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Peng et al., US Pub. No. 2024/0070394 A1, in view of Bathwal et al., US Pub. No. 2024/0281446 A1, in view of Gurgu et al., US Pub. No. 2023/0297887 A1, in view of Sewak et al., US Pub. No. 2022/0414137 A1.
As to claim 1 (and substantially similar claim 19 and claim 20),
Peng discloses
a computer-implemented method,
(Peng [0049-0053])
for generating a synthetic training dataset to fine-tune an embedding model,
(Peng Fig. 6, item 604: “Generate one or more training input sequences by prepending the input with one or more soft prompts, respectively 604”; see also [0077] At step 604, a training input sequence is generated (e.g., by a processor 410 of FIG. 4) by prepending one or more soft prompts (e.g., 110a, 110b in FIG. 2) to the input (e.g., 212a-d in FIG. 2).
See also [0002] The embodiments relate generally to natural language processing and machine learning systems, and more specifically to a trainable ensembling of soft prompts for few-shot fine-tuning of language models.)
comprising:
providing, by a processor, to a sequence model (i) a plurality of few-shot prompts,
(Peng teaches a source model is trained with the few-shot target data in a prompt-tuning manner, i.e. “providing, to a sequence model (i) a plurality of few-shot prompts”
See [0031-0032] [0031] The prompt from a source task together with the PLM 100 is jointly referred to as a source model, represented as [P1; θ]. Thus, each source model [P1; θ] is trained with the few-shot target data in a prompt-tuning manner. This enforces the source models to generate the target task's verbalizers given the target input sample 212a-d.
[0032] In one embodiment, given a labeled instance (X, y) from the few-shot target training dataset corresponding to a target task T_target, trained or untrained (e.g., randomly initialized) soft prompts (e.g., 110a-b, 120a-b, or 150a-b) may be prepended to the target input data sample X, referred to as [P1; X].
see also [0099] In Ensemble ace SP, the source prompts, as opposed to the generated logits, are provided to the attention module. In PLG, source prompts were tuned on target data sets with the few-shot target training data. Then the trained source prompts are used with the pre-trained language model to generate pseudo labels for the entire target dataset.)
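As a technical aside on the cited soft-prompt mechanism, the prepending of [P1; X] described in Peng [0031]-[0032] can be sketched as follows; the array shapes, values, and function name are illustrative assumptions, not drawn from the reference:

```python
import numpy as np

def prepend_soft_prompt(prompt_embeds, input_embeds):
    """Form [P; X] by prepending m soft-prompt embeddings to l input token embeddings."""
    # prompt_embeds: (m, d), input_embeds: (l, d) -> result: (m + l, d)
    return np.concatenate([prompt_embeds, input_embeds], axis=0)

m, l, d = 4, 10, 16
P = np.random.randn(m, d)   # trainable soft prompt (e.g., randomly initialized)
X = np.random.randn(l, d)   # frozen token embeddings of the target input sample
PX = prepend_soft_prompt(P, X)
print(PX.shape)  # (14, 16)
```

Only the soft-prompt parameters would be updated during prompt tuning; the pre-trained language model consuming the concatenated sequence remains frozen, consistent with Peng [0017].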
wherein each prompt comprises
a demonstration passage,
(Peng [0052] Ensembled Soft Prompt Tuning module 430 may receive input 440 such as an input training data (e.g., a natural language question) via the data interface 415 and generate an output 450 which may be an answer. Examples of the input data may include other types of natural language inputs such as a document, a text, etc. Examples of the output data may include an answer, a summary, an intent classification label, and/or the like.)
a demonstration task,
(Peng teaches source task training datasets, i.e. a “demonstration task” [0018] Specifically, given a set of source tasks and corresponding large-scale datasets, a task-specific source soft prompt (or a task-specific set of soft prompts) may be trained using a frozen PLM on each of the source task training datasets.;
See also [0022] Thus, an instance in a source or target task is represented as (X, y), where X is a sequence of token embeddings (X = [x1, ..., xl] ∈ ℝ^(l×d), where l is the length of the input token sequence and d is the embedding size of the PLM), and y is a classification label. Then, the class label y is mapped to its corresponding verbalizer or the verbalizer template sequence, represented by Y. Each soft prompt P1 = [p1, ..., pm] ∈ ℝ^(m×d) is also a sequence of embeddings, where m is the number of soft prompt embeddings for the task.;
see also [0021] Each soft prompt may be composed of a sequence of soft prompt embeddings, e.g., 110a-b, 120a-b, 150a-b. For example, the source tasks may be question answering, natural language inference, paraphrasing, etc.)
and a demonstration query,
(Peng teaches training questions, i.e. demonstration queries see [0067] Data vendor server 545 may correspond to a server that hosts database 519 to provide training datasets including question-answer pairs and/or the like to the server 530. )
wherein the demonstration task describes a type of retrieval,
(Peng teaches target tasks with classification labels, i.e. “wherein the demonstration task describes a type of retrieval” see [0022] Thus, an instance in a source or target task is represented as (X, y), where X is a sequence of token embeddings (X = [x1, ..., xl] ∈ ℝ^(l×d), where l is the length of the input token sequence and d is the embedding size of the PLM), and y is a classification label.)
and
(ii) a plurality of passages sampled from a corpus of passages;
(Peng teaches training using multiple text samples, see [0040] During few-shot training, the attention module 200 is updated with the few-shot labeled target samples. The attention module is thus trained to capture the sample-specific preference of different source models.;
See also [0052] Examples of the input data may include other types of natural language
inputs such as a document, a text, etc. Examples of the output data may include an answer, a summary, an intent classification label, and/or the like.)
Peng does not disclose:
and wherein the sequence model has been trained to align the demonstration query with the demonstration task,
However, Bathwal discloses:
and wherein the sequence model has been trained to align the demonstration query with the demonstration task,
(Bathwal teaches a proprietary/training dataset to be created or curated, which contains examples that are closely aligned with the specific tasks, i.e. “sequence model has been trained to align the demonstration query with the demonstration task” [0026] Examples of the training process involves leveraging large, pre-existing, and new models to generate high-quality training data, which can then be used to fine-tune smaller, more efficient models tailored to specific tasks, such as summarization tasks, citation tasks, web-interface building tasks, or the like. see also [0045] The training data generator 136 is responsible for creating the datasets used to train or fine-tune the LLMs. A training data generator might automate the collection, cleaning, and labeling of data, ensuring that the model has a diverse and representative set of examples to learn from.;
See also [0081] The proprietary dataset creator 410 enables a proprietary dataset to be created or curated, which contains examples that are closely aligned with the specific tasks the
model will perform.;
see also [0109] the loss function teaches the model to output a higher score for the preferred response in each pair of responses), and RL (e.g., running RL using the trained reward model to teach the original sequence generation (base) model to maximize mean reward to be a good reflection of human preferences.)
It would have been obvious to one having ordinary skill in the art at the time of the effective filing date to apply a proprietary/training dataset that is created or curated to contain examples closely aligned with specific tasks, as taught by Bathwal, to the system of Peng. It was known in the art that machine learning systems provide fine-tuning examples closely aligned with specific tasks, as this leverages machine learning models to optimize automated data generation, model compression, and reward-modeling enhancements that provide for summarization, as well as applicability beyond summarization. This provides a system for model training, model tuning, and fine-tuning processing underpinned by sophisticated machine learning models that have undergone extensive training and fine-tuning for increased efficacy, where the models are adept at interpreting the nuances of natural language, enabling them to extract and synthesize information from a multitude of documents. The training process involves leveraging large, pre-existing, and new models to generate high-quality training data, which can then be used to fine-tune smaller, more efficient models tailored to specific tasks, such as summarization tasks, citation tasks, web-interface building tasks, or the like. Fine-tuning allows the system to eliminate prior inefficiencies by summarizing information from various sources into a coherent and interactive response tailored to the user's specific query, using fine-tuned machine learning models to enhance search capabilities, where task-specific generative models can leverage problem structure to run smaller, faster, and cheaper at web scale. (Bathwal [0025-0026]).
Peng/Bathwal does not disclose:
receiving, by the processor, from the sequence model and for the plurality of passages and based on the plurality of few-shot prompts, a respective plurality of predicted task-query pairs,
by automatically prompting the sequence model to predict a task based on an input passage, and predict an output query for the predicted task;
generating, by the processor, a synthetic training dataset comprising the plurality of passages and the respective plurality of predicted task-query pairs for use in fine-tuning the
embedding model;
and providing by the processor the synthetic training dataset to the embedding model;
However, Gurgu discloses:
receiving, by the processor, from the sequence model and for the plurality of passages and based on the plurality of few-shot prompts, a respective plurality of predicted task-query pairs,
by automatically prompting the sequence model to predict a task based on an input passage, and predict an output query for the predicted task;
(Gurgu [0063] In some embodiments, the chatbot system 10 provides recommendations of training data that may be used by the chatbot builder 12 to train the inference models used by
the chatbot to respond to user queries. The training data may include question and answer pairs that may be generated based on information in the knowledge base 14.;
see also [0058] In some embodiments, a generated training question is funneled back into the prompt to simulate few-shot learning. The training question that is selected for the few-shot learning may be one for which the chatbot administrator provides positive feedback.
See also [0055] In one embodiment, the generative language model is provided a prompt to generate one or more training questions. The prompt may have a structure that has been
predicted to be successful in generating meaningful training questions. In some embodiments, the prompt is generated based on content that is relevant to the enterprise that is to use the chatbot. In this regard, the prompt may include a topic of discussion typical for the enterprise, text describing a context of the discussion, and describe the task as generating a question about the topic of discussion. The language model may engage in zero-shot or few-shot in-context learning to fulfill the described task.)
generating, by the processor, a synthetic training dataset comprising the plurality of passages and the respective plurality of predicted task-query pairs for use in fine-tuning the embedding model;
(Gurgu teaches providing recommendations of training data that may be used by the chatbot builder to train the inference models, i.e. “generating a synthetic training dataset” [0063] In some embodiments, the chatbot system 10 provides recommendations of training data that may be used by the chatbot builder 12 to train the inference models used by the chatbot to respond to user queries. The training data may include question and answer pairs that may be generated
based on information in the knowledge base 14. The knowledge base 14 may include any source of information for the particular enterprise that is serviced by the chatbot system 10. For example, the knowledge base 14 may include the enterprise's website, database, social media sites, and/or any other online repository of source data for the enterprise. The automatic recommendation of question and answer pairs that may be used as the training data may help expedite the training of the chatbot, which may otherwise be a time consuming process.;
see also [0067] For example, the model may be fine-tuned by adjusting values of one or more
learnable parameters of the language model for a particular task. In some embodiments, a deep neural network that has been fine-tuned based on user queries may be used to generate the embedding vectors, in addition or in lieu of the BERT model)
and providing, by the processor, the synthetic training dataset to the embedding model.
(Gurgu teaches providing recommendations of training data that may be used by the chatbot builder to train the inference models , i.e. “the synthetic training dataset to the embedding model” [0063] In some embodiments, the chatbot system 10 provides recommendations of training data that may be used by the chatbot builder 12 to train the inference models used by the chatbot to respond to user queries. The training data may include question and answer pairs that may be generated based on information in the knowledge base 14. The knowledge base 14 may include any source of information for the particular enterprise that is serviced by the chatbot system 10. For example, the knowledge base 14 may include the enterprise's website, database, social media sites, and/or any other online repository of source data for the enterprise. The automatic recommendation of question and answer pairs that may be used as the training data may help expedite the training of the chatbot, which may otherwise be a time consuming process.;
See also [0127] The suggested training questions may also be automatically included into the training dataset 508 after a filtering evaluation is made by the filtering system 302.)
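For context on fine-tuning an embedding model with the generated (query, passage) pairs, a minimal numpy sketch of an in-batch contrastive objective is shown below; the loss form, temperature value, and all names are illustrative assumptions, not taken from any cited reference:

```python
import numpy as np

def info_nce_loss(q_vecs, p_vecs, temperature=0.05):
    """In-batch softmax cross-entropy: each query's positive is the passage at
    the same batch index; all other passages serve as in-batch negatives."""
    q = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    p = p_vecs / np.linalg.norm(p_vecs, axis=1, keepdims=True)
    logits = q @ p.T / temperature                       # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

queries = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = info_nce_loss(queries, queries)        # positives match their queries
swapped = info_nce_loss(queries, queries[::-1])  # positives mismatched
print(aligned < swapped)  # True
```

A lower loss for aligned pairs illustrates the training signal that the synthetic dataset would supply to an embedding model.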
It would have been obvious to one having ordinary skill in the art at the time of the effective filing date to apply few-shot learning to generating training data, as taught by Gurgu, to the system of Peng/Bathwal. It was known in the art that a generative language model may be provided a prompt to generate one or more training questions, where the prompt has a structure predicted to be successful in generating meaningful training questions and is generated based on content relevant to the enterprise that is to use the chatbot. In this regard, the prompt may include a topic of discussion typical for the enterprise, text describing a context of the discussion, and a description of the task as generating a question about the topic of discussion, where the language model may engage in zero-shot or few-shot in-context learning to fulfill the described task. A model evaluation system evaluates the capabilities of the language model in generating meaningful training questions: feedback may be received about the generated training questions, one or more metrics may be computed based on the received feedback, aspects of the language model may be modified based on the computed metrics, and/or the prompt structure may be altered based on the computed metrics. A generated training question may be funneled back into the prompt to simulate few-shot learning, where the training question selected for the few-shot learning may be one for which the chatbot administrator provides positive feedback. (Gurgu [0055-0058]).
Peng/Bathwal/Gurgu does not disclose:
and wherein the generating of the synthetic training dataset further comprises
determining a global positive passage for a generated query by selecting a passage from the corpus of passages that has a higher relevance score than the input passage used to generate the query;
However, Sewak discloses:
and wherein the generating of the synthetic training dataset further comprises determining a global positive passage for a generated query by selecting a passage from the corpus of passages that has a higher relevance score than the input passage used to generate the query;
(Sewak teaches “produced augmented dataset could be directly used for training large and advanced NLP models” using “a text snippet” based on “obtaining a positive example” and “quantify a confidence”, i.e. “generating of the synthetic training dataset” by “determining a global positive passage for a generated query by selecting a passage from the corpus of passages that has a higher relevance score”
[0121] The method of augmentation disclosed herein holistically discovers new ideas with respect to a specific context-requirement as provided by the label description, and not just
randomly replacing words/terms/translations/generations etc. The method of augmentation presented works in a noise-resistant manner. The produced augmented dataset could be directly used for training large and advanced NLP models. Additionally, as opposed to zero-shot/few-shot classification techniques, which require an existing (unlabelled) dataset that it classifies, the disclosed method fulfils both augmentation and pre-classification requirements. The augmentation method disclosed automatically and intelligently acquires and buckets the data samples in the correct data sub-set, ready for any classification model.;
see also [0089] An exemplary method of obtaining a positive example at step 330 entails performing a search over a corpus 154 using the ordered keywords derived from the label, and using at least a portion of the augmentation method 600 shown in FIG. 6. Specifically, a search over a corpus 154 is performed at step 620 using the prioritized keywords for the label as the query. At step 625, a text snippet is obtained and the method proceeds to step 630 to
quantify a confidence that the text snippet belongs to the label class. An exemplary method to quantify class confidence is to construct a keyword structure for the text snippet, e.g. using method of performing step 710 of FIG. 9. An exemplary method of evaluating an overall semantic similarity between the keyword structure of the text snippet and the label keyword structure could be the use of cosine similarity based on a vectorized transformation of graph
terms, or some other method provided by vectorization functions 156. Other methods are disclosed herein provide a similarity score or an estimate of probability that a label properly applies to the text snippet.)
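Sewak's similarity-based selection of a higher-scoring passage ([0090]-[0091]) can be sketched as follows; the function names and the use of raw cosine similarity as the relevance score are illustrative assumptions, not the reference's exact method:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def global_positive(query_vec, input_vec, corpus_vecs):
    """Return the index of the corpus passage most relevant to the query,
    provided it scores above the input passage that generated the query;
    otherwise return None (the input passage remains the positive)."""
    best_idx, best_score = None, cosine(query_vec, input_vec)
    for i, vec in enumerate(corpus_vecs):
        score = cosine(query_vec, vec)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

q = np.array([1.0, 0.0])                         # vectorized generated query
inp = np.array([0.5, 0.5])                       # passage that produced the query
corpus = [np.array([1.0, 0.1]), np.array([0.0, 1.0])]
print(global_positive(q, inp, corpus))  # 0
```

Here the first corpus passage scores higher against the query than the originating passage, so it would be recorded as the global positive in the synthetic dataset.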
It would have been obvious to one having ordinary skill in the art at the time of the effective filing date to apply obtaining positive examples for augmenting training data, as taught by Sewak, to the system of Peng/Bathwal/Gurgu. It was known in the art that machine learning systems augment training data for large, advanced NLP models with which the underlying model could deliver better recall/FPR/accuracy. Due to the richness and variation of ideas in the data it can augment, the model could learn the context better and more holistically, which means the model could perform reasonably better on new data/domains. The augmented samples are search-based, and hence are human/enterprise-generated actual samples, ensuring that under real-life applications the models trained on these systems are more reliable and stable. The method could augment huge amounts of realistic human/enterprise-created training data for even advanced transformer-based NLP models, which require very diverse representations of ideas to learn rich contexts. (Sewak [0155-0157]).
As to claim 11, Peng as modified discloses the computer-implemented method of claim 1, wherein a particular predicted task of the plurality of predicted task-query pairs is different from the plurality of demonstration tasks provided to the sequence model
(Peng teaches various target tasks different from the source task, i.e., a predicted task of the plurality of predicted task-query pairs is different from the plurality of demonstration tasks,
See [0027] The second predicted source output may be the same or different for different source tasks.;
see also claim 2 “wherein the soft prompts are trained using a source training dataset that is on a different domain or task from the target training sample”;
See also [0052] Ensembled Soft Prompt Tuning module 430 may receive input 440 such as an input training data (e.g., a natural language question) via the data interface 415 and generate an output 450 which may be an answer. Examples of the input data may include other types of natural language inputs such as a document, a text, etc. Examples of the output data may include an answer, a summary, an intent classification label, and/or the like.;
See also [0017] Prompt tuning refers to a training and/or tuning paradigm that updates task-specific soft prompts while keeping a pre-trained language model frozen. Soft-prompt tuning
provides an efficient and effective solution for adapting large-scale pre-trained language models (PLMs) to downstream tasks because the updating of soft-prompts are relatively computationally efficient compared to updating the entire pre-trained language model.).
As to claim 13, Gurgu as modified discloses the computer-implemented method of claim 1, further comprising:
filtering the corpus of passages to remove passages that do not conform to content standards,
wherein the content standards are defined by a machine learning model trained to
classify content based on one or more of prohibited speech, harmful content, or sensitive subjects, and wherein the plurality of passages are sampled from the filtered corpus of passages (Gurgu teaches filtering unsafe/sensitive/irrelevant content i.e. filtering based on standards, See [0070] In some embodiments, the training system 110 is configured to evaluate the recommended training question to determine whether the question contains content that should be filtered. For example, a question that is predicted to contain unsafe or sensitive content may be discarded and not used as training data. In some embodiments, the training system 110 is configured to determine whether the recommended training question is semantically relevant to, for example, at least a portion of the prompt. The recommended question may be discarded if the question is predicted to be semantically irrelevant to the prompt.;
see also [0056] In some embodiments, the question that is output by the language model is filtered based on a predicted characteristic of the question. For example, a question that
is predicted to contain unsafe or sensitive content may be filtered. In other examples, a question that is likely to be semantically irrelevant to the topic and/or content of the discussion provided in the prompt may also be filtered.
See also [0090] In one embodiment, the filtering system 302 evaluates the recommended training question for determining whether all or a portion of the training question should be
filtered. In this regard, the filtering system 302 includes a fine-tuned machine learning model that predicts a characteristic of the question. For example, the machine learning model may predict whether the question can be characterized as containing unsafe or sensitive content. If the question is characterized as containing unsafe or sensitive content, the question may be discarded.).
As to claim 14, Gurgu as modified discloses the computer-implemented method of claim 13, wherein the filtering is performed by the machine learning model trained based on the content standards
(Gurgu teaches a filtering system includes a fine-tuned machine learning model for filtering unsafe/sensitive/irrelevant content i.e. “the filtering is performed by a machine learning model trained based on the content standards”,
See [0070] In some embodiments, the training system 110 is configured to evaluate the recommended training question to determine whether the question contains content that should
be filtered. For example, a question that is predicted to contain unsafe or sensitive content may be discarded and not used as training data. In some embodiments, the training system 110 is configured to determine whether the recommended training question is semantically relevant to, for example, at least a portion of the prompt. The recommended question may be discarded if the question is predicted to be semantically irrelevant to the prompt.;
see also [0056] In some embodiments, the question that is output by the language model is filtered based on a predicted characteristic of the question. For example, a question that
is predicted to contain unsafe or sensitive content may be filtered. In other examples, a question that is likely to be semantically irrelevant to the topic and/or content of the discussion provided in the prompt may also be filtered.
See also [0090] In one embodiment, the filtering system 302 evaluates the recommended training question for determining whether all or a portion of the training question should be
filtered. In this regard, the filtering system 302 includes a fine-tuned machine learning model that predicts a characteristic of the question. For example, the machine learning model may predict whether the question can be characterized as containing unsafe or sensitive content. If the question is characterized as containing unsafe or sensitive content, the question may be discarded.).
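Gurgu's content filtering ([0056], [0090]) can be sketched with a stub classifier; the keyword list below merely stands in for the fine-tuned machine learning model and is purely illustrative:

```python
UNSAFE_TERMS = {"weapon", "slur"}  # placeholder for learned content standards

def is_unsafe(passage):
    """Stub for a trained classifier predicting prohibited speech, harmful
    content, or sensitive subjects; here, a simple keyword match."""
    text = passage.lower()
    return any(term in text for term in UNSAFE_TERMS)

def filter_corpus(passages):
    """Remove non-conforming passages before sampling for prompt construction."""
    return [p for p in passages if not is_unsafe(p)]

print(filter_corpus(["A cell stores energy as ATP.", "How to build a weapon."]))
# ['A cell stores energy as ATP.']
```

In the claimed arrangement, the plurality of passages would then be sampled from this filtered corpus rather than the raw one.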
As to claim 16, Gurgu as modified discloses the computer-implemented method of claim 1, wherein the sequence model is a large language model
(Gurgu [0097] In some embodiments, a plurality of answer content is sampled for generating a plurality of prompts. The prompts are provided to the large language model for receiving various question suggestions based on the prompts.).
Claim(s) 2-10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Peng et al., US Pub. No. 2024/0070394 A1, in view of Bathwal et al., US Pub. No. 2024/0281446 A1, in view of Gurgu et al., US Pub. No. 2023/0297887 A1, in view of Sewak et al., US Pub. No. 2022/0414137 A1, in view of Muraoka et al., US Pub. No. 2024/0289558 A1.
As to claim 2, Peng/Bathwal/Gurgu/Sewak do not disclose:
providing the plurality of predicted task-query pairs to a passage retrieval model, the passage retrieval model having been trained to output, for a given predicted task-query pair, respective one or more nearest neighbor passages of the corpus of passages;
However, Muraoka discloses:
the computer-implemented method of claim 1, further comprising:
providing the plurality of predicted task-query pairs to a passage retrieval model, the passage retrieval model having been trained to output, for a given predicted task-query pair, respective one or more nearest neighbor passages of the corpus of passages
(Muraoka teaches providing pairs to a large language model, whose output is then combined with a probability distribution of the k-Nearest Neighbor search result to make a final prediction for the downstream task, see [0049-0050] [0049] In step 402, a pre-trained large language model is
obtained, as is a dataset for a downstream task performed using the large language model. [0050] According to an exemplary embodiment, the dataset fetched in step 402 includes sets of pairs, each pair having an input sentence x and a corresponding output label y. For instance, using the sentiment analysis example from above, an input sentence x/output label y pair in the dataset might be "it is a funny film"/'Positive.' Another input sentence x/output label y pair in the dataset might be "it is not very interesting"/'Negative,' and so on. As shown in
FIG. 4, the dataset is split into a training (Train) set 450 and a testing (Test) set 452 (and a development (dev) set though not shown). As will be described in detail below, the training set 450 will be leveraged to construct a datastore using the (frozen) large language model upon which a k-Nearest Neighbor search will be conducted. At the same time, the testing set 452 will be leveraged to compute a debiased output probability distribution of the (frozen) large
language model. The debiased output probability distribution of the (frozen) large language model is then combined with a probability distribution of the k-Nearest Neighbor search result to make a final prediction for the downstream task.;
see also [0052] In step 408, an instance is extracted from the testing set 452. As provided above, according to an exemplary embodiment, the training set 450 and the testing set 452 (both which are subsets of the dataset) each includes sets of pairs, each pair having an input sentence x and a corresponding output label y. Thus, in step 408, a pair (i.e., an input sentence x and a corresponding output label y) can be extracted from the testing set 452.).
It would have been obvious to one having ordinary skill in the art at the time of the effective filing date to apply providing training pairs to a nearest-neighbor search, as taught by Muraoka, to the system of Peng/Bathwal/Gurgu/Sewak. It was known in the art that machine learning systems employ a fine-tuning-free evaluation (namely zero- or few-shot evaluation) whereby a prompt is used to reformulate a downstream task as a language modelling task. A language modelling task such as sentence classification is a multiclass classification task to predict an output label y ∈ Y given an input sentence x, where Y is a pre-defined label set; for instance, this can be {Positive, Negative} in a sentiment analysis task toward movie reviews. To solve this task, a conditional probability distribution p(y|x) over labels from the large language model is computed. However, the large language model cannot directly compute it since it is not fine-tuned on this task, which is why the task is reformulated as a masked language modelling task. (Muraoka [0053]).
As to claim 3, Muraoka as modified discloses the computer-implemented method of claim 2, further comprising:
receiving, from the passage retrieval model and for the plurality of predicted task- query pairs, the respective one or more nearest neighbor passages;
(Muraoka teaches k-Nearest Neighbor search result to compute a final prediction [0042] Advantageously, provided herein are techniques that combine the debiased output probability distribution of a large language model with a probability distribution of a
k-Nearest Neighbor search result to compute a final prediction that outperforms state-of-the-art approaches on downstream tasks. Further, the present techniques also contribute
to improving the interpretability. Interpretability refers to whether the reason for a model making a particular prediction is apparent or not.)
providing, to a second sequence model, the plurality of predicted task-query pairs, and the respective one or more nearest neighbor passages;
(Muraoka [0052] In step 408, an instance is extracted from the testing set 452. As provided above, according to an exemplary embodiment, the training set 450 and the testing set
452 (both which are subsets of the dataset) each includes sets of pairs, each pair having an input sentence x and a corresponding output label y. Thus, in step 408, a pair (i.e., an
input sentence x and a corresponding output label y) can be extracted from the testing set 452.)
receiving, from the second sequence model and for each predicted passage of the one or more predicted nearest neighbor passages, an associated relevance score indicative of a relevance of the predicted passage to a predicted query;
(Muraoka teaches using distances and probability distributions between nearest neighbor instances to measure similarity, i.e. a relevance score indicative of relevance see [0063] Specifically, a feature space 422 (i.e., the datastore) is shown which is built on the feature vectors htrain obtained from the large language model and the Positive and Negative training instances (i.e., circles with a diamond-shaped pattern and un-patterned circles, respectively) closest to the testing instance hLM(prompt(x)) aka the query vector (shown as the circle with a dotted pattern). The distance between the circles illustrates the similarity between the instances, regardless of which set, training or testing set, an instance originates from. According to the present techniques, the k-Nearest Neighbor (kNN) search result, i.e., the k closest instances to the query vector hLM(prompt(x)), will be used to compute a probability distribution PkNN.)
classifying, for each task-query pair of the plurality of predicted task-query pairs and based on associated relevance scores, the one or more nearest neighbor passages into a first collection of positive passages and a second collection of negative passages, wherein a
positive passage is indicative of a high relevance to a predicted query, and wherein a negative passage is indicative of a low relevance to the predicted query,
(Muraoka teaches positive/negative classifications and labels see [0050] According to an exemplary embodiment, the dataset fetched in step 402 includes sets of pairs, each pair
having an input sentence x and a corresponding output label y. For instance, using the sentiment analysis example from above, an input sentence x/output label y pair in the dataset
might be "it is a funny film"/'Positive.' Another input sentence x/output label y pair in the dataset might be "it is not very interesting"/'Negative,' and so on.;
see also [0063] The distance between the circles illustrates the similarity between the instances, regardless of which set, training or testing set, an instance originates from. According to the present techniques, the k-Nearest Neighbor (kNN) search result, i.e., the k closest instances to the query vector hLM(prompt(x)), will be used to compute a probability distribution PkNN. See Equation 3, below. For example, when k=4, the four closest instances to the query vector hLM(prompt(x)) are used as shown in FIG. 4, which are inside the dashed circle 424. In the present example, there are three positive instances and one negative instance as the k-Nearest Neighbor instances (k=4). Thus, the positive probability is almost three times higher than the negative probability in the resultant probability distribution PkNN, see below.)
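For illustration only (an examiner's sketch, not code from Muraoka), the kNN probability distribution described in the quoted paragraph can be computed by uniform voting over the labels of the k nearest neighbors; Muraoka's Equation 3 may weight neighbors differently, so uniform voting is an assumption here.

```python
from collections import Counter

def knn_label_distribution(neighbor_labels):
    """Turn the labels of the k nearest neighbors into a probability
    distribution P_kNN over the label set (uniform voting assumed)."""
    counts = Counter(neighbor_labels)
    k = len(neighbor_labels)
    return {label: counts[label] / k for label in counts}

# Muraoka's k=4 example: three Positive neighbors, one Negative.
dist = knn_label_distribution(["Positive", "Positive", "Positive", "Negative"])
```

Consistent with the quoted example, the positive probability (0.75) comes out three times the negative probability (0.25).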
and
wherein the synthetic training dataset comprises the plurality of predicted task-query pairs, the respective first collection of positive passages, and the respective second collection of negative passages
(Muraoka teaches using negative and positive training instances see [0062] For
clarity, the query vector hLM(prompt(x)) and the Positive and Negative instances in the datastore are depicted using circles having unique patterns or lack of pattern. Namely, a circle
with a dotted pattern is used to represent the query vector hLM(prompt(x)), which serves as a testing instance in the present example. Circles with a diamond-shaped pattern are used to represent Positive training instances of the feature vectors htrain in the datastore corresponding to sentences xtrain and labels y' like '(you have to see it., Positive).' Un-patterned circles are used to represent Negative training instances of the feature vectors htrain in the datastore corresponding to sentences xtrain and labels y' like '(wait to see it., Negative).' As will be described in detail below, a probability distribution of the k-Nearest Neighbor search result will be defined over the label set that has these two labels, Positive and Negative.).
As to claim 4, Muraoka as modified discloses the computer-implemented method of claim 3, wherein the classifying further comprises:
applying one or more few-shot prompted ranking functions
(Muraoka teaches ordering by probability distribution from a few-shot process, i.e. “applying one or more few-shot prompted ranking functions” [0059] As highlighted above, predictions made by large language models can be biased by the pre-training data or by the order of training instances given as a context in a few-shot evaluation (e.g., large language models can show a
prediction bias towards those answers that occur near the end of a prompt). Thus, according to an exemplary embodiment, the output distribution PLM(y∈V|prompt(x)) is next debiased in step 416 to provide a debiased output probability distribution PdebiasedLM, where:
PdebiasedLM = wdebias · PLM(y ∈ V | prompt(x)).)
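For illustration only (not from the cited reference), the debiasing step Muraoka describes in [0059] scales each label's probability by a per-label weight and renormalizes. The concrete probabilities and weights below are hypothetical; in practice the weights might be estimated, e.g., from a content-free input.

```python
def debias(p_lm, w_debias):
    """P_debiasedLM ∝ w_debias * P_LM(y ∈ V | prompt(x)): scale each
    label's probability by its debiasing weight, then renormalize."""
    scaled = {y: w_debias[y] * p for y, p in p_lm.items()}
    z = sum(scaled.values())
    return {y: v / z for y, v in scaled.items()}

# Hypothetical biased output and hypothetical per-label weights.
p_debiased = debias({"Positive": 0.8, "Negative": 0.2},
                    {"Positive": 0.5, "Negative": 1.0})
```

Here a model biased toward Positive (0.8 vs. 0.2) is down-weighted to roughly 0.67 vs. 0.33 after renormalization.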
As to claim 5, Muraoka as modified discloses the computer-implemented method of claim 4, wherein the one or more few-shot prompted ranking functions comprise query likelihood or relevance classification
(Muraoka teaches using likelihood classifications see [0038] As provided above, zero or few shot evaluation enables performance evaluation of a large language model without fine-tuning, i.e., fine-tuning free evaluation. Namely, the term 'fine-tuning free evaluation' as used herein is an evaluation that never updates the parameters (i.e., weights) in the large language model. According to an exemplary embodiment, the present techniques employ fine-tuning free evaluation. Thus, in that case, the corresponding large language model is frozen meaning that, following pre-training, the parameters (i.e., weights) in the large language model are never changed or updated, i.e., the model parameters are 'frozen.' A large language model predicts the most likely word/label y for the replacement of a [MASK] token given an input sentence x.;
see also [0054] A large language model M is pre-trained to predict the most likely word/label y for a replacement of a [MASK] token given an input sentence x:).
As to claim 6, Muraoka as modified discloses the computer-implemented method of claim 3, further comprising:
providing, to an embedding model, each task-query pair, the respective first collection of positive passages, and the respective second collection of negative passages;
(Muraoka teaches positive/negative training sets for a model see [0050-0051] [0050] According to an exemplary embodiment, the dataset fetched in step 402 includes sets of pairs, each pair
having an input sentence x and a corresponding output label y. For instance, using the sentiment analysis example from above, an input sentence x/output label y pair in the dataset
might be "it is a funny film"/'Positive.' Another input sentence x/output label y pair in the dataset might be "it is not very interesting"/'Negative,' and so on. As shown in FIG. 4, the dataset is split into a training (Train) set 450 and a testing (Test) set 452 (and a development (dev) set, though not shown). As will be described in detail below, the training set 450 will be leveraged to construct a datastore using the (frozen) large language model upon which a k-Nearest Neighbor search will be conducted. At the same time, the testing set 452 will be leveraged to compute a debiased output probability distribution of the (frozen) large language model. The debiased output probability distribution of the (frozen) large language model is then combined with a probability distribution of the k-Nearest Neighbor search result to make a final prediction for the downstream task. [0051] Namely, in step 406, the datastore is created by applying the pre-trained (frozen) large language model to the training set 450 to obtain a multi-dimensional continuous large language model feature vector htrain from each sentence xtrain in the training set 450.)
and
causing the embedding model to be trained to embed a given input task-query pair near positive passages and away from negative passages
(Muraoka teaches an example where there are three positive instances and one negative instance as the k-Nearest Neighbor instances (k=4), such that the positive probability is almost three times higher than the negative probability, i.e. "causing the embedding model to be trained to embed a given input task-query pair near positive passages and away from negative passages" [0063] According to the present techniques, the k-Nearest Neighbor (kNN) search result, i.e., the k closest instances to the query vector hLM(prompt(x)), will be used to compute a probability distribution PkNN. See Equation 3, below. For example, when k=4, the four closest instances to the query vector hLM(prompt(x)) are used as shown in FIG. 4, which are inside the dashed circle 424. In the present example, there are three positive instances and one negative instance as the k-Nearest Neighbor instances (k=4). Thus, the positive probability is almost three times higher than the negative probability in the resultant probability distribution PkNN, see below.).
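For illustration only (an examiner's sketch, not an implementation from any cited reference), the claimed training objective of embedding a query near positive passages and away from negative passages is commonly realized with a softmax contrastive loss over similarity scores; the specific loss form and similarity values below are assumptions for the example.

```python
import math

def contrastive_loss(sim_pos, sims_neg):
    """Softmax cross-entropy over one positive passage and several
    negatives; minimizing it increases the query/positive similarity
    relative to the query/negative similarities."""
    z = math.exp(sim_pos) + sum(math.exp(s) for s in sims_neg)
    return -math.log(math.exp(sim_pos) / z)

# Hypothetical similarity scores: a well-separated vs. a poorly
# separated positive passage, against the same two negatives.
loss_good = contrastive_loss(5.0, [1.0, 0.0])
loss_bad = contrastive_loss(1.0, [1.0, 0.0])
```

A higher query/positive similarity yields a lower loss, so gradient descent on this loss pulls positives closer and pushes negatives away in the embedding space.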
As to claim 7, Muraoka as modified discloses the computer-implemented method of claim 6, wherein the embedding model is a dual encoder comprising a query tower and a document tower, wherein the query tower is trained to embed the task-query pair, and wherein the document tower is trained to embed the positive passages and negative passages
(Muraoka teaches a Bidirectional Encoder Representations and training using positive/negative feature vectors, i.e. a “dual encoder” see [0049] For instance, by way of example only, a pre-trained large language model such as Bidirectional Encoder Representations from Transformers (BERT) or its subsequent model Robustly optimized BERT approach (RoBERTa) may be employed in accordance with the present techniques.;
see also [0063] Specifically, a feature space 422 (i.e., the datastore) is shown which is built on the feature vectors htrain obtained from the large language model and the Positive and Negative training instances (i.e., circles with a diamond-shaped pattern and un-patterned circles, respectively) closest to the testing instance hLM(prompt(x)) aka the query vector (shown as the circle with a dotted pattern; see also Fig. 2).
As to claim 8, Gurgu as modified discloses the computer-implemented method of claim 7, wherein a subplurality of the plurality of passages comprise a respective title, and wherein the document tower is trained to embed the respective title
(Gurgu [0117] The identified prompt structure may include preset wording and one or more placeholders for entering content selected by, for example, the chatbot administrator. The
content may be, for example, an answer title and/or an answer content selected from the enterprise's knowledge base 14. The prompt structure may also include placeholders for one or more labeled examples that may be answered by the identified content. In some embodiments, the prompt structure is identified based on a predicted success of the machine learning model in generating an output based on the selected prompt structure.;
see also [0091] In one embodiment, the characteristic determined by the filtering system 302 is semantic and/or lexical similarity of the generated question to the input prompt. In one embodiment, filtering system 302 generates n-grams of the words contained in at least a portion of the prompt (e.g., answer title and/or answer content), and n-grams of the words contained in the generated question. The filtering system 302 may compare the n-grams to determine overlap between the generated question and the prompt. The amount of overlap in the n-grams may be used as an indication of semantic relevance of the generated question to the prompt containing at least a portion of the answer. In addition or in lieu of n-grams, a cosine similarity measure may be used to compute the semantic similarity between the generated question and the prompt.).
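For illustration only (not code from Gurgu), the n-gram overlap signal described in [0091] can be sketched as the fraction of the generated question's n-grams that also appear in the prompt; the tokenization, n-gram size, and normalization choices below are assumptions.

```python
def ngrams(text, n=2):
    """Set of word n-grams from whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap(question, prompt, n=2):
    """Fraction of the question's n-grams that also occur in the prompt,
    used as a rough lexical/semantic-relevance signal for filtering."""
    q, p = ngrams(question, n), ngrams(prompt, n)
    return len(q & p) / len(q) if q else 0.0

score = overlap("it is a funny film", "a funny film indeed")
```

A cosine similarity over embedding vectors, as the paragraph notes, could be used in addition to or instead of this lexical overlap.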
As to claim 9, Peng as modified discloses the computer-implemented method of claim 6, further comprising:
receiving an input query;
(Peng ‘394 [0053] Or the computing device 400 may receive the input 440, such as an articulated question, from a user via the user interface.)
providing the input query to the trained embedding model, wherein the trained embedding model determines a task description based on the input query, and predicts an output passage responsive to the input query and the task description;
(Peng teaches predicting a task output see [0018] The set of soft prompts are then
prepended to a target task input, based on which the frozen pre-trained language model generates a set of logits for predicting classification of the target task input, respectively.
An attention module is used to generate input-logit attention scores, which are used to compute a weighted linear combination of the logits given the attention scores. The weighted linear combination is the final logit used to predict the final classification of the target task input)
and
receiving, from the trained embedding model, the output passage
(Peng [0052] Ensembled Soft Prompt Tuning module 430 may receive input 440 such as an input training data (e.g., a natural language question) via the data interface 415 and
generate an output 450 which may be an answer. Examples of the input data may include other types of natural language inputs such as a document, a text, etc. Examples of the output data may include an answer, a summary, an intent classification label, and/or the like.;
See also [0063] For example, the user device 510 may receive a message indicating an output from the server 530 and display the message via the UI application 512.).
As to claim 10, Peng as modified discloses the computer-implemented method of claim 9, wherein the task description relates to one or more of a question-answering task, a search task, a document retrieval task, a fact-checking task, or a semantic sentence similarity task
(Peng teaches various target tasks see [0052] Ensembled Soft Prompt Tuning module 430 may receive input 440 such as an input training data (e.g., a natural language question) via the data interface 415 and generate an output 450 which may be an answer. Examples of the input data may include other types of natural language inputs such as a document, a text, etc. Examples of the output data may include an answer, a summary, an intent classification label, and/or the like.;
See also [0017] Prompt tuning refers to a training and/or tuning paradigm that updates task-specific soft prompts while keeping a pre-trained language model frozen. Soft-prompt tuning
provides an efficient and effective solution for adapting large-scale pre-trained language models (PLMs) to downstream tasks because the updating of soft-prompts are relatively computationally efficient compared to updating the entire pre-trained language model.).
Claim(s) 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over
Peng et al. US Pub. No. 2024/0070394 A1, in view of Bathwal et al., US Pub. No.: US 2024/0281446 A1, in view of Gurgu et al. US Pub. No. 2023/0297887 A1, in view of Sewak et al., US Pub. No.: 2022/0414137 A1, in view of Clement et al., US Pub. No. 2022/0245056 A1.
As to claim 12, Peng/Bathwal/Gurgu/Sewak do not disclose:
applying a beam search algorithm to cause the sequence model to predict two or more queries;
however, Clement discloses:
the computer-implemented method of claim 1, further comprising:
applying a beam search algorithm to cause the sequence model to predict two or more queries
(Clement teaches using beam search techniques to predict queries/method [0012] FIG. 6 is a flow diagram illustrating an exemplary method for using the neural transformer model with attention in a beam search to predict candidate methods;
see also [0086] The beam search uses the probability distribution generated by the neural transformer model to identify the top k subtokens likely to be the next subtoken in a method
candidate. The beam search expands the search by instantiating new partial sequences using each of the selected subtokens identified by the neural transformer model's probability distribution. The search continues generating new partial sequences from the top k subtokens identified by the output distributions until the search ends.).
It would have been obvious to one having ordinary skill in the art at the time of the effective filing date to apply beam search to query prediction as taught by Clement, to the system of Peng/Bathwal/Gurgu/Sewak, since it was known in the art that machine learning systems provide a beam search which uses the neural transformer model with the context tensor to generate a probability distribution for the subtoken vocabulary at each decoder time step, where if the probability distribution indicates that the next likely token is the end-of-method token or the maximum sequence length threshold has been exceeded, then the beam search is finished and the method candidates are output, and where otherwise the top k subtokens to complete a partial sequence are selected (Clement [0088]).
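For illustration only (an examiner's sketch, not Clement's implementation), the beam search mechanism described in [0086] and [0088] can be reduced to keeping the k highest-scoring partial sequences at each decoding step; the toy two-step "model" below is a hypothetical stand-in for a neural transformer's per-step probability distribution.

```python
import math
from heapq import nlargest

def beam_search(next_dist, k=2):
    """Keep only the k highest log-probability partial sequences at each
    step; next_dist(seq) returns {token: prob} for the next position, or
    None when the sequence is complete (e.g. an end token was emitted)."""
    beams = [((), 0.0)]  # (partial sequence, log-probability)
    while True:
        expansions, done = [], True
        for seq, score in beams:
            dist = next_dist(seq)
            if dist is None:        # sequence finished; carry it forward
                expansions.append((seq, score))
                continue
            done = False
            for tok, p in dist.items():
                expansions.append((seq + (tok,), score + math.log(p)))
        beams = nlargest(k, expansions, key=lambda b: b[1])
        if done:
            return beams

# Hypothetical toy model: sequences end after two steps, vocabulary {"a", "b"}.
def next_dist(seq):
    return None if len(seq) == 2 else {"a": 0.6, "b": 0.4}

candidates = beam_search(next_dist, k=2)
```

With k=2 the top candidate is the sequence ("a", "a"), mirroring how the search retains only the most probable partial sequences as it expands.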
Claim(s) 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over
Peng et al. US Pub. No. 2024/0070394 A1, in view of Bathwal et al., US Pub. No.: US 2024/0281446 A1, in view of Gurgu et al. US Pub. No. 2023/0297887 A1, in view of Sewak et al., US Pub. No.: 2022/0414137 A1, in view of Tajbakhsh et al. US Pub. No. 2022/0405933 A1.
As to claim 15, Peng/Bathwal/Gurgu/Sewak do not disclose:
wherein the sequence model is a large multimodal model;
However, Tajbakhsh discloses
the computer-implemented method of claim 1, wherein the sequence model is a large multimodal model
(Tajbakhsh teaches generating pretrained multimodal models see [0133] At block 420, processing logic executes an improved collaborative learning process using joint-supervision to generate pretrained multimodal models.
See also [0134-0136] [0134] For instance, according to certain embodiments, stored instructions may specially configure the system to execute a method in which the system executes an improved collaborative learning process using joint-supervision to generate supplemental training data in which joint supervision is used to create multiple supervision signals, which then in turn are used to pretrain the multimodal model. [0135] At block 425, processing logic trains the AI model using a zero-shot or few-shot learning process to integrate the supplemental training data into a refined AI model. [0136] For instance, according to certain embodiments, stored instructions may specially configure the system to execute a method in which the system trains the AI model using a zero-shot or few-shot learning process to integrate the supplemental training data generated from the improved collaborative learning process due to the application of the joint supervision which creates the multiple supervision signals for the sake of training to render the refined AI model.;
See also [0131] At block 415, processing logic initiates a training sequence of an AI model by first learning dense anatomical embeddings from a large collection of unlabeled data, then
deriving application-specific models to identify and diagnose certain diseases with a small number of examples.).
It would have been obvious to one having ordinary skill in the art at the time of the effective filing date to apply multimodal models as taught by Tajbakhsh, to the system of Peng/Bathwal/Gurgu/Sewak, since it was known in the art that machine learning systems provide an improved collaborative learning process using joint-supervision to generate pretrained multimodal models, where the system executes an improved collaborative learning process using joint-supervision to generate supplemental training data in which joint supervision is used to create multiple supervision signals, which then in turn are used to pretrain the multimodal model, and where the system trains the AI model using a zero-shot or few-shot learning process to integrate the supplemental training data into a refined AI model. (Tajbakhsh [0133-0135]).
Claim(s) 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over
Peng et al. US Pub. No. 2024/0070394 A1, in view of Bathwal et al., US Pub. No.: US 2024/0281446 A1, in view of Gurgu et al. US Pub. No. 2023/0297887 A1, in view of Sewak et al., US Pub. No.: 2022/0414137 A1, in view of Salaam et al. US Pub. No. 2023/0259718 A1.
As to claim 17, Peng/Bathwal/Gurgu/Sewak do not disclose:
wherein the sequence model is a large multilingual model;
However, Salaam discloses:
the computer-implemented method of claim 1, wherein the sequence model is a large multilingual model
(Salaam [0046] [0046] The synthetic data 216 can be used to train a language model 220, e.g., a multilingual classification model. To this end, in some embodiments, the synthetic data
216 is transmitted to and received by a training module 218. In some cases, the training module 218 is part of the language model 220 to be trained;
See also [0003-0004] [0003] In one aspect of the present disclosure, a method of training a language model for code switching content is disclosed. In some embodiments, the method includes generating a dataset; and training a multilingual classification model. [0004] In some variants, the generating of the dataset includes: identifying one or more portions within textual
content in a first language, the identified one or more portions each comprising one or more of offensive content or non-offensive content, the identifying comprising tagging, based on an output of a first trained language model, the one or more portions with at least one content tag;
translating the tagged one or more portions to a second language using a second trained language model; and replacing, in the textual content, the tagged one or more portions with the translated one or more portions to generate codeswitched textual content.).
It would have been obvious to one having ordinary skill in the art at the time of the effective filing date to apply multilingual models as taught by Salaam, to the system of Peng/Bathwal/Gurgu/Sewak, since it was known in the art that machine learning systems provide training of the multilingual classification model including determining one or more training metrics based on evaluation of an output of the multilingual classification model with respect to the generated code-switched textual content, and adjusting a parameter of the multilingual classification model based on the one or more training metrics; identifying one or more portions within textual content in a first language, the identified one or more portions each comprising one or more of offensive content or non-offensive content, the identifying comprising tagging, based on an output of a first trained language model, the one or more portions with at least one content tag; translating the tagged one or more portions to a second language using a second trained language model; and replacing, in the textual content, the tagged one or more portions with the translated one or more portions to generate code-switched textual content, where this allows a language model to detect code-switched offensive content (Salaam [0005-0007]).
Claim(s) 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over
Peng et al. US Pub. No. 2024/0070394 A1, in view of Bathwal et al., US Pub. No.: US 2024/0281446 A1, in view of Gurgu et al. US Pub. No. 2023/0297887 A1, in view of Sewak et al., US Pub. No.: 2022/0414137 A1, in view of Schmaltz et al., Article: “Coarse-to-Fine Memory Matching for Joint Retrieval and Classification”, arxiv.org, arXiv:2012.02287 [cs.IR], https://doi.org/10.48550/arXiv.2012.02287; From: Allen Schmaltz Sun, 29 Nov 2020 05:06:03 UTC.
As to claim 18, Peng/Bathwal/Gurgu/Sewak do not disclose:
formatting the synthetic training dataset as a standard symmetric dataset;
However, Schmaltz discloses the computer-implemented method of claim 1, further comprising:
formatting the synthetic training dataset as a standard symmetric dataset
(Schmaltz sec 4.1 data: “Symmetric Dataset There are two versions of the symmetric re-annotation sets of Schuster et al. (2019) dataset: The version from the published paper, SYMGEN., and a subsequent version with a dev-test split, SYMDEV.V2 and SYMTEST.V2, available
in the public repo. We consider the first version for comparison to previous work, and we
also consider the new set as it provides a held-out set with which to examine updating the exemplar database. In these 2-class sets, single sentence retrieval is given, and for a subset of instances, the evidence and/or claims have been strategically modified to aid examination of a model’s reliance on class conditional distributional characteristics. We use these sets to analyze the model’s ability to identify—and predict over—out-of-domain samples. We use the labels TRAIN1-EVIDENCE and DEV1-EVIDENCE to indicate the original FEVER
train and dev sets limited to 2-class claims, with given single sentence evidence sets. Because these subsets from previous works have dropped the Wikipedia title and sentence index, we perform a preprocessing step to heuristically re-associate this metadata to the SUPPORT sequences based on the original FEVER data.”)
It would have been obvious to one having ordinary skill in the art at the time of the effective filing date to apply symmetric formatting as taught by Schmaltz, to the system of Peng/Bathwal/Gurgu/Sewak, since it was known in the art that machine learning systems provide, for the symmetric analysis experiments, a model which fine-tunes a BERTBASE model with a train set, where this model (BERTBASE+RW) is a cross-encoder given the ground-truth evidence during training and inference, and where, including the BERTBASE baseline from this work, results across models on the hidden test set, evaluated on CodaLab, show that the model is significantly stronger than the end-to-end language models that lack retrieval and approaches the accuracy of one of the recent strong multi-model systems, despite having fewer parameters, not using external linguistic tools, and consisting of a single end-to-end model; additionally, a tightly coupled end-to-end system achieves 97% of the accuracy of RAG using only 19% of the parameters. (Schmaltz 4.3-5).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Samarinas et al., Article: “Latent Retrieval for Large-Scale Fact-Checking and
Question Answering with NLI training”, Published in: 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI) Date of Conference: 09-11 November 2020
Date Added to IEEE Xplore: 24 December 2020; teaches that passage retrieval is a part of fact-checking and question answering systems that is critical yet often neglected. Most systems usually rely only on traditional sparse retrieval. This can have a significant impact on the recall, especially when the relevant passages have few overlapping words with the query sentence. Recent approaches have attempted to learn dense representations of queries and passages to better capture the latent semantic content of text. While dense retrieval models have been proven effective in question answering, there is no relevant work for improving evidence retrieval in fact-checking. In this work, we show that training a dense retriever is sufficient to outperform traditional sparse representations in both question answering and fact-checking. We constructed a new dataset called Factual-NLI, comprised of factual claims and their supporting evidence, and demonstrate that using it to train a dense retriever can improve evidence retrieval significantly. Experimental results on the MSMARCO dataset indicate that pre-training with Factual-NLI, and other NLI datasets, is also effective for large-scale passage retrieval in question answering. Our model is incorporated in a real world semantic search engine that returns snippets containing evidence related to questions and claims about the COVID-19 pandemic.
CONTACT INFORMATION
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EVAN S ASPINWALL whose telephone number is (571)270-7723. The examiner can normally be reached Monday-Friday 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Neveen Abel-Jalil can be reached at 571-270-0474. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Evan Aspinwall/Primary Examiner, Art Unit 2152