Prosecution Insights
Last updated: April 19, 2026
Application No. 18/110,847

AUTOMATED MACHINE LEARNING USING LARGE LANGUAGE MODELS

Non-Final OA · §101, §103
Filed
Feb 16, 2023
Examiner
CAMPOS, ALFREDO
Art Unit
2129
Tech Center
2100 — Computer Architecture & Software
Assignee
Microsoft Technology Licensing, LLC
OA Round
1 (Non-Final)
83%
Grant Probability
Favorable
1-2
OA Rounds
3y 9m
To Grant
99%
With Interview

Examiner Intelligence

Grants 83% — above average
83%
Career Allow Rate
5 granted / 6 resolved
+28.3% vs TC avg
Strong +33% interview lift
+33.3%
Interview Lift
resolved cases with interview
Typical timeline
3y 9m
Avg Prosecution
26 currently pending
Career history
32
Total Applications
across all art units
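The headline figures above can be reproduced from per-case records. The following is a minimal sketch: only the 5 granted / 6 resolved total and the +33.3% lift come from this page, while the per-case interview split is an assumption chosen to reproduce those totals.

```python
# Hypothetical resolved-case records: (granted, had_interview).
# Totals match the dashboard (5 granted / 6 resolved); the per-case
# interview split is an assumption chosen to reproduce the +33.3% lift.
resolved = [
    (True, True), (True, True), (True, True),
    (True, False), (True, False), (False, False),
]

granted = sum(1 for g, _ in resolved if g)
allow_rate = granted / len(resolved)  # career allowance rate

with_iv = [g for g, iv in resolved if iv]
without_iv = [g for g, iv in resolved if not iv]
# "interview lift": allowance rate with an interview minus without one
interview_lift = sum(with_iv) / len(with_iv) - sum(without_iv) / len(without_iv)

print(f"{allow_rate:.0%} allow rate, {interview_lift:+.1%} interview lift")
# -> 83% allow rate, +33.3% interview lift
```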

Statute-Specific Performance

§101
33.3%
-6.7% vs TC avg
§103
42.8%
+2.8% vs TC avg
§102
3.9%
-36.1% vs TC avg
§112
20.0%
-20.0% vs TC avg
Black line = Tech Center average estimate • Based on career data from 6 resolved cases

Office Action

§101 §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification

The disclosure is objected to because of the following informalities: Paragraphs 0032 and 0033 of the specification recite data categories 206-218 as a feature. However, FIG. 2A does not show any feature 218; it shows data categories 206, 208, 210, 212, 214, and 216. The data categories should be 206-216. The specification recites data storage 610 in paragraph 0070; however, the figure shows it as 710. The data storage should be 710. Appropriate correction is required.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The subject matter eligibility test for products and processes is described below for claim 1 in view of the dependent claims.

Regarding claim 1:

Step 1: Is the claim to a process, machine, manufacture, or composition of matter? Yes – Claim 1 recites a method, which falls under the statutory category of a process.

Step 2A Prong 1: Does the claim recite an abstract idea, law of nature, or natural phenomenon? Yes – The claim recites the following: “evaluating a performance of each of the plurality of corresponding machine learning models implemented by the plurality candidate machine learning based on the evaluation metric;” - This limitation of claim 1 recites a mental process of evaluating the performance of each machine learning model based on the evaluation metric (see MPEP 2106.04(a)(2)(III)).
“and selecting a machine learning model from the plurality of corresponding machine learning models implemented by the plurality of candidate machine learning pipelines, the selected machine learning model having a higher performance in relation to the performances of other machine learning models in the plurality of corresponding machine learning models.” - This limitation of claim 1 recites a mental process of selecting a model based on its higher performance relative to the other machine learning models (see MPEP 2106.04(a)(2)(III)).

Step 2A Prong 2: Does the claim recite additional elements that integrate the judicial exception into a particular application? No – The claim includes the following additional elements:

“A method comprising: receiving an input dataset comprising a plurality of quantities and an evaluation metric at a large language model;” - This additional element falls under insignificant extra-solution activity as mere data gathering: obtaining data and an evaluation metric at the large language model. See MPEP 2106.05(g).

“generating, by the large language model, a plurality of data transforms, each data transform of the plurality of data transforms formatting the input dataset for processing;” - This additional element falls under “apply it” as using a generic computer to implement a large language model to generate a plurality of data transforms. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).

“generating, by the large language model, a plurality of featurization approaches, each featurization approach defining a feature set for the input dataset comprising a constituent plurality of features derived from the input dataset;” - This additional element falls under “apply it” as using a generic computer to implement a large language model to generate a plurality of featurization approaches. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).
“initializing a plurality of candidate machine learning pipelines, each candidate machine learning pipeline implementing a corresponding machine learning model utilizing a data transform of the plurality of data transforms and an associated featurization approach generated by the large language model;” - This additional element falls under “apply it” as using a generic computer to initialize a plurality of candidate machine learning pipelines. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).

“configuring an automated machine learning training module with a plurality of corresponding machine learning models implemented by the plurality of candidate machine learning pipelines to process the input dataset;” - This additional element falls under “apply it” as using a generic computer to configure an automated machine learning training module with a plurality of candidate machine learning models to process the input dataset. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).

Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? No - The claim does not include additional elements sufficient to amount to significantly more than the judicial exception. As an ordered whole, the claim is directed to using a Large Language Model to preprocess data for an automated machine learning method. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of receiving, generating, initializing, and configuring fall under using a generic computer to apply the exception and mere data gathering. The method does not improve the functioning of a computer, does not transform an article into a different state or thing, and is not applied by a particular machine, making the claim not patent eligible.
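For orientation, the claim 1 steps the rejection walks through (receive a dataset and metric, LLM-generated transforms and featurizations, candidate pipelines, evaluation, selection) can be sketched as a toy loop. This is a hypothetical illustration only, not the applicant's implementation: `propose_transforms` and `propose_featurizations` are stand-ins for the claimed LLM calls, and the mean-predictor "model" and error metric are placeholders.

```python
from statistics import mean

# Stand-ins for "generating, by the large language model": a real system
# would prompt an LLM; these stubs simply return candidate callables.
def propose_transforms(dataset):
    return [lambda xs: list(xs),                    # identity transform
            lambda xs: [x / max(xs) for x in xs]]   # rescale to [0, 1]

def propose_featurizations(dataset):
    return [lambda xs: [[x] for x in xs],           # raw value
            lambda xs: [[x, x * x] for x in xs]]    # value plus its square

def evaluate(prediction, targets):
    return -mean(abs(prediction - t) for t in targets)  # toy metric: -MAE

def automl(dataset, targets):
    # initialize one candidate pipeline per (transform, featurization) pair
    pipelines = [(t, f) for t in propose_transforms(dataset)
                 for f in propose_featurizations(dataset)]
    scored = []
    for transform, featurize in pipelines:
        feats = featurize(transform(dataset))
        model = mean(row[0] for row in feats)  # placeholder "model": mean predictor
        scored.append((evaluate(model, targets), model))
    return max(scored)[1]  # select the highest-performing model

best = automl([2.0, 4.0, 6.0], targets=[4.0, 4.0, 4.0])  # -> 4.0
```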
Regarding claim 2: Step 2A Prong 1: “The method of claim 1, wherein a feature of the constituent plurality of features is a ratio of two quantities of the plurality of quantities.” – The limitation recites a mathematical relationship in which a feature is a ratio between two quantities (see MPEP 2106.04(a)(2)). Step 2A Prong 2, Step 2B: No additional elements. The judicial exception is not integrated into a practical application and provides no improvement. The claim provides neither an inventive concept nor a practical application.

Regarding claim 3: Step 2A Prong 1: “The method of claim 1, wherein a feature of the constituent plurality of features is an aggregate quantity of a subset of the plurality of quantities.” – The limitation recites a mathematical calculation in which a feature is an aggregate quantity (see MPEP 2106.04(a)(2)). Step 2A Prong 2, Step 2B: No additional elements. The judicial exception is not integrated into a practical application and provides no improvement. The claim provides neither an inventive concept nor a practical application.

Regarding claim 4: Step 2A Prong 2, Step 2B: The additional element “The method of claim 1, wherein a feature of the constituent plurality of features is a subdivision extracted from a quantity of the plurality of quantities.” falls under insignificant extra-solution activity. See MPEP 2106.05(g). The judicial exception is not integrated into a practical application and provides no improvement. The claim provides neither an inventive concept nor a practical application.

Regarding claim 5: Step 2A Prong 2, Step 2B: The additional element “The method of claim 1, wherein a feature of the constituent plurality of features defines a characteristic of a quantity of the plurality of quantities.” falls under insignificant extra-solution activity. See MPEP 2106.05(g).
The judicial exception is not integrated into a practical application and provides no improvement. The claim provides neither an inventive concept nor a practical application.

Regarding claim 6: Step 2A Prong 2, Step 2B: The additional element “The method of claim 1, wherein the plurality of data transforms is generated based on a data type of the input dataset.” falls under insignificant extra-solution activity. See MPEP 2106.05(g). The judicial exception is not integrated into a practical application and provides no improvement. The claim provides neither an inventive concept nor a practical application.

Regarding claim 7: Step 2A Prong 2, Step 2B: The additional element “The method of claim 1, wherein: the evaluation metric is selected based on a machine learning task associated with the input dataset; the machine learning task is a binary classification task identifying a malicious uniform resource locator; and the evaluation metric is an area under curve metric.” falls under “apply it” as using a generic computer to classify and use a determined evaluation metric. See Mere Instructions to Apply an Exception (MPEP 2106.05(f)).

Claims 8-14 recite a system and are analogous to the method of claims 1-7. Therefore, the rejections of claims 1-7 above apply to claims 8-14. Claims 15-20 recite a computer-readable medium (CRM) and are analogous to the method of claims 1-7. Therefore, the rejections of claims 1-7 above apply to claims 15-20.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4-6, 8, 11-13, 15 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Bavly et al. (US20210334693A1) (“Bavly”) in view of Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthcare 3, 1, Article 2 (January 2022), 23 pages (“Gu”).

Regarding claim 1 and analogous claims 8 and 15, Bavly teaches A method comprising: receiving an input dataset comprising a plurality of quantities and an evaluation metric [at a large language model] (Bavly para 0022, The Auto-XAI module 212 may be configured to train the selection of models 210 with the feature datasets 206. The model training purpose for solving a developer's particular technical problem may be defined first to select particular models before training the selected models. For example, a model for predicting risk score may be selected and based on user transaction data and behaviors. The auto-XAI module 212 may be configured to extract a subset of models and parameters that may be offered as alternatives, one of which may be selected as the recommended XAI model based on a trade-off between model explainability and model performance.
Para 0031, At step 404, application server 120 may execute the Auto-XAI module 212 to train the selected models 210 with the respective input feature datasets 206 [receiving an input dataset comprising a plurality of quantities]. The auto-XAI module 212 may process and generate respective trained models 214 with a respective output for each model. Application server 120 may perform model evaluation 216 and model selection 218 of the pipeline platform 200 based on trained models' outputs. Para 0037 line 14, FIG. 5 shows example training results of four example models in accordance with some embodiments of the present disclosure. Each model trained may be optimized using auto-ML techniques. As illustrated in FIG. 5, outputs of the trained models may be used to evaluate the model performance. The model performance may be represented by an accuracy indicative of an accuracy value or performance score (e.g., F1 score) and explainability (also referred to herein as explainability properties). The F1 score is a measure of accuracy of the trained model and may be defined as the weighted harmonic mean of the precision and recall of the trained model. The evaluation metrics may include accuracy, precision and recall, which may be interactively selected by an expert and or developer [an evaluation metric].); initializing a plurality of candidate machine learning pipelines, each candidate machine learning pipeline implementing a corresponding machine learning model utilizing a data transform of the plurality of data transforms and an associated featurization approach generated [by the large language model] (Bavly para 0021, FIG. 2 is a conceptual diagram of an example machine learning pipeline platform 200 to implement explainable machine learning in accordance with the disclosed principles.
The platform 200 may include various software algorithms configured as computer programs ( e.g., software) executed on one or more computers, in which the systems, models, algorithms, processes, and embodiments can be implemented various functionalities as described below. The platform 200 may explore different modeling techniques ( e.g., machine learning algorithms or models) compatible with training feature dataset and evaluate the performances of the trained models [initializing a plurality of candidate machine learning pipelines, each candidate machine learning pipeline implementing a corresponding machine learning model]. para 0022 line 1-10, The platform 200 may receive and input original data 202 and may include, among other things, algorithms of various machine learning models 208 with the aim of providing one or more recommended explainable models 218 as described herein. For example, the platform 200 may further include an Auto-XAI module 212 (e.g., Auto-XAI module 124 in FIG. 1) to receive feature datasets 206 (after undergoing feature engineering 204, explained below in more detail) and a selection of models 210 output from the set of models 208. Para 0025, At step 304, feature engineering 204 may be performed by the application server 120 to extract and construct a plurality of feature datasets 206, which may be used an input to the auto-XAI module 212. Appropriate features may be selected and extracted to be used as input feature datasets for training purposes. A search in the appropriate parameter space may be automatically conducted to perform feature selection, so that an expert or a developer may not be required to have an intimate understanding of each of the selected models. 
As part of step 306, Application server 120 may perform preprocessing operations by making slight additions and or modifications to the features to generate the feature datasets 206 [utilizing a data transform of the plurality of data transforms and an associated featurization approach generated]); configuring an automated machine learning training module with a plurality of corresponding machine learning models implemented by the plurality of candidate machine learning pipelines to process the input dataset (Bavly Fig. 2 [configuring an automated machine learning training module with a plurality of corresponding machine learning models implemented] Para 0031 line 1-8, At step 404, application server 120 may execute the Auto-XAI module 212 to train the selected models 210 with the respective input feature datasets 206. The auto-XAI module 212 may process and generate respective trained models 214 with a respective output for each model. Application server 120 may perform model evaluation 216 and model selection 218 of the pipeline platform 200 based on the trained models' outputs. The models 210 may be trained by varying their respective explainability properties [by the plurality of candidate machine learning pipelines to process the input dataset]); evaluating a performance of each of the plurality of corresponding machine learning models implemented by the plurality candidate machine learning based on the evaluation metric (Bavly Para 0031, At step 404, application server 120 may execute the Auto-XAI module 212 to train the selected models 210 with the respective input feature datasets 206. The auto-XAI module 212 may process and generate respective trained models 214 with a respective output for each model.
Application server 120 may perform model evaluation 216 and model selection 218 of the pipeline platform 200 based on the trained models' outputs [evaluating a performance of each of the plurality of corresponding machine learning models implemented]. The models 210 may be trained by varying their respective explainability properties. para 0039 line 1-6, Returning again to FIGS. 2 and 4, at step 408, application server 120 may execute models or algorithms of the platform 200 to determine an explainable model 218 as a recommended model from the set of the trained models 214 based on at least one of the accuracy value and the explainability properties [by the plurality candidate machine learning based on the evaluation metric]); and selecting a machine learning model from the plurality of corresponding machine learning models implemented by the plurality of candidate machine learning pipelines, the selected machine learning model having a higher performance in relation to the performances of other machine learning models in the plurality of corresponding machine learning models (Bavly para 0039, Returning again to FIGS. 2 and 4, at step 408, application server 120 may execute models or algorithms of the platform 200 to determine an explainable model 218 as a recommended model from the set of the trained models 214 based on at least one of the accuracy value and the explainability properties. The application server 120 may select and or determine the explainable model 218 from the set of trained models 214 based on a trade-off decision made between the accuracy and explainability properties of the trained models 214. The system may conduct model evaluation 216 by performing automated ranking and assessment of models and parameters so that the best list of possible options may be determined for the expert of developer based on the trade-off between performance and explainability. 
As a result, the system may only keep model options that are Pareto-optimal with respect to the explainability and multi-objective optimization [and selecting a machine learning model from the plurality of corresponding machine learning models implemented by the plurality of candidate machine learning pipelines,]. Para 0047, At step 612, the application server 120 may determine or select an explainable model 218 as the trained model with best explainability properties from the subset of the trained models. Para 0048, In one embodiment, a typical case of a multi-objective process may be used to select acceptable models such that each model in the subset of trained models passes (i.e., exceeds) the predetermined accuracy threshold for one objective ( e.g., accuracy). The predetermined accuracy threshold may be set to have at least a percentage of accuracy or a predetermined performance score. For example, the model ranking may be conducted first based on accuracy values or performance scores when explainability is not important. Further, the best option from the remaining model options may be chosen based on another objective (e.g., explainability). The explainability of the subset of the trained models may be ranked or evaluated to determine models that exceed a predetermined explainability threshold. The most explainable or simplest model may be selected as the final explainable model 218 from the subset of the trained models. Para 0049, The input-output relationship of each trained model may be used to show and or describe where each model fails or succeeds such that an expert and or developer may get a better understanding of the areas of failure. The model training results may be analyzed to show and determine the accuracy-explainability trade-off [the selected machine learning model having a higher performance in relation to the performances of other machine learning models in the plurality of corresponding machine learning models]). 
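Bavly's F1 score, quoted above as "the weighted harmonic mean of the precision and recall of the trained model", is the standard definition. A quick sketch with invented counts:

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 8 true positives, 2 false positives,
# and 2 false negatives give precision = recall = F1 = 0.8.
score = f1_score(tp=8, fp=2, fn=2)
```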
Bavly does not explicitly teach [receiving an input dataset comprising a plurality of quantities and an evaluation metric] at a large language model; generating, by the large language model, a plurality of data transforms, each data transform of the plurality of data transforms formatting the input dataset for processing; generating, by the large language model, a plurality of featurization approaches, each featurization approach defining a feature set for the input dataset comprising a constituent plurality of features derived from the input dataset; [and an associated featurization approach generated] by the large language model. However Gu teaches [receiving an input dataset comprising a plurality of quantities and an evaluation metric] at a large language model (Gu 2.3 BLURB: A Comprehensive Benchmark for Biomedical NLP, para 3 line 1-3, BLURB is comprised of a comprehensive set of biomedical NLP tasks from publicly available datasets, including NER, evidence-based medical information extraction (PICO), relation extraction, sentence similarity, document classification, and question answering. Gu page 10, 2.4.1 A General Architecture for Fine-Tuning Neural Language Models, para 2, To facilitate a head-to-head comparison, we apply the same fine-tuning procedure for all BERT models and tasks. Specifically, we use cross-entropy loss for classification tasks and mean square error for regression tasks. We conduct hyperparameter search using the development set based on task-specific metrics. Similar to previous work, we jointly fine-tune the parameters of the task-specific prediction layer as well as the underlying neural language model. (i.e., the LLM receives the evaluation metric)); generating, by the large language model, a plurality of data transforms, each data transform of the plurality of data transforms formatting the input dataset for processing (Gu Page 9, 2.3.6 Question Answering (QA). PubMedQA.
The PubMedQA dataset [25] contains a set of research questions, each with a reference text from a PubMed abstract as well as an annotated label of whether the text contains the answer to the research question (yes/maybe/no). We use the original train/dev/test split with 450, 50, and 500 questions, respectively. BioASQ. The BioASQ corpus [42] contains multiple question answering tasks annotated by biomedical experts, including yes/no, factoid, list, and summary questions. Pertaining to our objective of comparing neural language models, we focus on the yes/no questions (Task 7b) and leave the inclusion of other tasks to future work. Each question is paired with a reference text containing multiple sentences from a PubMed abstract and a yes/no answer. We use the official train/dev/test split of 670/75/140 questions. 2.4 Task-Specific Fine-Tuning. Pretrained neural language models provide a unifying foundation for learning task-specific models. Given an input token sequence, the language model produces a sequence of vectors in the contextual representation. A task-specific prediction model is then layered on top to generate the final output for a task-specific application. Given task-specific training data, we can learn the task-specific model parameters and refine the BERT model parameters by gradient descent using backpropagation [each data transform of the plurality of data transforms formatting the input dataset for processing]. Page 10 Fig. 2 [generating, by the large language model, a plurality of data transforms]. 2.4.1 A General Architecture for Fine-Tuning Neural Language Models. Figure 2 shows a general architecture of fine-tuning neural language models for downstream applications. An input instance is first processed by a TransformInput module that performs task-specific transformations such as appending special instance marker (e.g., [CLS]) or dummifying entity mentions for relation extraction.
The transformed input is then tokenized using the neural language model’s vocabulary and fed into the neural language model. Next, the contextual representation at the top layer is processed by a Featurizer module and then fed into the Predict module to generate the final output for a given task.); generating, by the large language model, a plurality of featurization approaches, each featurization approach defining a feature set for the input dataset comprising a constituent plurality of features derived from the input dataset (Gu Page 9, 2.3.6 Question Answering (QA). PubMedQA. The PubMedQA dataset [25] contains a set of research questions, each with a reference text from a PubMed abstract as well as an annotated label of whether the text contains the answer to the research question (yes/maybe/no). We use the original train/dev/test split with 450, 50, and 500 questions, respectively. BioASQ. The BioASQ corpus [42] contains multiple question answering tasks annotated by biomedical experts, including yes/no, factoid, list, and summary questions. Pertaining to our objective of comparing neural language models, we focus on the yes/no questions (Task 7b) and leave the inclusion of other tasks to future work. Each question is paired with a reference text containing multiple sentences from a PubMed abstract and a yes/no answer. We use the official train/dev/test split of 670/75/140 questions. 2.4.1 A General Architecture for Fine-Tuning Neural Language Models. Figure 2 shows a general architecture of fine-tuning neural language models for downstream applications. An input instance is first processed by a TransformInput module that performs task-specific transformations such as appending special instance marker (e.g., [CLS]) or dummifying entity mentions for relation extraction. The transformed input is then tokenized using the neural language model’s vocabulary and fed into the neural language model.
Next, the contextual representation at the top layer is processed by a Featurizer module and then fed into the Predict module to generate the final output for a given task [generating, by the large language model, a plurality of featurization approaches,]); [and an associated featurization approach generated] by the large language model (Gu page 9, 2.4.1 A General Architecture for Fine-Tuning Neural Language Models. Figure 2 shows a general architecture of fine-tuning neural language models for downstream applications. An input instance is first processed by a TransformInput module that performs task-specific transformations such as appending special instance marker (e.g., [CLS]) or dummifying entity mentions for relation extraction. The transformed input is then tokenized using the neural language model’s vocabulary and fed into the neural language model. Next, the contextual representation at the top layer is processed by a Featurizer module and then fed into the Predict module to generate the final output for a given task (i.e., generates the associated featurization approach by the large language model)). Bavly and Gu are considered to be analogous to the claimed invention because they are in the same field of distributed machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Bavly in view of Gu to disclose using a Large Language Model to process datasets and generate features. Doing so uses general-domain language models and fine-tunes them for domain-specific tasks (Gu Abstract line 1-10, Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models.
In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition). Regarding claim 4 and analogous claims 11 and 18, Bavly in view of Gu teach the method of claim 1. Bavly teaches wherein a feature of the constituent plurality of features is a subdivision extracted from a quantity of the plurality of quantities (Bavly para 0025 line 1-10, At step 304, feature engineering 204 may be performed by the application server 120 to extract and construct a plurality of feature datasets 206 [from a quantity of the plurality of quantities], which may be used an input to the auto-XAI module 212. Appropriate features may be selected and extracted to be used as input feature datasets for training purposes [wherein a feature of the constituent plurality of features is a subdivision extracted]. A search in the appropriate parameter space may be automatically conducted to perform feature selection, so that an expert or a developer may not be required to have an intimate understanding of each of the selected models). Regarding claim 5 and analogous claims 12 and 19, Bavly in view of Gu teach the method of claim 1. 
Bavly teaches wherein a feature of the constituent plurality of features defines a characteristic of a quantity of the plurality of quantities (Bavly para 0025 line 10-15, As part of step 306, Application server 120 may perform preprocessing operations by making slight additions and or modifications to the features to generate the feature datasets 206. In one or more embodiments, a flag may be added to each feature of the dataset 206 to indicate whether the feature has a semantic representation or not [defines a characteristic of a quantity of the plurality of quantities]). Regarding claim 6 and analogous claims 13 and 20, Bavly in view of Gu teach the method of claim 1. Bavly and Gu are combined under the same rationale as set forth above with respect to claim 1 and analogous claims 8 and 15. Gu further teaches wherein the plurality of data transforms is generated based on a data type of the input dataset (Gu Page 9, 2.3.6 Question Answering (QA). PubMedQA. The PubMedQA dataset [25] contains a set of research questions, each with a reference text from a PubMed abstract as well as an annotated label of whether the text contains the answer to the research question (yes/maybe/no). We use the original train/dev/test split with 450, 50, and 500 questions, respectively. BioASQ. The BioASQ corpus [42] contains multiple question answering tasks annotated by biomedical experts, including yes/no, factoid, list, and summary questions. Pertaining to our objective of comparing neural language models, we focus on the yes/no questions (Task 7b) and leave the inclusion of other tasks to future work. Each question is paired with a reference text containing multiple sentences from a PubMed abstract and a yes/no answer. We use the official train/dev/test split of 670/75/140 questions [based on a data type of the input dataset]. 2.4 Task-Specific Fine-Tuning. Pretrained neural language models provide a unifying foundation for learning task-specific models.
Given an input token sequence, the language model produces a sequence of vectors in the contextual representation. A task-specific prediction model is then layered on top to generate the final output for a task-specific application. Given task-specific training data, we can learn the task-specific model parameters and refine the BERT model parameters by gradient descent using backpropagation [wherein the plurality of data transforms is generated] (i.e., the data transforms are generated based on the data type of the input dataset)).

Claim(s) 2, 3, 9, 10, 16 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Bavly in view of Gu and further in view of Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I. Jordan. 2009. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP '09). Association for Computing Machinery, New York, NY, USA, 117–132 (“Xu”).

Regarding claim 2 and analogous claims 9 and 16, Bavly in view of Gu teaches the method of claim 1. Bavly and Gu are combined under the same rationale as set forth above with respect to claim 1 and analogous claims 8 and 15. Bavly does not explicitly teach wherein a feature of the constituent plurality of features is a ratio of two quantities of the plurality of quantities. However, Xu teaches wherein a feature of the constituent plurality of features is a ratio of two quantities of the plurality of quantities (Xu Page 6, 4. Feature Creation: This section describes our technique for constructing features from parsed logs. We focus on two features, the state ratio vector and the message count vector [wherein a feature of the constituent plurality of features], based on state variables and identifiers (see Section 2.1), respectively. The state ratio vector is able to capture the aggregated behavior of the system over a time window. The message count vector helps detect problems related to individual operations.
Both features describe message groups constructed to have strong correlations among their members. The features faithfully capture these correlations, which are often good indicators of runtime problems. Although these features are from the same log, and similar in structure, they are constructed independently, and have different semantics. 4.1 State variables and state ratio vectors: We construct state ratio vectors y to encode this correlation: Each state ratio vector represents a group of state variables in a time window, while each dimension of the vector corresponds to a distinct state variable value, and the value of the dimension is how many times this state value appears in the time window [features is a ratio of two quantities of the plurality of quantities]).

Bavly and Xu are considered to be analogous to the claimed invention because they are in the same field of distributed machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Bavly in view of Xu to disclose generating a feature that is a ratio of two quantities. Doing so would generate sophisticated features without the need for human input (Xu Abstract lines 9-23, We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes.
Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software’s internals.).

Regarding claim 3 and analogous claims 10 and 17, Bavly in view of Gu teaches the method of claim 1. Bavly and Gu are combined under the same rationale as set forth above with respect to claim 1 and analogous claims 8 and 15. Bavly and Xu are combined under the same rationale as set forth above with respect to claim 2 and analogous claims 10 and 17. Xu further teaches wherein a feature of the constituent plurality of features is an aggregate quantity of a subset of the plurality of quantities (Xu Page 122, 4. Feature Creation: This section describes our technique for constructing features from parsed logs. We focus on two features, the state ratio vector and the message count vector [wherein a feature of the constituent plurality of features], based on state variables and identifiers (see Section 2.1), respectively. The state ratio vector is able to capture the aggregated behavior of the system over a time window. The message count vector helps detect problems related to individual operations. Both features describe message groups constructed to have strong correlations among their members. The features faithfully capture these correlations, which are often good indicators of runtime problems. Although these features are from the same log, and similar in structure, they are constructed independently, and have different semantics. Pages 122-123, 4.2 Identifiers and message count vectors, paras 2-3: To form the message count vector, we first automatically discover identifiers, then group together messages with the same identifier values, and create a vector per group. Each vector dimension corresponds to a different message type, and the value of the dimension tells how many messages of that type appear in the message group.
The structure of this feature is analogous to the bag of words model in information retrieval [6]. In our application, the “document” is the message group. The dimensions of the vector consist of the union of all useful message types across all groups (analogous to all possible “terms”), and the value of a dimension is the number of appearances of the corresponding message types in a group (corresponding to “term frequency”). Algorithm 1 summarizes our three-step process for feature construction. We now try to provide intuition behind the design choices in this algorithm [aggregate quantity of a subset of the plurality of quantities]).

Claim(s) 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Bavly in view of Gu and further in view of M. Darling, G. Heileman, G. Gressel, A. Ashok and P. Poornachandran, "A lexical approach for classifying malicious URLs," 2015 International Conference on High Performance Computing & Simulation (HPCS), Amsterdam, Netherlands, 2015, pp. 195-202 (“Darling”).

Regarding claim 7 and analogous claim 14, Bavly in view of Gu teaches the method of claim 1. Bavly and Gu are combined under the same rationale as set forth above with respect to claim 1 and analogous claims 8 and 15. Bavly does not explicitly teach wherein: the evaluation metric is selected based on a machine learning task associated with the input dataset; the machine learning task is a binary classification task identifying a malicious uniform resource locator; and the evaluation metric is an area under curve metric.
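For technical context, the message count vector quoted from Xu above is a per-identifier bag of message types: parsed log lines are grouped by a shared identifier, and each vector dimension counts how often one message type appears in the group. A minimal illustrative sketch follows (hypothetical Python with invented log data; the identifiers and message types are examples of mine, not part of the record):

```python
from collections import Counter, defaultdict

def message_count_vectors(parsed_logs, message_types):
    """Build one message count vector per identifier group, in the
    spirit of Xu et al.: group messages by identifier, then count
    how many messages of each type appear in each group."""
    groups = defaultdict(Counter)
    for identifier, message_type in parsed_logs:
        groups[identifier][message_type] += 1
    # Fixed dimension order = the union of message types (the "terms").
    return {ident: [counts[t] for t in message_types]
            for ident, counts in groups.items()}

# Invented example: HDFS-style block identifiers and message types.
logs = [("blk_1", "allocate"), ("blk_1", "write"), ("blk_1", "write"),
        ("blk_2", "allocate"), ("blk_2", "delete")]
vectors = message_count_vectors(logs, ["allocate", "write", "delete"])
# vectors["blk_1"] == [1, 2, 0]; vectors["blk_2"] == [1, 0, 1]
```

Each per-group count is an aggregate over a subset of the parsed quantities, which is the sense in which the rejection maps this feature onto the claim language.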
However, Darling teaches wherein: the evaluation metric is selected based on a machine learning task associated with the input dataset; the machine learning task is a binary classification task identifying a malicious uniform resource locator; and the evaluation metric is an area under curve metric (Darling Page 195, I. Introduction, para 7: In this paper we present an approach which uses an n-gram model to develop a new classification system that adheres to the strict time-constraints required for a real-time system. The system increases accuracy on out-of-sample testing data while maintaining overall classification accuracy comparable to previous work [1]–[3], [6]. Our approach uses the J48 decision tree algorithm to perform classification of URLs using 16 features extracted from an n-gram model and 71 features from other lexical properties. J48 is an open-source implementation of the C4.5 algorithm [8]. Pages 198-199, E. Classification Algorithms: In this study we chose to explore several classification methods. As our baseline we built a linear classifier using regularized logistic regression. Logistic regression is a parametric model for binary classification where examples are classified by their distance from a decision boundary [the machine learning task is a binary classification]. We implemented the L1 regularized logistic regression model from the LibLinear package as was done by Ma et al. [1]. For the implementation of the classifier based on n-gram modeling we used the J48 algorithm (an implementation of the C4.5 decision tree). J48 is one of many classification algorithms available in the popular Weka machine learning suite [8]. We chose J48 due to its reputation of success in this domain [15] and after evaluating its performance in comparison to Bayesian Logistic Regression, Logistic Regression, Naive Bayes, and K-Nearest Neighbors classifiers.
In order to evaluate performance of each algorithm, we compared their overall accuracy and receiver operating characteristic (ROC) curve [the evaluation metric is selected based on a machine learning task associated with the input dataset]. A classifier’s accuracy is not the only necessary metric to evaluate its performance. The ambiguity lies in the fact that classification accuracy will have equal misclassification costs: an accuracy of 99% tells nothing about the false positive and negative rates. In fact, in a real-world deployment of any classifier, it is often true that the error rate of one class of data comes at a higher cost than misclassifying other types of data. For the problem of malicious web page detection, a false negative is potentially much more harmful than a false positive since it could result in an infected system [the machine learning task is a binary classification task identifying a malicious uniform resource locator]. An ROC curve shows the false-positive and false-negative rates on the X and Y axes respectively. Therefore, the curve represents the predictive quality of the classifier independent of error costs (and class imbalance in the training data) [16]. In order to choose our classifier we examined the ROC Area-Under-the-Curve value (AUC) and accuracy rate of each type of classifier. Table III shows J48 outperforming its counterparts in AUC of its ROC. Furthermore, it achieved the highest overall classification accuracy for the n-gram model [the evaluation metric is an area under curve metric]. Page 6, III. Results: Table VI shows the confusion matrix for the J48 model with 10-fold cross validation on the entire data set. Each row represents the instances of an actual class, each column represents the instances of the predicted class by J48. [Table VI image omitted] [the machine learning task is a binary classification task identifying a malicious uniform resource locator;] (i.e., the system classifies URLs as Malicious or Benign)).

Bavly and Darling are considered to be analogous to the claimed invention because they are in the same field of distributed machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Bavly in view of Darling to disclose using area under the curve as an evaluation metric. Doing so would select the best classifier based on predictive quality (Darling Page 199, E. Classification Algorithms, para 4: An ROC curve shows the false-positive and false-negative rates on the X and Y axes respectively. Therefore, the curve represents the predictive quality of the classifier independent of error costs (and class imbalance in the training data) [16]. In order to choose our classifier we examined the ROC Area-Under-the-Curve value (AUC) and accuracy rate of each type of classifier. Table III shows J48 outperforming its counterparts in AUC of its ROC. Furthermore, it achieved the highest overall classification accuracy for the n-gram model.).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALFREDO CAMPOS, whose telephone number is (571) 272-4504. The examiner can normally be reached 7:00 am - 4:00 pm, M - F. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J. Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center.
Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /ALFREDO CAMPOS/Examiner, Art Unit 2129 /MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129
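As technical background for the AUC metric relied on in the Darling citation: the area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counted as half), which is the rank-based Mann-Whitney formulation. A minimal stdlib-Python sketch with invented scores (not data from Darling):

```python
def roc_auc(labels, scores):
    """Rank-based (Mann-Whitney) AUC: the fraction of positive/negative
    pairs in which the positive example is scored higher, with ties
    contributing 0.5. Equivalent to the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = malicious URL, 0 = benign; scores are hypothetical classifier outputs.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(y_true, y_score)  # 8 of 9 pairs correctly ranked ≈ 0.889
```

Because AUC depends only on the ranking of scores, it is insensitive to class imbalance and misclassification costs, which is the property the Darling excerpt invokes when preferring it over raw accuracy.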

Prosecution Timeline

Feb 16, 2023: Application Filed
Jul 28, 2025: Applicant Interview (Telephonic)
Dec 03, 2025: Non-Final Rejection (§101, §103)
Feb 24, 2026: Interview Requested
Mar 05, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561407: ONE-PASS APPROACH TO AUTOMATED TIMESERIES FORECASTING. Granted Feb 24, 2026 (2y 5m to grant).
Patent 12561559: Neural Network Training Method and Apparatus, Electronic Device, Medium and Program Product. Granted Feb 24, 2026 (2y 5m to grant).
Patent 12554973: HIERARCHICAL DATA LABELING FOR MACHINE LEARNING USING SEMI-SUPERVISED MULTI-LEVEL LABELING FRAMEWORK. Granted Feb 17, 2026 (2y 5m to grant).
Patent 12536260: SYSTEM, APPARATUS, AND METHOD FOR AUTOMATICALLY GENERATING NEGATIVE KEYSTROKE EXAMPLES AND TRAINING USER IDENTIFICATION MODELS BASED ON KEYSTROKE DYNAMICS. Granted Jan 27, 2026 (2y 5m to grant).
Study what changed to get past this examiner. Based on 4 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
Grant Probability With Interview: 99% (+33.3%)
Median Time to Grant: 3y 9m
PTA Risk: Low
Based on 6 resolved cases by this examiner. Grant probability derived from career allow rate.
