Prosecution Insights
Last updated: April 19, 2026
Application No. 18/178,056

TIERED EVALUATION METRIC FOR COMPREHENSIVELY EVALUATING MACHINE LEARNING MODELS

Non-Final OA · §101, §102, §103
Filed: Mar 03, 2023
Examiner: PHAKOUSONH, DARAVANH
Art Unit: 2121
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Optum Services (Ireland) Limited
OA Round: 1 (Non-Final)
Grant Probability: 50% (Moderate)
Estimated OA Rounds: 1-2
Estimated Time to Grant: 4y 0m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 50% of resolved cases (1 granted / 2 resolved; -5.0% vs TC avg)
Interview Lift: strong, +100.0% for resolved cases with an interview
Typical Timeline: 4y 0m average prosecution
Career History: 35 total applications across all art units; 33 currently pending

Statute-Specific Performance

§101: 31.2% (-8.8% vs TC avg)
§103: 38.1% (-1.9% vs TC avg)
§102: 14.8% (-25.2% vs TC avg)
§112: 13.2% (-26.8% vs TC avg)
Tech Center averages are estimates. Based on career data from 2 resolved cases.

Office Action

§101 §102 §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.

101 Subject Matter Eligibility Analysis

Claims 1-8 are directed to a method consisting of a series of steps, meaning that they are directed to the statutory category of process. Claims 9-20 are directed to storage media and processors, which are machines.

Step 2A Prong One, Step 2A Prong Two, and Step 2B Analysis: Step 2A Prong One asks if the claim recites a judicial exception (abstract idea, law of nature, or natural phenomenon). If the claim recites a judicial exception, analysis proceeds to Step 2A Prong Two, which asks if the claim recites additional elements that integrate the abstract idea into a practical application. If the claim does not integrate the judicial exception, analysis proceeds to Step 2B, which asks if the claim amounts to significantly more than the judicial exception. If the claim does not amount to significantly more than the judicial exception, the claim is not eligible subject matter under 35 U.S.C. 101. None of the claims represent an improvement to technology.

Regarding claim 1, the following claim elements are abstract ideas: Generating…a holistic evaluation vector for a target machine learning model based on a plurality of evaluation scores for the target machine learning model, wherein the plurality of evaluation scores comprises: (i) a data evaluation score corresponding to a training dataset for the target machine learning model, (ii) a model evaluation score corresponding to one or more performance metrics for the target machine learning model, and (iii) a decision evaluation score corresponding to an output class of the target machine learning model (This is an abstract idea of a mental process and mathematical concepts. Under the broadest reasonable interpretation, the recited “data evaluation score,” “model evaluation score,” and “decision evaluation score” each correspond to numerical values that are generated from mathematical or statistical analysis of data. The “holistic evaluation vector” merely represents an arrangement of these numerical values. As such, the limitation is directed to mathematical calculations and the organization of numerical information, which are types of operations that can be performed in the human mind, with the aid of basic computing tools (e.g., pen and paper or a calculator). Accordingly, these limitations are directed to the mental process groupings and mathematical concept groupings of abstract ideas. See MPEP 2106.04(a)(2)(I) and 2106.04(a)(2)(III).); generating…a holistic evaluation score for the target machine learning model based on an aggregation of the holistic evaluation vector (Under the broadest reasonable interpretation, the recited step of “generating a holistic evaluation score based on an aggregation of the holistic evaluation vector” amounts to performing a mathematical aggregation of previously-determined numerical scores.
That is, the “holistic evaluation score” is simply a further numerical value derived from combining the underlying evaluation scores using mathematical operations. Such activity constitutes mathematical calculation and numerical analysis that can be performed in the human mind, optionally with the aid of basic computing tools (e.g., pen and paper or a calculator). Accordingly, this limitation is directed to a mathematical concept and a mental process.); and

The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: one or more processors (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.); providing, by the one or more processors, an evaluation output for the target machine learning model based on the holistic evaluation score (This merely applies the result of the abstract idea and therefore amounts to insignificant extra-solution activity. See MPEP 2106.05(g).).

Regarding claim 2, the rejection of claim 1 is incorporated herein. Further, claim 2 recites the following additional elements, which taken alone or in combination with other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: (i) the target machine learning model is previously trained based on the training dataset, (ii) the training dataset comprises a plurality of input data objects and a plurality of input features, (iii) each input data object of the plurality of input data objects comprises an input feature value for one or more of the plurality of input features, and (iv) the data evaluation score is indicative of a balance of the training dataset with respect to one or more of the plurality of input features (These limitations merely apply the abstract idea in the context of a previously-trained model and describe the dataset and the meaning of the evaluation score, and therefore amount to insignificant extra-solution activity.).

Regarding claim 3, the rejection of claim 2 is incorporated herein. Further, claim 3 recites the following abstract idea: generating the data evaluation score based on the data evaluation profile (This is an abstract idea of a mental process. The limitation recites evaluating characteristics of the training data that are reflected in the “data evaluation profile” and producing a corresponding numerical “data evaluation score.” A person could, through observation, judgement, and reasoning, examine the profile values and compute a corresponding score using simple arithmetic or weighting rules, either mentally or with basic tools such as pen, paper, or a calculator. Because this involves evaluation and calculation that can be practically performed in the human mind or with basic computing tools, it falls within the mental process grouping of abstract ideas.).
The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: receiving a data evaluation profile for the training dataset (Receiving a data evaluation profile (i.e., mere data transmission in conjunction with the abstract idea) is directed to a well-understood, routine, conventional activity of data transmission; see MPEP 2106.05(d)(II)(i).); wherein the data evaluation profile is indicative of: (i) one or more evaluation features from the plurality of input features, (ii) one or more feature values respectively defined for each of the one or more evaluation features, and (iii) one or more input data object exceptions for each of the one or more evaluation features (This limitation amounts to adding insignificant extra-solution activity to the judicial exception. The recited descriptive dataset information represents mere data presentation and contextual description in conjunction with the abstract idea.).

Regarding claim 4, the rejection of claim 3 is incorporated herein. Further, claim 4 recites the following abstract ideas: determining a target ratio for an evaluation feature of the one or more evaluation features (This is an abstract idea of a mental process and a mathematical concept. The limitation involves reviewing feature values and performing a mathematical calculation to determine a ratio for the feature. A person could mentally evaluate the feature distribution and calculate the ratio using basic arithmetic with pen and paper or a calculator. Since it involves mental reasoning and mathematical calculation, it falls within the abstract groupings of mental processes and mathematical concepts.); generating a synthetic dataset for the evaluation feature based on the target ratio and the data evaluation profile, wherein the synthetic dataset comprises a plurality of synthetic data objects each comprising at least one feature value from one or more defined feature values of the evaluation feature (This is an abstract idea of a mental process. It involves reviewing the data evaluation profile and, using observation and judgement, creating synthetic data objects that reflect selected feature values. A person could mentally determine the values to assign to each synthetic record. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.); and generating the data evaluation score based on the synthetic dataset (This is an abstract idea of a mental process and a mathematical concept. It involves reviewing synthetic data and performing a mathematical calculation to derive a score that reflects the evaluation result. A person could mentally evaluate the dataset and compute the score using basic arithmetic with pen and paper or a calculator. Since it involves mental reasoning and mathematical calculations, it falls within the abstract groupings of mental processes and mathematical concepts.).

Regarding claim 5, the rejection of claim 4 is incorporated herein.
Further, claim 5 recites the following additional elements, which taken alone or in combination with other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: (i) the one or more defined feature values comprise a first feature value and a second feature value, (ii) the target ratio is indicative of a first expected frequency for the first feature value and a second expected frequency for the second feature value, (iii) the plurality of synthetic data objects comprises (a) one or more first synthetic data objects, each comprising the first feature value and (b) one or more second synthetic data objects, each comprising the second feature value, (iv) the one or more first synthetic data objects are based on the first expected frequency, and (v) the one or more second synthetic data objects are based on the second expected frequency (These limitations merely apply the abstract idea in the context of particular feature values, object groupings, and expected frequencies and therefore amount to insignificant extra-solution activity.).

Regarding claim 6, the rejection of claim 4 is incorporated herein. Claim 6 further recites the following abstract ideas: generating an input feature profile for a non-evaluation feature of the training dataset based on the training dataset and the synthetic dataset, wherein the input feature profile is indicative of a feature confidence score between the non-evaluation feature and the evaluation feature (This is an abstract idea of a mental process. The limitation involves reviewing feature values of the real dataset and the synthetic dataset, comparing how strongly a non-evaluation feature is associated with an evaluation feature, and assigning a confidence score to reflect that relationship. A person could, using observation, judgement, reasoning, and simple mathematical calculation, examine the feature values, determine the strength of the correlation, and express that strength as a confidence score in the mind or with basic tools such as pen and paper or a calculator. Since it involves mental evaluation and mathematical calculation that can practically be performed in the human mind, it falls within the mental process groupings of abstract ideas.); and generating the data evaluation score based on the feature confidence score (This is an abstract idea of a mental process and a mathematical concept. The limitation involves using a numerical confidence score and applying mathematical calculation to derive an overall evaluation score. A person could, using observation, judgement, reasoning, and simple arithmetic, review a confidence score and compute a resulting evaluation score mentally or with basic tools such as pen and paper or a calculator. Since it involves mental evaluation and mathematical calculation that can practically be performed in the human mind, it falls within the mental process and mathematical concept groupings of abstract ideas.).

The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: wherein the plurality of input features comprises the one or more evaluation features and one or more non-evaluation features (This limitation amounts to adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP 2106.05(g).).
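To make the examiner's "mathematical aggregation" characterization of the independent claims concrete, the following Python sketch illustrates one way the recited vector-and-score pattern could reduce to arithmetic. It is illustrative only: the component score values, the equal weights, and the weighted-mean aggregation are hypothetical stand-ins, not the applicant's disclosed scoring functions, which are not reproduced in the Office Action.

# Minimal sketch of the claim 1 pattern the examiner characterizes as a
# mathematical concept. All values and the weighting scheme are hypothetical.
from dataclasses import dataclass

@dataclass
class HolisticEvaluation:
    data_score: float      # (i) data evaluation score for the training dataset
    model_score: float     # (ii) model evaluation score for performance metrics
    decision_score: float  # (iii) decision evaluation score for an output class

    def vector(self):
        # The "holistic evaluation vector": an arrangement of the three scores.
        return [self.data_score, self.model_score, self.decision_score]

    def holistic_score(self, weights=(1/3, 1/3, 1/3)):
        # One possible "aggregation": a weighted mean over the vector.
        return sum(w * s for w, s in zip(weights, self.vector()))

evaluation = HolisticEvaluation(data_score=0.72, model_score=0.85, decision_score=0.64)
print(evaluation.vector())                    # [0.72, 0.85, 0.64]
print(round(evaluation.holistic_score(), 4))  # 0.7367

Each step here is ordinary arithmetic over previously computed numbers, which is precisely the ground on which the rejection places these limitations in the mental process and mathematical concept groupings.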
Regarding claim 7, the rejection of claim 6 is incorporated herein. Claim 7 further recites the following abstract ideas: generating a feature correlation score between the evaluation feature and the non-evaluation feature (This is an abstract idea of a mental process and a mathematical concept. The limitation involves comparing two features and applying mathematical calculation to determine the degree of correlation between them. A person could, using observation, judgement, reasoning, and basic arithmetic, review the feature values and compute a correlation score mentally or with simple tools such as pen and paper or a calculator. Since it involves mental evaluation and mathematical calculation that can practically be performed in the human mind, it falls within the mental process and mathematical concept groupings of abstract ideas.); determining a scaled feature correlation score based on the feature correlation score and the feature confidence score (This is an abstract idea of a mental process. The limitation involves reviewing two previously determined numerical values and, using observation and judgement, determining a scaled score that reflects their combined effect. A person could mentally evaluate the correlation score and the confidence score and determine an adjusted value using reasoning or simple arithmetic. Since this activity can practically be performed in the human mind, it falls within the mental process groupings of abstract ideas.); and in response to the scaled feature correlation score achieving a threshold score, augmenting the data evaluation profile with the non-evaluation feature (This is an abstract idea of a mental process. The limitation involves comparing a calculated score to a threshold value and, based on that comparison, deciding to include an additional feature in the profile. A person could mentally evaluate the score and determine whether the threshold is satisfied using judgement and reasoning. Since this activity can be practically performed in the human mind, it falls within the mental process grouping of abstract ideas.).

Regarding claim 8, the rejection of claim 7 is incorporated herein. Claim 8 further recites the following abstract ideas: generating an input feature risk score for the training dataset based on an aggregation of a plurality of scaled feature correlation scores for the one or more non-evaluation features, wherein the input feature risk score is indicative of a probability that the one or more non-evaluation features are impacted by the feature confidence score (This is an abstract idea of a mental process and a mathematical concept. The limitation involves aggregating multiple numerical scores and using mathematical calculation to derive a probability-type risk score reflecting the likelihood of impact. A person could, using observation, judgement, reasoning, and basic arithmetic, review the various correlation scores and compute a resulting probability or risk value mentally or with simple tools such as pen and paper or a calculator. Since it involves mental evaluation and mathematical calculation that can practically be performed in the human mind, it falls within the mental process and mathematical concept groupings of abstract ideas.); and generating the data evaluation score based on the input feature risk score (This is an abstract idea of a mental process and a mathematical concept.
The limitation involves using a previously determined numerical risk score and applying mathematical calculation to derive an overall evaluation score. A person could, using observation, judgement, reasoning, and simple arithmetic, review the risk score and compute the resulting evaluation score mentally or with basic tools such as pen and paper or a calculator. Since it involves mental evaluation and mathematical calculation that can practically be performed in the human mind, it falls within the mental process and mathematical concept groupings of abstract ideas.).

Regarding claim 9, the rejection of claim 8 is incorporated herein. Claim 9 further recites the following abstract ideas: generating…a plurality of first feature impact measures for the one or more evaluation features, wherein a first feature impact measure is indicative of a relative impact of the evaluation feature to a predictive output of the target machine learning model (This is an abstract idea of a mental process. The limitation involves reviewing outputs associated with each feature and determining, through judgement and reasoning, the degree to which each feature influences the model’s predictive output. A person could mentally evaluate the feature-output relationship and assess the relative level of impact using observation and reasoning. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.); generating, using one or more partial dependency plots, a plurality of second feature impact measures for the one or more evaluation features, wherein a second feature impact measure for the evaluation feature is indicative of a relationship type between the evaluation feature and one or more predicted output classes of the target machine learning model (This is an abstract idea of a mental process. The limitation involves reviewing partial dependency outputs and, through judgement and reasoning, determining the type of relationship between a feature and predicted output classes. A person could mentally evaluate the plots, observe how the output varies with the feature, and characterize the relationship using reasoning. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.); determining a data impact score for the training dataset based on the plurality of first feature impact measures and the plurality of second feature impact measures, wherein the data impact score is indicative of a probability that one or more predictive outputs by the target machine learning model are impacted by the feature confidence score (This is an abstract idea of a mental process. The limitation involves reviewing multiple impact measures and, using judgement and reasoning, determining an overall score that reflects the likelihood that the model’s predictive outputs are influenced by the feature confidence score. A person could mentally evaluate the impact measures and assess the resulting probability through observation and reasoning. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.); and generating the data evaluation score based on the data impact score (This is an abstract idea of a mental process and a mathematical concept.
The limitation involves using a previously determined score and applying mathematical calculation and judgement to derive a resulting evaluation score using observation, reasoning, and simple arithmetic in the mind or with basic tools. Since it involves mental evaluation and mathematical calculation that can practically be performed in the human mind, it falls within the mental process and mathematical concept groupings of abstract ideas.).

The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: interpretable machine learning model (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.); target machine learning model (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.).

Regarding claim 10, the rejection of claim 1 is incorporated herein. Claim 10 further recites the following abstract ideas: (v) the model evaluation score is based on a comparison between at least two of the one or more evaluation data object sets (This is an abstract idea of a mental process. The limitation involves comparing different groups of data objects and, through judgement and reasoning, determining a score based on that comparison. A person could mentally review the results for each group and evaluate the differences to derive the score. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.).

The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: (i) the training dataset comprises a plurality of input data objects and a plurality of input features, (ii) the plurality of input features comprises one or more evaluation features, (iii) the plurality of input data objects comprises one or more evaluation data object sets, and (iv) each evaluation data object set comprises one or more input data objects that each comprise a particular feature value of an evaluation feature (These limitations merely apply the abstract idea in the context of particular dataset organization and grouping of data objects based on feature values and therefore amount to insignificant extra-solution activity.).

Regarding claim 11, the rejection of claim 10 is incorporated herein. Claim 11 further recites the following abstract ideas: determining the first performance metric based on a selection rate comparison between the at least two evaluation data object sets (This is an abstract idea of a mental process. The limitation involves comparing selection rates across data groups and, through judgement and reasoning, determining a performance metric based on that comparison. A person could mentally review the selection rates for the groups and evaluate the resulting metric in the mind. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.); determining the second performance metric based on a false positive rate comparison between the at least two evaluation data object sets (This is an abstract idea of a mental process.
The limitation involves comparing false positive rates across data groups and, through judgement and reasoning, determining a performance metric based on that comparison. A person could mentally review the false positive rates for the groups and evaluate the resulting metric in the mind. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.); determining the third performance metric based on a false negative rate comparison between the at least two evaluation data object sets (This is an abstract idea of a mental process. The limitation involves comparing false negative rates across data groups and, through judgement and reasoning, determining a performance metric based on that comparison. A person could mentally review the false negative rates for the groups and evaluate the resulting metric in the mind. Since this activity can practically be performed in the human mind, it falls within the mental process groupings of abstract ideas.); and generating the model evaluation score based on an aggregation of the first performance metric, the second performance metric, and the third performance metric (This is an abstract idea of a mental process and a mathematical concept. The limitation involves combining multiple numerical performance metrics and applying mathematical calculation and judgement to derive an overall evaluation score. A person could mentally review the metrics and compute the resulting aggregate score using observation, reasoning, and simple arithmetic. Since it involves mental evaluation and mathematical calculation that can practically be performed in the human mind, it falls within the mental process and mathematical concept groupings of abstract ideas.).

The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: wherein the one or more performance metrics comprise a first performance metric, a second performance metric, and a third performance metric (This limitation amounts to adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP 2106.05(g).).

Regarding claim 12, the rejection of claim 1 is incorporated herein. Claim 12 further recites the following abstract ideas: (iii) the decision evaluation score is based on one or more counterfactual proposals for one or more of the plurality of predictive outputs that correspond to the negative output class (This is an abstract idea of a mental process. The limitation involves reviewing negative predictive outputs and, using judgement and reasoning, considering hypothetical changes (counterfactual proposals) that would cause the output to change, and basing a score on that review. A person could mentally evaluate the negative outputs and determine such counterfactual proposals through reasoning. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.).
The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: (i) the target machine learning model is previously trained to generate a plurality of predictive outputs for a plurality of input data objects (This merely applies the abstract idea using a trained machine learning model to produce predictive outputs and therefore amounts to insignificant extra-solution activity.), and (ii) each of the plurality of predictive outputs correspond to a positive output class or a negative output class (This limitation amounts to adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP 2106.05(g).).

Regarding claim 13, the rejection of claim 12 is incorporated herein. Claim 13 further recites the following abstract ideas: identifying, from the one or more counterfactual proposals, an evaluation counterfactual proposal that comprises an evaluation feature of the one or more evaluation features (This is an abstract idea of a mental process. The limitation involves reviewing multiple counterfactual proposals and, through judgement and reasoning, identifying one that includes an evaluation feature. A person could mentally evaluate the proposals and determine which one meets this condition. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.); in response to identifying the evaluation counterfactual proposal, generating…a recourse action for the evaluation counterfactual proposal (This is an abstract idea of a mental process. The limitation involves reviewing the identified counterfactual proposal and, using judgement and reasoning, determining an action to take based on that proposal. A person could mentally evaluate the proposal and decide what recourse action should be taken. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.); and generating the decision evaluation score based on the recourse action (This is an abstract idea of a mental process. The limitation involves reviewing the determined recourse action and, through judgement and reasoning, generating a score based on that action. A person could mentally evaluate a recourse action and determine the resulting decision evaluation score. Since this activity can practically be performed in the human mind, it falls within the mental process grouping of abstract ideas.).

The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: machine learning recourse model (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.).
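The claim 11 limitations discussed above compare selection, false positive, and false negative rates between evaluation data object sets and then aggregate the three comparisons. The Python sketch below illustrates that pattern under stated assumptions: the min/max-ratio comparison and the simple-mean aggregation are hypothetical choices, as the quoted claim language does not specify how the comparison or aggregation is performed.

# Hedged sketch of the claim 11 pattern: compare selection rate (SR),
# false positive rate (FPR), and false negative rate (FNR) between two
# evaluation data object sets, then aggregate the three comparisons.
def rates(y_true, y_pred):
    n = len(y_true)
    selection_rate = sum(y_pred) / n  # share of objects predicted positive
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0) or 1
    positives = sum(1 for t in y_true if t == 1) or 1
    return selection_rate, fp / negatives, fn / positives  # SR, FPR, FNR

def model_evaluation_score(set_a, set_b):
    # Compare each rate as min/max (1.0 = parity between the two object
    # sets), then aggregate the three comparisons with a simple mean.
    comparisons = []
    for ra, rb in zip(rates(*set_a), rates(*set_b)):
        lo, hi = sorted((ra, rb))
        comparisons.append(lo / hi if hi else 1.0)
    return sum(comparisons) / len(comparisons)

# Two evaluation data object sets, e.g. grouped by a feature value of an
# evaluation feature; each is (true labels, model predictions).
group_a = ([1, 0, 1, 0, 1, 0], [1, 0, 1, 1, 1, 0])
group_b = ([1, 0, 1, 0, 0, 0], [0, 0, 1, 1, 0, 0])
print(round(model_evaluation_score(group_a, group_b), 4))  # 0.4167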
Regarding claim 14, the following claim elements are abstract ideas: generate a holistic evaluation vector for a target machine learning model based on a plurality of evaluation scores for the target machine learning model, wherein the plurality of evaluation scores comprises: (i) a data evaluation score corresponding to a training dataset for the target machine learning model, (ii) a model evaluation score corresponding to one or more performance metrics for the target machine learning model, and (iii) a decision evaluation score corresponding to an output class of the target machine learning model (This is an abstract idea of a mental process and mathematical concepts. Under the broadest reasonable interpretation, the recited “data evaluation score,” “model evaluation score,” and “decision evaluation score” each correspond to numerical values that are generated from mathematical or statistical analysis of data. The “holistic evaluation vector” merely represents an arrangement of these numerical values. As such, the limitation is directed to mathematical calculations and the organization of numerical information, which are types of operations that can be performed in the human mind, with the aid of basic computing tools (e.g., pen and paper or a calculator). Accordingly, these limitations are directed to the mental process groupings and mathematical concept groupings of abstract ideas. See MPEP 2106.04(a)(2)(I) and 2106.04(a)(2)(III).); generate a holistic evaluation score for the target machine learning model based on an aggregation of the holistic evaluation vector (Under the broadest reasonable interpretation, the recited step of “generating a holistic evaluation score based on an aggregation of the holistic evaluation vector” amounts to performing a mathematical aggregation of previously-determined numerical scores. That is, the “holistic evaluation score” is simply a further numerical value derived from combining the underlying evaluation scores using mathematical operations. Such activity constitutes mathematical calculation and numerical analysis that can be performed in the human mind, optionally with the aid of basic computing tools (e.g., pen and paper or a calculator). Accordingly, this limitation is directed to a mathematical concept and a mental process.); and

The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: memory (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.); one or more processors (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.); provide an evaluation output for the target machine learning model based on the holistic evaluation score (This merely applies the result of the abstract idea and therefore amounts to insignificant extra-solution activity. See MPEP 2106.05(g).).

Regarding claim 15, the rejection of claim 14 is incorporated herein. The claim recites limitations similar to those of claim 2. Therefore, the same subject matter eligibility analysis applied to claim 2, as described above, is equally applicable to claim 15; claim 15 is therefore ineligible.

Regarding claim 16, the rejection of claim 15 is incorporated herein. The claim recites limitations similar to those of claim 3.
Therefore, the same subject matter eligibility analysis applied to claim 3, as described above, is equally applicable to claim 16; claim 16 is therefore ineligible.

Regarding claim 17, the rejection of claim 16 is incorporated herein. The claim recites limitations similar to those of claim 4. Therefore, the same subject matter eligibility analysis applied to claim 4, as described above, is equally applicable to claim 17; claim 17 is therefore ineligible.

Regarding claim 18, the rejection of claim 14 is incorporated herein. The claim recites limitations similar to those of claim 12. Therefore, the same subject matter eligibility analysis applied to claim 12, as described above, is equally applicable to claim 18; claim 18 is therefore ineligible.

Regarding claim 19, the following claim elements are abstract ideas: generate a holistic evaluation vector for a target machine learning model based on a plurality of evaluation scores for the target machine learning model, wherein the plurality of evaluation scores comprises: (i) a data evaluation score corresponding to a training dataset for the target machine learning model, (ii) a model evaluation score corresponding to one or more performance metrics for the target machine learning model, and (iii) a decision evaluation score corresponding to an output class of the target machine learning model (This is an abstract idea of a mental process and mathematical concepts. Under the broadest reasonable interpretation, the recited “data evaluation score,” “model evaluation score,” and “decision evaluation score” each correspond to numerical values that are generated from mathematical or statistical analysis of data. The “holistic evaluation vector” merely represents an arrangement of these numerical values. As such, the limitation is directed to mathematical calculations and the organization of numerical information, which are types of operations that can be performed in the human mind, with the aid of basic computing tools (e.g., pen and paper or a calculator). Accordingly, these limitations are directed to the mental process groupings and mathematical concept groupings of abstract ideas. See MPEP 2106.04(a)(2)(I) and 2106.04(a)(2)(III).); generate a holistic evaluation score for the target machine learning model based on an aggregation of the holistic evaluation vector (Under the broadest reasonable interpretation, the recited step of “generating a holistic evaluation score based on an aggregation of the holistic evaluation vector” amounts to performing a mathematical aggregation of previously-determined numerical scores. That is, the “holistic evaluation score” is simply a further numerical value derived from combining the underlying evaluation scores using mathematical operations. Such activity constitutes mathematical calculation and numerical analysis that can be performed in the human mind, optionally with the aid of basic computing tools (e.g., pen and paper or a calculator). Accordingly, this limitation is directed to a mathematical concept and a mental process.); and

The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception: one or more non-transitory computer-readable storage media (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.);
one or more processors (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.); provide an evaluation output for the target machine learning model based on the holistic evaluation score (This merely applies the result of the abstract idea and therefore amounts to insignificant extra-solution activity. See MPEP 2106.05(g).).

Regarding claim 20, the rejection of claim 19 is incorporated herein. The claim recites limitations similar to those of claim 10. Therefore, the same subject matter eligibility analysis applied to claim 10, as described above, is equally applicable to claim 20; claim 20 is therefore ineligible.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-11 and 14-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Jesus et al. (Pub. No. US 20230074606 A1, filed June 2022).

Regarding claim 1, Jesus discloses: A computer-implemented method comprising: generating, by one or more processors, a holistic evaluation vector for a target machine learning model based on a plurality of evaluation scores for the target machine learning model, wherein the plurality of evaluation scores comprises (Jesus, [Abstract] “The process includes using computer processor(s) and the received training data to train the generative model”; [0020] “Evaluating both fairness and predictive accuracy is typically practiced when introducing novel algorithms, methods, or metrics for bias mitigation.” – describes computer-implemented machine-learning systems executed using computer processors. Jesus further explains that each machine-learning model is evaluated using multiple numerical metrics, including fairness and predictive-accuracy measures. Under BRI, a “vector” reasonably encompasses a collection of numerical evaluation values. Therefore, the multiple evaluation metrics generated for a model correspond to a holistic evaluation vector comprising a plurality of evaluation scores for the target machine-learning model generated by processors, as recited.)
: (i) a data evaluation score corresponding to a training dataset for the target machine learning model (Jesus, paragraph [0017] “Evaluation remains an obstacle to progress in fair ML because of a lack of consistent, well-established, and systematic evaluation of fairness; and scarcity of realistic, large tabular datasets for algorithmic decision-making; among other things.”; [0029] “It may be relevant to evaluate models and bias mitigation techniques beyond the bias that naturally occur in datasets (e.g., to artificially inject predefined types of bias into the dataset). This provides fine-grained control over experiments and increases the overall robustness of a benchmark.”; [0030]-[0031] “There are several definitions of bias in data…three different types of bias related to a given protected attribute can be defined as: (i) group size disparities, (ii) prevalence disparities, and (iii) distinct conditional class separability.” – describes computing bias metrics on the dataset itself, including group size disparity, prevalence disparity, and conditional class separability. These metrics evaluate characteristics of the training dataset related to fairness. Under BRI, a “data evaluation score” reasonably includes any quantitative fairness or bias metric computed from a dataset.), (ii) a model evaluation score corresponding to one or more performance metrics for the target machine learning model (Jesus, paragraph [0020] “Evaluating both fairness and predictive accuracy is typically practiced when introducing novel algorithms, methods, or metrics for bias mitigation.”; [0022] “reports of model performance generally refer to a single operating point, i.e., a single threshold…”; [0087] “In an evaluation, an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric.” – describes evaluating machine learning models using numerical metrics including predictive-accuracy measures and fairness scores. Jesus further explains that model performance is reported at defined operating thresholds and that fairness performance may be expressed as a numerical score, such as an 80% fairness threshold. These fairness and predictive-accuracy metrics constitute numerical performance measures describing how the model performs. Under the broadest reasonable interpretation, such numerical model-performance metrics correspond to a “model evaluation score corresponding to one or more performance metrics for the target machine learning model,” as recited.), and (iii) a decision evaluation score corresponding to an output class of the target machine learning model (Jesus, paragraph [0038] “In assistive settings, a positive prediction is related to a positive outcome for the individual (e.g., funding for their project). As such, fairness is achieved by maximizing Equation 1 for the positive class y=1 (ratio of true positive rates). This fairness metric is also known as equal opportunity. Conversely, in punitive settings, a positive prediction is related to a negative outcome for the individual (e.g., losing access to their bank account for being flagged as fraudulent). In these cases, fairness is achieved by maximizing Equation 1 for the negative class y=0 (ratio of false positive rates). This fairness metric is also known as predictive equality, or equal opportunity with reference to y=0” – describes computing fairness metrics that are explicitly conditioned on the model’s predicted output class.
For example, in assistive settings fairness is evaluated based on the true-positive rate for the positive output class, and in punitive settings fairness is evaluated based on the false-positive rate for the negative output class. These metrics represent numerical scores that depend on which output class is being predicted. Under BRI, a numerical fairness metric that is computed for a particular predicted output class constitutes a “decision evaluation score corresponding to an output class of the target machine learning model.”); generating, by the one or more processors, a holistic evaluation score for the target machine learning model based on an aggregation of the holistic evaluation vector; and providing, by the one or more processors, an evaluation output for the target machine learning model based on the holistic evaluation score (Jesus, paragraph [0087] “Experimental results show that the disclosed techniques perform well. In an evaluation, an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric. In various embodiments, if no model is found to be fair for some algorithm, this result is output along with a model found to mostly closely the fairness criterion/criteria.”; [0088] “Globally, it is noticeable that conventional classification algorithms show general good predictive accuracy but poor fairness.”; [0089] “When unaware algorithms satisfy the fairness threshold, the TPR measurement is relatively low (<20%), which may constitute a steep fairness-performance trade-off.” – describes evaluating machine learning models using multiple numerical metrics, including fairness scores and predictive-accuracy performance. Jesus explains that a model is determined to be fair if it exceeds a fairness threshold and that models may also exhibit differing accuracy performance characteristics. In some cases, these metrics are jointly considered, such as where satisfaction of a fairness threshold is balanced against true-positive-rate performance. Under BRI, a collection of multiple evaluation scores constitutes a “vector,” and jointly considering these scores to determine whether a model satisfies fairness and performance criteria constitutes generating a “holistic evaluation score based on an aggregation of the holistic evaluation vector.” Jesus further explains that the result of this evaluation is output, including whether a model satisfies the fairness criteria or which model most closely meets them, an evaluation determination corresponding to “providing…an evaluation output for the target machine learning model based on the holistic evaluation score.”)

Regarding claim 2, Jesus discloses: The computer-implemented method of claim 1, wherein: (i) the target machine learning model is previously trained based on the training dataset, (ii) the training dataset comprises a plurality of input data objects and a plurality of input features, (iii) each input data object of the plurality of input data objects comprises an input feature value for one or more of the plurality of input features (Jesus, paragraph [0042] “Techniques for obtaining a generated dataset with a predetermined bias for evaluating algorithmic fairness of a machine learning model are disclosed. In various embodiments, the generated dataset includes training data (used to train a machine learning model) and/or test data (used to test the performance of a machine learning model).
The generated dataset (sometimes called “benchmark suite” or “benchmark”) evaluates ML fairness under different biased patterns in data, indicating which types of data bias a given Fair ML algorithm is capable of handling and the robustness (resistance) of a trained ML model to the presence of bias in data.”; [0044] “Dataset generator 152 is configured to receive an input dataset 140. The input dataset, sometimes called a seed dataset, is processed by the dataset generator according to the disclosed techniques to generate an evaluation dataset. Dataset 140 may be tabular or any other format.” – describes using a generated dataset to train a machine learning model, which is then evaluated for fairness and robustness. This corresponds to the claim requirement “the target machine learning model is previously trained based on the training dataset.” Jesus further discloses that the input dataset may be tabular. Under BRI, a tabular dataset consists of a grid-like structure in which the rows correspond to a plurality of input data objects and the columns correspond to a plurality of input features. In such a structure, each row necessarily contains at least one input feature value in one or more feature columns. Therefore, Jesus teaches that the training dataset comprises a plurality of input data objects and a plurality of input features, and that each input data object comprises an input feature value for one or more of the plurality of input features.), and (iv) the data evaluation score is indicative of a balance of the training dataset with respect to one or more of the plurality of input features (Jesus, paragraph [0036] “The choice of fairness and performance metric may be highly task dependent. For instance, one can trivially achieve high accuracy (or low misclassification rate) on datasets with severe class imbalance (if a class represents 99% of the data, a model can achieve 99% accuracy by always predicting that class). Regarding fairness metrics, one can trivially achieve perfect equal opportunity by predicting all samples as positive, or achieve perfect predictive equality by predicting all samples as negative. As such, some ways to make metrics comparable between different models include setting a given threshold budget (e.g., number of allowed positive predictions) or choosing a specific point in the ROC curve (e.g., maximum number of false positives, or minimum number of true positives).” – describes computing fairness evaluation metrics for machine-learning models using ROC-curve-based quantities such as false positive rates, true positive rates, and threshold budgets. Jesus further explains that these values change when the dataset is severely imbalanced (e.g., when one class represents 99% of the data). Because these ROC-based fairness metrics are calculated from the distribution of the data across feature-defined groups in the training dataset, the resulting numerical values inherently indicate whether the dataset is balanced or imbalanced. Under BRI, such ROC-based fairness metrics constitute a “data evaluation score” that is indicative of a balance of the training dataset with respect to one or more input features.)
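The “Equation 1” referenced in Jesus's quoted paragraph [0038] is not reproduced in the Office Action. Assuming it takes the conventional min-over-max group-ratio form commonly used for equal opportunity and predictive equality, the metric can be sketched as follows; the function names and dataset values below are hypothetical illustrations, not text from Jesus:

# Sketch of the group-ratio fairness metric the OA attributes to Jesus.
# target_label=1 gives a TPR ratio (equal opportunity, assistive settings);
# target_label=0 gives an FPR ratio (predictive equality, punitive settings).
def group_rate(y_true, y_pred, group, g, target_label):
    # Rate of positive predictions among members of group g whose
    # true label equals target_label (TPR for 1, FPR for 0).
    idx = [i for i, gi in enumerate(group) if gi == g and y_true[i] == target_label]
    return sum(y_pred[i] for i in idx) / len(idx) if idx else 0.0

def fairness_ratio(y_true, y_pred, group, target_label):
    # Min-over-max ratio of the per-group rates; 1.0 means parity.
    values = sorted(set(group))
    rs = [group_rate(y_true, y_pred, group, g, target_label) for g in values]
    lo, hi = min(rs), max(rs)
    return lo / hi if hi else 1.0

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]
group  = ["young"] * 4 + ["old"] * 4
print(fairness_ratio(y_true, y_pred, group, target_label=1))  # 0.5

Against the 80% fairness threshold quoted from paragraph [0087], this hypothetical model (ratio 0.5) would be judged unfair.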
Regarding claim 3, Jesus discloses: The computer-implemented method of claim 2 further comprising: receiving a data evaluation profile for the training dataset, wherein the data evaluation profile is indicative of: (i) one or more evaluation features from the plurality of input features, (ii) one or more feature values respectively defined for each of the one or more evaluation features, and (iii) one or more input data object exceptions for each of the one or more evaluation features (Jesus, paragraph [0031] “three different types of bias related to a given protected attribute can be defined as: (i) group size disparities, (ii) prevalence disparities, and (iii) distinct conditional class separability.”; [0080] “The Fraud dataset contains anonymized tabular data from a real-world stream of client-bank interactions…The protected attribute is the client age. Although it is a discrete value, in some experiments, the client age is binarized to compute fairness metrics, by considering a threshold at age 50. The train set has 80% of the records belonging to the younger group, with a fraud rate of 1.5%, while the remaining 20% are in the older group, with a fraud rate of 3%.” – describes fairness evaluation being performed with respect to protected attributes such as client age. These protected attributes correspond to evaluation features of the training dataset. Jesus further discloses that such protected attributes are divided into defined demographic groups, such as “younger group” or “older group,” which represent distinct values of the evaluation feature. Under BRI, an attribute used for fairness evaluation (such as client age) constitutes an “evaluation feature,” and the demographic groupings associated with that attribute constitute “feature values” defined for the evaluation feature. Accordingly, Jesus teaches a data evaluation profile that is indicative of (i) one or more evaluation features from a plurality of input features and (ii) one or more feature values defined for each such evaluation feature.); (iii) one or more input data object exceptions for each of the one or more evaluation features (Jesus, paragraph [0109] “the process builds a dataset by randomly sampling the GAN and transforming the synthetic data to ensure several domain constraints, such as value ranges, that are otherwise not captured by the model. In an embodiment, the process applies one or more filters to discard instances that are invalid. For example, synthetic instances with negative values on count-based features are invalid, because they may only take positive integer values. Another type of invalid instance may be a repeated instance, causing repeated instances within the generated dataset or the original dataset to be removed. Filtering may also be performed to enforce privacy constraints so that records cannot be traced back to the original dataset.” – discloses that, during dataset processing and evaluation, certain data instances are identified as invalid, such as instances containing negative feature values or repeated records. These instances are then discarded or removed from the dataset.
Under BRI, data objects that are specifically identified and filtered out from the dataset constitute “input data object exceptions,” since they are treated differently from the remaining dataset objects.); and generating the data evaluation score based on the data evaluation profile (Jesus, paragraph [0031] “three different types of bias related to a given protected attribute can be defined as: (i) group size disparities, (ii) prevalence disparities, and (iii) distinct conditional class separability.”; [0087] “an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric.”; [0091] “The TFCO algorithm achieved the best results for the Type 2 bias dataset. It outperformed the algorithms for all thresholds, while achieving high fairness scores…The Grid Search method achieved the best score in the Type 3 dataset, with the Logistic Regression variation outperforming all other fair algorithms in the task. At a 10% threshold, two models of this kind were the only ones to achieve fairness and score >60% TPR (FIG. 3).” – describes fairness evaluation being performed with respect to protected attributes in the dataset. Jesus further explains that fairness metrics are computed based on the distribution of records across demographic groups, and that these metrics produce fairness scores that are evaluated against thresholds (e.g., an 80% fairness threshold or >60% TPR fairness score). Different algorithms and datasets are compared using these fairness scores. Under BRI, the dataset information identifying the evaluation features, their feature values, and the distribution of data objects across those values constitutes a data evaluation profile, and the fairness metric output constitutes a data evaluation score generated based on that data evaluation profile.)

Regarding claim 4, Jesus discloses: determining a target ratio for an evaluation feature of the one or more evaluation features (Jesus, paragraph [0034] “Distinct conditional class separability extends the previous definition by including the joint distribution of input features X and Y label, P[X,Y]≠P[X,Y|A]. This is achieved by moving the distributions of classes enough so that a linear decision boundary obtains the predefined cumulative value for a negative class (FPR) and for a positive class (TPR).”; [0045] “Bias introducer 156 is configured to inject a predetermined bias, which may be specified according to a configuration, to the anonymized reconstructed dataset 154 to form an evaluation dataset 158. An example of an evaluation dataset is dataset 118. The predetermined bias configured enables a user to specify a desired type of bias to inject into the dataset 154.” – describes adjusting the distribution of dataset records across different values of a protected attribute in order to obtain predefined fairness-relevant rates, such as specific true-positive-rate or false-positive-rate levels for particular demographic groups. Jesus further explains that the system may inject a predetermined bias into the dataset, where the user specifies the desired bias configuration to be applied to the protected-attribute groups.
Under BRI, specifying a desired or predetermined distribution or rate for an evaluation feature (such as the relative proportion or rate of outcomes across demographic groups) corresponds to “determining a target ratio for an evaluation feature of the one or more evaluation features.”); generating a synthetic dataset for the evaluation feature based on the target ratio and the data evaluation profile, wherein the synthetic dataset comprises a plurality of synthetic data objects each comprising at least one feature value from one or more defined feature values of the evaluation feature (Jesus, paragraph [0041] “Techniques for generating anonymized and biased datasets are disclosed.” [0042] “the generated dataset includes training data (used to train a machine learning model) and/or test data (used to test the performance of a machine learning model).” [0045] “Bias introducer 156 is configured to inject a predetermined bias, which may be specified according to a configuration, to the anonymized reconstructed dataset 154 to form an evaluation dataset 158.” [0048] “In an embodiment, (anonymized) biased dataset generator 100 is configured to produce a synthetic dataset with domain constraints 116. The biased dataset generator 100 includes a feature pre-processor and anonymizer 102, a generative model 104 (such as a GAN or CTGAN), and optionally one or more samplers 106 and 108.” – discloses generating a synthetic dataset using a generative model (e.g., a GAN or CTGAN) in which a predetermined bias is injected for a protected attribute, resulting in a dataset whose records reflect a specified distribution across the attribute’s values. Under BRI, generating a biased synthetic dataset for an evaluation attribute according to a predetermined distribution corresponds to “generating a synthetic dataset for the evaluation feature based on the target ratio and the data evaluation profile.” The resulting synthetic dataset necessarily comprises multiple synthetic data objects, each including one of the defined feature values of that evaluation feature. Accordingly, Jesus discloses generating a synthetic dataset including a plurality of synthetic data objects each comprising at least one feature value of the evaluation feature.); and generating the data evaluation score based on the synthetic dataset (Jesus, paragraph [0087] “80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric. In various embodiments, if no model is found to be fair for some algorithm, this result is output along with a model found to mostly closely the fairness criterion/criteria.” [0091] “The TFCO algorithm achieved the best results for the Type 2 bias dataset. It outperformed the algorithms for all thresholds, while achieving high fairness scores…The Grid Search method achieved the best score in the Type 3 dataset, with the Logistic Regression variation outperforming all other fair algorithms in the task. At a 10% threshold, two models of this kind were the only ones to achieve fairness and score >60% TPR (FIG. 3).” – discloses that machine-learning models are evaluated on biased synthetic datasets generated for protected-attribute features, and that fairness metrics are computed from model performance on those datasets. Jesus explains that these fairness metrics produce numerical fairness scores which are compared to thresholds (e.g., determining whether the score exceeds an 80% fairness threshold or achieves >60% TPR).
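A minimal sketch of such a threshold check follows, assuming a toy disparity metric and an invented, deliberately age-biased classifier; none of the names below come from Jesus:

    # Illustrative sketch: score a model on a synthetic dataset, then apply
    # the 80% fairness threshold noted in [0087]. All names are assumptions.
    def group_rate_ratio(y_pred, groups):
        """Toy disparity metric: ratio of per-group positive-prediction
        rates; 1.0 means the groups are treated identically."""
        rates = []
        for g in set(groups):
            preds = [p for p, gi in zip(y_pred, groups) if gi == g]
            rates.append(sum(preds) / len(preds))
        return min(rates) / max(rates) if max(rates) else 1.0

    synthetic_X = [{"age": 30}, {"age": 70}, {"age": 25}, {"age": 65}]
    groups = ["younger" if x["age"] < 50 else "older" for x in synthetic_X]
    model = lambda x: int(x["age"] >= 50)       # toy, age-biased classifier
    score = group_rate_ratio([model(x) for x in synthetic_X], groups)
    print(score, score > 0.80)                  # 0.0 False -> unfair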
Under BRI, a fairness score computed from model performance on the biased synthetic dataset constitutes a “data evaluation score generated based on the synthetic dataset.” Accordingly, Jesus discloses generating the data evaluation score based on the synthetic dataset.). Regarding claim 5, Jesus discloses: The computer-implemented method of claim 4, wherein: (i) the one or more defined feature values comprise a first feature value and a second feature value, (ii) the target ratio is indicative of a first expected frequency for the first feature value and a second expected frequency for the second feature value, (iii) the plurality of synthetic data objects comprises (a) one or more first synthetic data objects, each comprising the first feature value and (b) one or more second synthetic data objects, each comprising the second feature value, (iv) the one or more first synthetic data objects are based on the first expected frequency, and (v) the one or more second synthetic data objects are based on the second expected frequency (Jesus, paragraph [0032] “Group size disparity is given by ∃ a ∈ A : P[A = a] ≠ 1/N, where a represents a single group from a given protected attribute A, and N the number of possible groups. This results in different frequencies for possible values of the protected attribute.” [0048] “(anonymized) biased dataset generator 100 is configured to produce a synthetic dataset with domain constraints 116” [0049] “feature pre-processor and anonymizer 102 is configured to create features, such as aggregations, that better describe the records to a machine learning algorithm.” – teaches that fairness evaluation is performed with respect to a protected attribute that has multiple possible values (for example, “younger” and “older” age groups). Jesus further explains that the group-size disparity results in different frequencies for those possible values. Under BRI, the protected attribute corresponds to an evaluation feature, and its possible values correspond to the first and second feature values, each associated with an expected frequency. Jesus also discloses that the biased dataset generator produces a synthetic dataset whose records include the dataset features. Because the dataset is generated to reflect the differing frequencies of the protected-attribute values, the resulting synthetic dataset necessarily includes synthetic data objects corresponding to the first feature value and additional synthetic data objects corresponding to the second feature value, where the numbers of such objects reflect their respective expected frequencies.). Regarding claim 6, Jesus discloses: The computer-implemented method of claim 4, wherein the plurality of input features comprises the one or more evaluation features and one or more non-evaluation features, wherein the computer-implemented method further comprises: generating an input feature profile for a non-evaluation feature of the training dataset based on the training dataset and the synthetic dataset, wherein the input feature profile is indicative of a feature confidence score between the non-evaluation feature and the evaluation feature; and generating the data evaluation score based on the feature confidence score (Jesus, paragraph [0033] “Prevalence disparity occurs when P[Y]≠P[Y|A], i.e., the class probability is dependent on the protected group.” [0034] “Distinct conditional class separability extends the previous definition by including the joint distribution of input features X and Y label, P[X,Y]≠P[X,Y|A].
This is achieved by moving the distributions of classes enough so that a linear decision boundary obtains the predefined cumulative value for a negative class (FPR) and for a positive class (TPR).” [0044] “Dataset generator 152 is configured to receive an input dataset 140. The input dataset, sometimes called a seed dataset, is processed by the dataset generator according to the disclosed techniques to generate an evaluation dataset.” [0048] “(anonymized) biased dataset generator 100 is configured to produce a synthetic dataset with domain constraints 116. The biased dataset generator 100 includes a feature pre-processor and anonymizer 102, a generative model 104” – describes evaluating model fairness with respect to a protected attribute A, while the dataset also includes other predictive features X. Under BRI, the protected attribute corresponds to the claimed evaluation feature, and the remaining model inputs correspond to non-evaluation features, meaning the plurality of input features includes both types. Jesus further explains that biased synthetic datasets are generated from the original training dataset and used to measure how output and feature relationships change when conditioned on the protected attribute (e.g., comparing P[Y] to P[Y|A] or P[X,Y] to P[X,Y|A]). These conditional-probability measures numerically indicate the dependence between the evaluation feature and the non-evaluation features and therefore correspond to the claimed feature confidence score and input feature profile generated based on the training and synthetic datasets. Because Jesus computes fairness scores from these dependence measures, the resulting fairness score corresponds to the claimed data evaluation score generated based on the feature confidence score, as recited.). Regarding claim 7, Jesus discloses: The computer-implemented method of claim 6 further comprising: generating a feature correlation score between the evaluation feature and the non-evaluation feature (Jesus, paragraph [0057] “a second step of evaluation, which is based on the statistical comparison of the generated data and the original data, is divided in two different parts, which are the evaluation of interaction between features, and the evaluation of individual distribution of features. Because of this, the correlation between pairs of features is measured to produce a correlation matrix.” [0058] “the maximum absolute difference in correlations matrices of the original and generated datasets is calculated. In the latter, distributions are compared individually through a similarity metric, such as the Jansen-Shannon divergence or Wasserstein metric, or alternatively Kolmogorov-Smirnov test/distance.” – describes computing correlation values between pairs of dataset features when evaluating the generated synthetic dataset. Under BRI, computing correlations between the protected attribute (used as the fairness-evaluation feature) and other predictive input attributes (non-evaluation features) corresponds to “generating a feature correlation score between the evaluation feature and the non-evaluation feature,” as recited.); determining a scaled feature correlation score based on the feature correlation score and the feature confidence score (Jesus, paragraph [0089] “In an embodiment, a protected attribute column is removed from the dataset before training the fairness-blind algorithms…since removing the protected attribute before training typically does not account for other correlated features.
The algorithms will still have access to said features, leaving their predictions subject to the remaining latent bias…In the latter, the protected attribute is synthetic, and correlated only with the class label, not the features (in expected value). Thus, removing it allows algorithms to keep the performance high and become fair.” [0090] “ correlations between the protected attribute and the class label are removed by undersampling the majority group's negative observations. Doing so also addresses the problem of correlations with the features, as some of this information is eliminated when dropping observations.” – explains that the dataset includes a protected attribute (evaluation feature) that is correlated with other predictive input attributes (non-evaluation features). These correlations indicate the degree of dependency between the features, which corresponds to a feature confidence score reflecting the strength of association. Jesus further explains that the fairness-evaluation scores are recomputed after correlation effects are removed (for example, by undersampling correlated feature values). These updated scores therefore represent scaled correlation-based scores that depend both on the correlation value and the confidence (dependency strength) between the evaluation and non-evaluation features. Under BRI, this corresponds to determining a scaled feature correlation score based on the feature correlation score and the feature confidence score, as recited.); and in response to the scaled feature correlation score achieving a threshold score, augmenting the data evaluation profile with the non-evaluation feature (Jesus, paragraph [0087] “In an evaluation, an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric. In various embodiments, if no model is found to be fair for some algorithm, this result is output along with a model found to mostly closely the fairness criterion/criteria.” [0089] “In an embodiment, a protected attribute column is removed from the dataset before training the fairness-blind algorithms. In the Base Fraud, Type 2 and Type 3 bias datasets, unawareness leads to an increase in fairness, even where desired thresholds are unmet. When unaware algorithms satisfy the fairness threshold, the TPR measurement is relatively low (<20%), which may constitute a steep fairness-performance trade-off. The small fairness increase is not surprising, since removing the protected attribute before training typically does not account for other correlated features. The algorithms will still have access to said features, leaving their predictions subject to the remaining latent bias.” [0090] “Equalizing prevalences in the training set leads to good results…correlations between the protected attribute and the class label are removed by undersampling the majority group's negative observations. Doing so also addresses the problem of correlations with the features, as some of this information is eliminated when dropping observations.” [0095] “However, for more complex bias patterns...removing the protected attribute does not necessarily mask the protected attribute, so unawareness is ineffective. In these cases, in-processing methods may achieve much better results. 
Increasing the threshold of predicted positives leads to general increases in both performance and fairness.” – discloses that fairness evaluation is performed with respect to protected attributes and subject to defined fairness thresholds (e.g., an 80% fairness threshold). Jesus explains that even when the protected attribute is removed, “correlated features” continue to encode “remaining latent bias,” preventing the model from satisfying the fairness threshold. Jesus further teaches that, when such correlation exists, additional fairness-aware processing must explicitly account for those correlated features in the fairness-evaluation and mitigation process. Under BRI, detecting that a non-evaluation feature remains correlated with the protected attribute and prevents achievement of the fairness threshold corresponds to the “scaled feature correlation score achieving a threshold score.” Expanding the fairness-evaluation scope to explicitly account for those correlated features corresponds to “augmenting the data evaluation profile with the non-evaluation feature.” Thus, Jesus teaches modifying the set of fairness-evaluated features in response to the correlation strength relative to a fairness threshold.). Regarding claim 8, Jesus discloses: The computer-implemented method of claim 7 further comprising: generating an input feature risk score for the training dataset based on an aggregation of a plurality of scaled feature correlation scores for the one or more non-evaluation features, wherein the input feature risk score is indicative of a probability that the one or more non-evaluation features are impacted by the feature confidence score; and generating the data evaluation score based on the input feature risk score (Jesus, paragraph [0089] “In an embodiment, a protected attribute column is removed from the dataset before training the fairness-blind algorithms…since removing the protected attribute before training typically does not account for other correlated features. The algorithms will still have access to said features, leaving their predictions subject to the remaining latent bias.” [0038] “Conversely, in punitive settings, a positive prediction is related to a negative outcome for the individual (e.g., losing access to their bank account for being flagged as fraudulent). In these cases, fairness is achieved by maximizing Equation 1 for the negative class y=0 (ratio of false positive rates).” [0087] “an 80% fairness threshold was used…is considered to be fair if it scores higher than 80% in the fairness metric.” [0090] “correlations between the protected attribute and the class label are removed by undersampling the majority group's negative observations. Doing so also addresses the problem of correlations with the features, as some of this information is eliminated when dropping observations.” – explains that the fairness evaluation in punitive settings is based on probability-based metrics such as the false-positive rate, which represents the likelihood of a negative outcome (e.g., being incorrectly flagged as fraudulent). Jesus further explains that removing the protected attribute does not eliminate bias when other features remain correlated, because model predictions remain “subject to remaining latent bias.” Jesus also teaches computing fairness metrics after addressing such correlations, resulting in updated fairness-evaluation values that reflect the dependency between the protected attribute and the correlated features.
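The aggregation just described might be sketched as follows; the scoring scheme (a mean of confidence-scaled correlations) and all names are editorial assumptions, not the reference's method:

    # Illustrative sketch: scale per-feature correlations by a confidence
    # weight and aggregate them into one dataset-level risk score.
    def input_feature_risk(correlations, confidence):
        """correlations: {feature: |corr with protected attribute|} in [0,1];
        confidence: weight in [0,1]. Returns (mean scaled score, per-feature)."""
        scaled = {f: c * confidence for f, c in correlations.items()}
        return sum(scaled.values()) / len(scaled), scaled

    risk, scaled = input_feature_risk(
        {"txn_count": 0.62, "merchant_type": 0.15}, confidence=0.9)
    print(round(risk, 4))  # 0.3465 -- higher values suggest latent bias
                           # carried by features correlated with the attribute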
Under BRI, each updated fairness-evaluation value based on correlation strength corresponds to a scaled feature-correlation score. Aggregating these multiple scaled-correlation values across features corresponds to generating an input-feature risk score for the dataset. Because these fairness-evaluation metrics are probability-based (e.g., false-positive rate) and directly reflect the impact of the correlated features on model bias, the aggregated value is indicative of a probability that the non-evaluation features are impacted by the feature-confidence (correlation-strength) score. Jesus further teaches using this aggregated fairness value when determining the fairness of the dataset, which corresponds to generating the claimed data-evaluation score based on the input-feature risk score.). Regarding claim 9, Jesus discloses: The computer-implemented method of claim 8 further comprising: generating, using an interpretable machine learning model, a plurality of first feature impact measures for the one or more evaluation features, wherein a first feature impact measure is indicative of a relative impact of the evaluation feature to a predictive output of the target machine learning model (Jesus, paragraph [0033] “Prevalence disparity occurs when P[Y]≠P[Y|A], i.e., the class probability is dependent on the protected group.” [0034] “Distinct conditional class separability extends the previous definition by including the joint distribution of input features X and Y label, P[X,Y]≠P[X,Y|A]. This is achieved by moving the distributions of classes enough so that a linear decision boundary obtains the predefined cumulative value for a negative class (FPR) and for a positive class (TPR).” [0038] “In assistive settings, a positive prediction is related to a positive outcome for the individual (e.g., funding for their project). As such, fairness is achieved by maximizing Equation 1 for the positive class y=1 (ratio of true positive rates)…in punitive settings, a positive prediction is related to a negative outcome for the individual (e.g., losing access to their bank account for being flagged as fraudulent). In these cases, fairness is achieved by maximizing Equation 1 for the negative class y=0 (ratio of false positive rates).” [0086] “A set of commonly used fairness-blind ML algorithms: Logistic Regression, Decision Tree, Random Forest, LightGBM, XGBoost, and Neural Networks (MLP) was benchmarked. In addition, two state-of-the-art bias reduction algorithms were evaluated.” – discloses evaluating how a protected attribute (the evaluation feature) affects the predictive output of machine-learning models. Jesus explains that “distinct conditional class separability” occurs when the joint distribution of features and labels changes such that the classifier’s false-positive rate or true-positive rate varies depending on the value of the protected attribute. Under BRI, the numerical change in TPR/FPR attributable to a feature constitutes a feature-impact measure indicating the relative effect of that feature on the model’s predictive output. Jesus further discloses that the fairness analysis is performed using machine-learning models including Logistic Regression and Decision Trees. These models are conventionally recognized as interpretable ML models that yield measurable feature-impact relationships (e.g., regression coefficients or split-based contributions) indicating how strongly a feature influences prediction outcomes.
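As a hedged illustration of reading feature-impact measures off an interpretable model, the following sketch uses scikit-learn's LogisticRegression (chosen only for familiarity; the feature names and synthetic data are invented):

    # Illustrative sketch: |coefficient| magnitudes of a logistic regression
    # serve as first feature-impact measures per input feature.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                  # three hypothetical features
    y = (X[:, 0] + 0.2 * X[:, 1] > 0).astype(int)  # outcome driven by feature 0

    model = LogisticRegression().fit(X, y)
    impact = np.abs(model.coef_[0])                # one impact value per feature
    print(dict(zip(["age_group", "txn_count", "merchant_type"],
                   impact.round(2))))              # feature 0 dominates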
Accordingly, Jesus teaches generating, using an interpretable machine-learning model, feature-impact measures that indicate the relative impact of the evaluation feature on the predictive output of the target machine-learning model, as recited.); generating, using one or more partial dependency plots, a plurality of second feature impact measures for the one or more evaluation features, wherein a second feature impact measure for the evaluation feature is indicative of a relationship type between the evaluation feature and one or more predicted output classes of the target machine learning model (Jesus, paragraph [0008] “FIG. 3 shows a graphical representation of distinct conditional class separability for a feature distribution for all instances, on the left, for a majority group, on the middle, and for a minority group, on the right according to an embodiment.” [0010] “FIG. 5 shows a graphical representation of various models' performance and fairness, all fraud datasets for the top 10% predicted positives according to an embodiment.” [0011] “FIG. 6A shows a graphical representation of a fraud type 3 dataset models performance for the top 5% predicted positives according to an embodiment.” [0012] “FIG. 6B shows a graphical representation of a Donors Choose dataset models performance for the top 5% predicted positives according to an embodiment.” [0034] “Distinct conditional class separability extends the previous definition by including the joint distribution of input features X and Y label” – determines how the model’s predictions change as a function of an evaluation feature. Jesus explains that fairness is evaluated based on conditional class behavior, and this dependency is graphically illustrated in FIG. 3, which plots feature values against class-label outcomes for different protected-attribute groups. Jesus further plots model output behavior in FIGS. 5-6B, where fairness and predictive-performance metrics are shown for different model outputs and datasets. Under BRI, these graphical plots show the dependency between an evaluation feature and the model’s predicted output classes, which corresponds to partial dependency plots. The plotted values therefore constitute feature-impact measures because they quantify how strongly the evaluation feature influences the predicted class outcomes.); determining a data impact score for the training dataset based on the plurality of first feature impact measures and the plurality of second feature impact measures, wherein the data impact score is indicative of a probability that one or more predictive outputs by the target machine learning model are impacted by the feature confidence score (Jesus, paragraph [0098] “Using the previously mentioned datasets, over 5,000 models were evaluated in datasets reflecting distinct real-world case scenarios. Considering a standardized set of fairness metrics, different hyperparameter searches were performed for eight different ML algorithms…Initial results show that 1) baselines tend to exhibit better predictive performance but poor fairness…” [0099] TABLE 2 Fraud base test results Threshold 5% 10% 20%…LGBM + EP…Global TPR 42.09%…Pred. Equality 97.47%…57.67%…98.13%…74.71%…88.6%…FIG.
5 shows a graphical representation of various models' performance and fairness, all fraud datasets for the top 10% predicted positives.” – discloses computing standardized fairness metrics such as Predictive Equality and related probability-based scores that quantify fairness effects arising from biased feature relationships (previously mapped as feature-confidence and impact effects). These metrics are derived from prior bias and feature-impact analysis and are reported as probability values (e.g., fairness ratios and true-positive rates), which indicate whether the predictive outputs remain affected by biased correlations. Under BRI, the aggregated-fairness probability results constitute a data-level impact value indicating the likelihood that the model’s predictive outputs are impacted by the previously-identified feature confidence score.); and generating the data evaluation score based on the data impact score (Jesus, paragraph [0087] “In an evaluation, an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric.” [0098] “Initial results show that 1) baselines tend to exhibit better predictive performance but poor fairness…” [0104] “One of the advantages of the disclosed techniques is evaluating ML fairness under different biased patterns in the data, and understanding which types of data bias a given Fair ML (or fairness blind) algorithm is capable of tackling.” – discloses computing probability-based fairness metrics (previously mapped as the data impact score), and then using those fairness-probability values to determine whether the model satisfies a fairness requirement under a defined fairness threshold (e.g., 80%). Jesus further explains that models may have good predictive performance but still exhibit “poor fairness,” and that fairness evaluation is performed across different biased data patterns. Under BRI, using the computed fairness-probability metric (data impact score) to determine whether the model satisfies the fairness condition corresponds to “generating the data evaluation score based on the data impact score,” as recited.). Regarding claim 10, Jesus discloses: The computer-implemented method of claim 1, wherein: (i) the training dataset comprises a plurality of input data objects and a plurality of input features, (ii) the plurality of input features comprises one or more evaluation features, (iii) the plurality of input data objects comprises one or more evaluation data object sets, (iv) each evaluation data object set comprises one or more input data objects that each comprise a particular feature value of an evaluation feature (Jesus, paragraph [0080] “The Fraud dataset contains anonymized tabular data…The dataset contains 1,000,000 rows, split into 750,000 rows for training, and 250,000 rows for testing…The protected attribute is the client age…The train set has 80% of the records belonging to the younger group, with a fraud rate of 1.5%, while the remaining 20% are in the older group, with a fraud rate of 3%.” [0037] “Each of the benchmark's datasets is associated with a specific real-world scenario, and carries specific performance and fairness metrics drawn thereafter.
Fairness metrics can be computed as the widest disparity between the model's performance per group on the relevant class” [0083] “For the first variant (Type 1), an additional synthetic column is appended to the data: a protected attribute with a majority (e.g., represented by 90% of the instances) and a minority group (e.g., group size disparity).” – Under BRI, Jesus teaches a training dataset formed of rows and records, which correspond to a plurality of input data objects, and feature columns, which correspond to a plurality of input features. Jesus further identifies a protected attribute such as client age that is expressly used for fairness evaluation, which corresponds to the evaluation feature. Jesus discloses grouping the dataset records according to the protected attribute value into younger and older groups, which corresponds to evaluation data object sets consisting of multiple input data objects. Each group consists of records sharing the same age value for the protected attribute, which corresponds to each evaluation data object set including data objects having the same value of the evaluation feature.), and (v) the model evaluation score is based on a comparison between at least two of the one or more evaluation data object sets (Jesus, paragraph [0081] “fairness is achieved if the model's false-positive rate is independent of the customer's age group.” [0037] “Each of the benchmark's datasets is associated with a specific real-world scenario, and carries specific performance and fairness metrics drawn thereafter. Fairness metrics can be computed as the widest disparity between the model's performance per group on the relevant class:” [0099] Tables 2-11: report Predictive Equality and Equal Opportunity values comparing groups. – Jesus evaluates fairness by comparing model outcome probabilities between at least two protected-attribute groups, such as younger and older users. The fairness metrics are computed from differences or ratios in model performance across these groups. Under BRI, the fairness score is therefore based on a comparison between multiple evaluation data object sets formed according to the protected attribute, which corresponds to the limitation.). Regarding claim 11, Jesus discloses: The computer-implemented method of claim 10, wherein the one or more performance metrics comprise a first performance metric, a second performance metric, and a third performance metric, wherein generating the model evaluation score comprises: determining the first performance metric based on a selection rate comparison between the at least two evaluation data object sets (Jesus, paragraph [0037] “Each of the benchmark's datasets is associated with a specific real-world scenario, and carries specific performance and fairness metrics drawn thereafter. Fairness metrics can be computed as the widest disparity between the model's performance per group on the relevant class:” [0080] “The train set has 80% of the records belonging to the younger group, with a fraud rate of 1.5%, while the remaining 20% are in the older group, with a fraud rate of 3%.” [0099] Tables 2-11 report group-dependent outcome probability metrics at different selection thresholds (e.g., 5%, 10%, 20%). – Jesus compares model outcome statistics between groups defined by the protected attribute (younger vs. older). These comparisons include differences in the rate at which members of each group are selected into the predicted-positive class at a given threshold.
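The three group comparisons at issue in this claim, the selection-rate comparison just discussed together with the false-positive-rate and false-negative-rate comparisons addressed in the following limitations, can be sketched together. The "widest disparity" aggregation echoes [0037], but every name below is an editorial assumption:

    # Illustrative sketch: per-group selection rate, FPR, and FNR, aggregated
    # by taking the widest disparity across groups. Binary 0/1 predictions.
    def rates(y_true, y_pred):
        sel = sum(y_pred) / len(y_pred)                       # selection rate
        neg = [p for t, p in zip(y_true, y_pred) if t == 0]
        fpr = sum(neg) / len(neg) if neg else 0.0             # false-positive rate
        pos = [p for t, p in zip(y_true, y_pred) if t == 1]
        fnr = sum(1 - p for p in pos) / len(pos) if pos else 0.0  # false-negative rate
        return sel, fpr, fnr

    def model_evaluation_score(groups):
        """groups: {name: (y_true, y_pred)}. Widest per-metric disparity."""
        per_group = [rates(t, p) for t, p in groups.values()]
        disparities = [max(m) - min(m) for m in zip(*per_group)]
        return max(disparities)

    print(model_evaluation_score({
        "younger": ([0, 0, 1, 1], [1, 0, 1, 0]),
        "older":   ([0, 0, 1, 1], [0, 0, 1, 1]),
    }))  # 0.5 -- driven by the FPR and FNR gaps between the two groups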
Under BRI, comparing the probability of positive selection between the protected-attribute groups constitutes determining a first performance metric based on a selection-rate comparison between two evaluation data object sets, as recited.); determining the second performance metric based on a false positive rate comparison between the at least two evaluation data object sets (Jesus, paragraph [0081] “As a punitive task, fairness is achieved if the model's false-positive rate is independent of the customer's age group. This is also known as predictive equality across age groups.” [0037] “Fairness metrics can be computed as the widest disparity between the model's performance per group on the relevant class:” [0099] Tables 2-11 report probability metrics measuring group-dependent predictive outcomes. – Jesus defines fairness in terms of whether the false-positive rate differs between age-based user groups. Thus, the system computes and compares false-positive probabilities for multiple evaluation groups formed by the evaluation feature. Under BRI, this corresponds to determining a second performance metric based on a false-positive rate comparison between the evaluation data object sets, as recited.); determining the third performance metric based on a false negative rate comparison between the at least two evaluation data object sets (Jesus, paragraph [0037] “Each of the benchmark's datasets is associated with a specific real-world scenario, and carries specific performance and fairness metrics drawn thereafter. Fairness metrics can be computed as the widest disparity between the model's performance per group on the relevant class:” [0081] “fairness is achieved if the model's false-positive rate is independent of the customer's age group.” [0099] Tables 2-11 report class-probability performance including Global TPR at multiple thresholds. – Jesus computes model performance separately for multiple evaluation groups formed according to the protected attribute and compares those performance probabilities across groups to measure fairness disparity. The reported group-performance measures include true-positive probability (TPR), which is the complement of false-negative probability. Therefore, a comparison of TPR values across protected-attribute groups inherently corresponds to a comparison of false-negative rates across the same evaluation data object sets. Under BRI, this meets the limitation of determining a third performance metric based on a false-negative-rate comparison between at least two evaluation data object sets.); and generating the model evaluation score based on an aggregation of the first performance metric, the second performance metric, and the third performance metric (Jesus, paragraph [0037] “Fairness metrics can be computed as the widest disparity between the model's performance per group on the relevant class:” [0087] “In an evaluation, an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric.
In various embodiments, if no model is found to be fair for some algorithm, this result is output along with a model found to mostly closely the fairness criterion/criteria.” [0098] “Considering a standardized set of fairness metrics, different hyperparameter searches were performed for eight different ML algorithms, including both commonly used algorithms such as logistic regression, LightGBM, and neural networks, and also models in fair ML” [0099] Tables 2-11 report fairness and class-performance probabilities together for each model and dataset. – Jesus computes multiple group-based performance metrics, including selection behavior and class-error probabilities, and then combines or evaluates those metrics together to produce an overall fairness determination expressed as a fairness metric subject to a fairness threshold. Under BRI, the fairness metric reported in Jesus constitutes a model evaluation score derived from an aggregation of the previously-computed performance metrics (including selection-rate, false-positive-rate, and false-negative-rate comparisons), as recited.). Regarding claim 14, Jesus discloses: A computing apparatus comprising memory and one or more processors communicatively coupled to the memory, the one or more processors configured to: generate a holistic evaluation vector for a target machine learning model based on a plurality of evaluation scores for the target machine learning model, wherein the plurality of evaluation scores comprises (Jesus, paragraph [0014] “The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor.” [0020] “Evaluating both fairness and predictive accuracy is typically practiced when introducing novel algorithms, methods, or metrics for bias mitigation.” – describes a computer-implemented machine-learning system executed using computer processors. Jesus further explains that each machine-learning model is evaluated using multiple numerical metrics, including fairness and predictive-accuracy measures. Under BRI, a “vector” reasonably encompasses a collection of numerical evaluation values. Therefore, the multiple evaluation metrics generated for a model correspond to a holistic evaluation vector comprising a plurality of evaluation scores for the target machine-learning model.): (i) a data evaluation score corresponding to a training dataset for the target machine learning model (Jesus, paragraph [0017] “Evaluation remains an obstacle to progress in fair ML because of a lack of consistent, well-established, and systematic evaluation of fairness; and scarcity of realistic, large tabular datasets for algorithmic decision-making; among other things.” [0029] “It may be relevant to evaluate models and bias mitigation techniques beyond the bias that naturally occur in datasets (e.g., to artificially inject predefined types of bias into the dataset).
This provides fine-grained control over experiments and increases the overall robustness of a benchmark.” [0030]-[0031] “There are several definitions of bias in data…three different types of bias related to a given protected attribute can be defined as: (i) group size disparities, (ii) prevalence disparities, and (iii) distinct conditional class separability.” – describes computing bias metrics on the dataset itself, including group size disparity, prevalence disparity, and conditional class separability. These metrics evaluate characteristics of the training dataset related to fairness. Under BRI, a “data evaluation score” reasonably includes any quantitative fairness or bias metric computed from a dataset.), (ii) a model evaluation score corresponding to one or more performance metrics for the target machine learning model (Jesus, paragraph [0020] “Evaluating both fairness and predictive accuracy is typically practiced when introducing novel algorithms, methods, or metrics for bias mitigation.” [0022] “reports of model performance generally refer to a single operating point, i.e., a single threshold…” [0087] “In an evaluation, an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric.” – describes evaluating machine learning models using numerical metrics including predictive-accuracy measures and fairness scores. Jesus further explains that model performance is reported at defined operating thresholds and that fairness performance may be expressed as a numerical score such as an 80% fairness threshold. These fairness and predictive accuracy metrics constitute numerical performance measures describing how the model performs. Under the broadest reasonable interpretation, such numerical model-performance metrics correspond to a “model evaluation score corresponding to one or more performance metrics for the target machine learning model,” as recited.), and (iii) a decision evaluation score corresponding to an output class of the target machine learning model (Jesus, paragraph [0038] “In assistive settings, a positive prediction is related to a positive outcome for the individual (e.g., funding for their project). As such, fairness is achieved by maximizing Equation 1 for the positive class y=1 (ratio of true positive rates). This fairness metric is also known as equal opportunity. Conversely, in punitive settings, a positive prediction is related to a negative outcome for the individual (e.g., losing access to their bank account for being flagged as fraudulent). In these cases, fairness is achieved by maximizing Equation 1 for the negative class y=0 (ratio of false positive rates). This fairness metric is also known as predictive equality, or equal opportunity with reference to y=0” – describes computing fairness metrics that are explicitly conditioned on the model’s predicted output class. For example, in assistive settings, fairness is evaluated based on the true-positive rate for the positive output class, and in punitive settings fairness is evaluated based on the false-positive rate for the negative output class. These metrics represent numerical scores that depend on which output class is being predicted.
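A short sketch of such class-conditioned scoring follows, assuming binary 0/1 predictions; for y=1 the per-group ratio plays the role of equal opportunity (a TPR ratio), and for y=0 the role of predictive equality (an FPR ratio). All names are editorial:

    # Illustrative sketch: a fairness ratio conditioned on an output class.
    def rate(y_true, y_pred, on_class):
        rows = [p for t, p in zip(y_true, y_pred) if t == on_class]
        return sum(rows) / len(rows) if rows else 0.0

    def class_conditioned_score(y_true, y_pred, groups, on_class):
        """Ratio of per-group rates on `on_class`; 1.0 is perfectly fair."""
        per_group = []
        for g in sorted(set(groups)):
            idx = [i for i, gi in enumerate(groups) if gi == g]
            per_group.append(rate([y_true[i] for i in idx],
                                  [y_pred[i] for i in idx], on_class))
        return min(per_group) / max(per_group) if max(per_group) else 1.0

    # Equal opportunity (positive class): group "a" has TPR 1.0, "b" has 0.0.
    print(class_conditioned_score([1, 0, 1, 0], [1, 0, 0, 0],
                                  ["a", "a", "b", "b"], on_class=1))  # 0.0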
Under BRI, a numerical fairness metric that is computed for a particular predicted output class constitutes a “decision evaluation score corresponding to an output class of the target machine learning model.”); generate a holistic evaluation score for the target machine learning model based on an aggregation of the holistic evaluation vector; and provide an evaluation output for the target machine learning model based on the holistic evaluation score (Jesus, paragraph [0087] “Experimental results show that the disclosed techniques perform well. In an evaluation, an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric. In various embodiments, if no model is found to be fair for some algorithm, this result is output along with a model found to mostly closely the fairness criterion/criteria.” [0088] “Globally, it is noticeable that conventional classification algorithms show general good predictive accuracy but poor fairness.” [0089] “When unaware algorithms satisfy the fairness threshold, the TPR measurement is relatively low (<20%), which may constitute a steep fairness-performance trade-off.” – describes evaluating machine learning models using multiple numerical metrics, including fairness scores and predictive-accuracy performance. Jesus explains that a model is determined to be fair if it exceeds a fairness threshold and that models may also exhibit differing accuracy performance characteristics. In some cases, these metrics are jointly considered, such as where satisfaction of a fairness threshold is balanced against true-positive-rate performance. Under BRI, a collection of multiple evaluation scores constitutes a “vector” and jointly considering these scores to determine whether a model satisfies fairness and performance criteria constitutes generating a “holistic evaluation score based on an aggregation of the holistic evaluation vector.” Jesus further explains that the result of this evaluation is output, including whether a model satisfies the fairness criteria or which model most closely approaches them, corresponding to “providing…an evaluation output for the target machine learning model based on the holistic evaluation score.”) Regarding claim 15, Jesus teaches all the elements of claim 14; the claim is therefore rejected for the same reasons as those presented for claim 14. The claim recites similar limitations corresponding to claim 2 and is rejected for similar reasons as claim 2 using similar teachings and rationale. Regarding claim 16, Jesus teaches all the elements of claim 15; the claim is therefore rejected for the same reasons as those presented for claim 15. The claim recites similar limitations corresponding to claim 3 and is rejected for similar reasons as claim 3 using similar teachings and rationale. Regarding claim 17, Jesus teaches all the elements of claim 16; the claim is therefore rejected for the same reasons as those presented for claim 16. The claim recites similar limitations corresponding to claim 4 and is rejected for similar reasons as claim 4 using similar teachings and rationale. Regarding claim 18, Jesus teaches all the elements of claim 14; the claim is therefore rejected for the same reasons as those presented for claim 14. The claim recites similar limitations corresponding to claim 12 and is rejected for similar reasons as claim 12 using similar teachings and rationale.
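For the claimed structure itself (not the Jesus reference), a holistic evaluation vector and its aggregation can be sketched as follows; the equal weighting and the evaluation_output helper are assumptions made only for illustration:

    # Illustrative sketch: a (data, model, decision) score vector, a weighted
    # aggregate, and a threshold-based evaluation output echoing [0087].
    from dataclasses import dataclass

    @dataclass
    class HolisticEvaluation:
        data_score: float      # e.g., dataset bias/fairness metric
        model_score: float     # e.g., predictive-performance metric
        decision_score: float  # e.g., class-conditioned fairness metric

        def vector(self):
            return (self.data_score, self.model_score, self.decision_score)

        def holistic_score(self, weights=(1/3, 1/3, 1/3)):
            return sum(w * s for w, s in zip(weights, self.vector()))

    def evaluation_output(score, threshold=0.80):
        # Report fair/unfair relative to a fairness threshold.
        return {"holistic_score": score, "fair": score > threshold}

    ev = HolisticEvaluation(0.85, 0.72, 0.90)
    print(evaluation_output(round(ev.holistic_score(), 3)))
    # -> {'holistic_score': 0.823, 'fair': True}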
Regarding claim 19, Jesus discloses: One or more non-transitory computer-readable storage media including instructions that, when executed by one or more processors, cause the one or more processors to (Jesus, paragraph [0014] “including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor.” – discloses a processor-based computer system in which software instructions are stored on a computer-readable medium and executed by one or more processors. Under BRI, such stored program instructions necessarily reside on a physical storage device (e.g., memory or disk), and therefore correspond to the non-transitory computer-readable storage media.): generate a holistic evaluation vector for a target machine learning model based on a plurality of evaluation scores for the target machine learning model, wherein the plurality of evaluation scores comprises (Jesus, paragraph [0020] “Evaluating both fairness and predictive accuracy is typically practiced when introducing novel algorithms, methods, or metrics for bias mitigation.” – describes a computer-implemented machine-learning system executed using computer processors. Jesus further explains that each machine-learning model is evaluated using multiple numerical metrics, including fairness and predictive-accuracy measures. Under BRI, a “vector” reasonably encompasses a collection of numerical evaluation values. Therefore, the multiple evaluation metrics generated for a model correspond to a holistic evaluation vector comprising a plurality of evaluation scores for the target machine-learning model.): (i) a data evaluation score corresponding to a training dataset for the target machine learning model (Jesus, paragraph [0017] “Evaluation remains an obstacle to progress in fair ML because of a lack of consistent, well-established, and systematic evaluation of fairness; and scarcity of realistic, large tabular datasets for algorithmic decision-making; among other things.” [0029] “It may be relevant to evaluate models and bias mitigation techniques beyond the bias that naturally occur in datasets (e.g., to artificially inject predefined types of bias into the dataset). This provides fine-grained control over experiments and increases the overall robustness of a benchmark.” [0030]-[0031] “There are several definitions of bias in data…three different types of bias related to a given protected attribute can be defined as: (i) group size disparities, (ii) prevalence disparities, and (iii) distinct conditional class separability.” – describes computing bias metrics on the dataset itself, including group size disparity, prevalence disparity, and conditional class separability. These metrics evaluate characteristics of the training dataset related to fairness.
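Two of the dataset-level disparities named above can be sketched directly; the data values and the helper name are invented for illustration:

    # Illustrative sketch: group-size disparity (group shares vs. the uniform
    # 1/N share) and prevalence disparity (P[Y=1|A=a] vs. P[Y=1]).
    from collections import Counter

    def disparities(groups, labels):
        n_groups = len(set(groups))
        share = {g: c / len(groups) for g, c in Counter(groups).items()}
        size_disp = {g: s - 1 / n_groups for g, s in share.items()}
        base = sum(labels) / len(labels)          # overall positive rate P[Y=1]
        prev_disp = {
            g: sum(l for gi, l in zip(groups, labels) if gi == g)
               / groups.count(g) - base
            for g in share
        }
        return size_disp, prev_disp

    groups = ["younger"] * 8 + ["older"] * 2
    labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
    print(disparities(groups, labels))
    # size: younger +0.3, older -0.3; prevalence: younger -0.175, older +0.7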
Under BRI, a “data evaluation score” reasonably includes any quantitative fairness or bias metric computed from a dataset.), (ii) a model evaluation score corresponding to one or more performance metrics for the target machine learning model (Jesus, paragraph [0020] “Evaluating both fairness and predictive accuracy is typically practiced when introducing novel algorithms, methods, or metrics for bias mitigation.” [0022] “reports of model performance generally refer to a single operating point, i.e., a single threshold…” [0087] “In an evaluation, an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric.” – describes evaluating machine learning models using numerical metrics including predictive-accuracy measures and fairness scores. Jesus further explains that model performance is reported at defined operating thresholds and that fairness performance may be expressed as a numerical score such as an 80% fairness threshold. These fairness and predictive accuracy metrics constitute numerical performance measures describing how the model performs. Under the broadest reasonable interpretation, such numerical model-performance metrics correspond to a “model evaluation score corresponding to one or more performance metrics for the target machine learning model,” as recited.), and (iii) a decision evaluation score corresponding to an output class of the target machine learning model (Jesus, paragraph [0038] “In assistive settings, a positive prediction is related to a positive outcome for the individual (e.g., funding for their project). As such, fairness is achieved by maximizing Equation 1 for the positive class y=1 (ratio of true positive rates). This fairness metric is also known as equal opportunity. Conversely, in punitive settings, a positive prediction is related to a negative outcome for the individual (e.g., losing access to their bank account for being flagged as fraudulent). In these cases, fairness is achieved by maximizing Equation 1 for the negative class y=0 (ratio of false positive rates). This fairness metric is also known as predictive equality, or equal opportunity with reference to y=0” – describes computing fairness metrics that are explicitly conditioned on the model’s predicted output class. For example, in assistive settings, fairness is evaluated based on the true-positive rate for the positive output class, and in punitive settings fairness is evaluated based on the false-positive rate for the negative output class. These metrics represent numerical scores that depend on which output class is being predicted. Under BRI, a numerical fairness metric that is computed for a particular predicted output class constitutes a “decision evaluation score corresponding to an output class of the target machine learning model.”); generate a holistic evaluation score for the target machine learning model based on an aggregation of the holistic evaluation vector; and provide an evaluation output for the target machine learning model based on the holistic evaluation score (Jesus, paragraph [0087] “Experimental results show that the disclosed techniques perform well. In an evaluation, an 80% fairness threshold was used, meaning an apparatus, system, or process is considered to be fair if it scores higher than 80% in the fairness metric.
In various embodiments, if no model is found to be fair for some algorithm, this result is output along with a model found to mostly closely the fairness criterion/criteria.” [0088] “Globally, it is noticeable that conventional classification algorithms show general good predictive accuracy but poor fairness.” [0089] “When unaware algorithms satisfy the fairness threshold, the TPR measurement is relatively low (<20%), which may constitute a steep fairness-performance trade-off.” – describes evaluating machine learning models using multiple numerical metrics, including fairness scores and predictive-accuracy performance. Jesus explains that a model is determined to be fair if it exceeds a fairness threshold and that models may also exhibit differing accuracy performance characteristics. In some cases, these metrics are jointly considered, such as where satisfaction of a fairness threshold is balanced against true-positive-rate performance. Under BRI, a collection of multiple evaluation scores constitutes a “vector” and jointly considering these scores to determine whether a model satisfies fairness and performance criteria constitutes generating a “holistic evaluation score based on an aggregation of the holistic evaluation vector.” Jesus further explains that the result of this evaluation is output, including whether a model satisfies the fairness criteria or which model most closely approaches them, corresponding to “providing…an evaluation output for the target machine learning model based on the holistic evaluation score.”). Regarding claim 20, Jesus teaches all the elements of claim 19; the claim is therefore rejected for the same reasons as those presented for claim 19. The claim recites similar limitations corresponding to claim 10 and is rejected for similar reasons as claim 10 using similar teachings and rationale. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 12 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Jesus et al., (Pub. No.: US 20230074606 A1 (Filed: June 2022)) in view of Mehrabi et al., (NPL: “A Survey on Bias and Fairness in Machine Learning” (Published: 2021)). Regarding claim 12, the rejection of claim 1 is incorporated herein.
Jesus further teaches the limitations: (i) the target machine learning model is previously trained to generate a plurality of predictive outputs for a plurality of input data objects (Jesus, paragraph [0044] “Dataset generator 152 is configured to receive an input dataset 140. The input dataset, sometimes called a seed dataset, is processed by the dataset generator according to the disclosed techniques to generate an evaluation dataset.” [0081] “fairness is achieved if the model's false-positive rate is independent of the customer's age group.” [0099] Tables 2-11 report model prediction results such as Global TPR and Predictive Equality for trained ML models including Logistic Regression, Random Forest, LightGBM, XGBoost, Neural Networks, and others. – discloses training multiple machine-learning models on historical datasets and using the trained models to generate predicted outcomes for each dataset record. The reported fairness and performance metrics are computed from the predicted classification outputs of the trained model across many input instances. Under BRI, this corresponds to a previously trained target machine-learning model generating a plurality of predictive outputs for a plurality of input data objects, as recited.), (ii) each of the plurality of predictive outputs correspond to a positive output class or a negative output class (Jesus, paragraph [0038] “In assistive settings, a positive prediction is related to a positive outcome for the individual (e.g., funding for their project). As such, fairness is achieved by maximizing Equation 1 for the positive class y=1 (ratio of true positive rates). This fairness metric is also known as equal opportunity. Conversely, in punitive settings, a positive prediction is related to a negative outcome for the individual (e.g., losing access to their bank account for being flagged as fraudulent). In these cases, fairness is achieved by maximizing Equation 1 for the negative class y=0 (ratio of false positive rates). This fairness metric is also known as predictive equality, or equal opportunity with reference to y=0.” – Jesus discloses that the machine-learning classifier outputs predictions associated with either a positive class (y=1) or a negative class (y=0), and fairness metrics such as equal opportunity and predictive equality are computed based on those positive and negative class outcomes. Under BRI, this corresponds to each predictive output of the trained model belonging to either a positive output class or a negative output class, as recited.). However, Jesus does not teach, but Jesus in view of Mehrabi teaches, the following limitation: (iii) the decision evaluation score is based on one or more counterfactual proposals for one or more of the plurality of predictive outputs that correspond to the negative output class (Jesus, [0038] “in punitive settings, a positive prediction is related to a negative outcome for the individual (e.g., losing access to their bank account for being flagged as fraudulent). In these cases, fairness is achieved by maximizing Equation 1 for the negative class y=0 (ratio of false positive rates).
This fairness metric is also known as predictive equality, or equal opportunity with reference to y=0.” Mehrabi, [page 13, section 4.1] “Predictor Ŷ is counterfactually fair if under any context… The counterfactual fairness definition is based on the “intuition that a decision is fair towards an individual if it is the same in both the actual world and a counterfactual world where the individual belonged to a different demographic group.” [pages 19-20] “authors introduce the path-specific counterfactual fairness definition… authors only target discrimination discovery and no removal by finding instances similar to another instance and observing if a change in the protected attribute will change the outcome of the decision. If so, then they declare the existence of discrimination.” – Jesus discloses that the trained classifier produces negative-class predictions and evaluates fairness for those adverse decisions. Mehrabi discloses counterfactual fairness analysis in which alternative counterfactual instances are generated by modifying a protected attribute to determine whether the predicted outcome would change. These counterfactual instances constitute counterfactual proposals applied to predicted outcomes, including negative-class decisions. Under BRI, evaluating whether a negative-class decision changes in response to a counterfactual proposal produces a fairness determination that corresponds to a decision evaluation score based on counterfactual proposals for predictive outputs that correspond to the negative output class, as recited.). Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, having a combination of Jesus and Mehrabi before them, to incorporate the use of counterfactual proposals, as taught by Mehrabi, into the fairness-evaluation framework of Jesus. One would have been motivated to make such a combination in order to more rigorously evaluate whether negative-class machine learning decisions are influenced by biased feature relationships, by determining whether the predicted outcome would change under an alternative counterfactual version of the same input. This would allow a model-evaluation system not only to measure fairness using performance-based statistics but also to detect whether individual negative-class outputs remain stable under counterfactual changes, thereby improving the reliability and completeness of fairness auditing for machine-learning models. Regarding claim 13, the rejection of claim 12 is incorporated herein. Jesus does not teach, but Jesus in view of Mehrabi teaches, the following limitations: wherein the plurality of input data objects is associated with one or more evaluation features, and wherein the computer-implemented method further comprises: identifying, from the one or more counterfactual proposals, an evaluation counterfactual proposal that comprises an evaluation feature of the one or more evaluation features (Jesus, paragraph [0065] “…A is a given categorical feature that can be used to control the value, a is a possible value for the feature, and P_a the probability of value a on feature A. Thus, in the value function given by Equation 2, there is a term to control prevalence of one or more groups.” [0038] “in punitive settings, a positive prediction is related to a negative outcome for the individual (e.g., losing access to their bank account for being flagged as fraudulent).
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Daravanh Phakousonh, whose telephone number is (571) 272-6324. The examiner can normally be reached Mon-Thurs 7 AM - 5 PM, every other Friday 7 AM - 4 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Li B. Zhen, can be reached at 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Daravanh Phakousonh/
Examiner, Art Unit 2121

/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121

Prosecution Timeline

Mar 03, 2023
Application Filed
Jan 06, 2026
Non-Final Rejection — §101, §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12572821
ACCURACY PRIOR AND DIVERSITY PRIOR BASED FUTURE PREDICTION
2y 5m to grant · Granted Mar 10, 2026
Study what changed to get past this examiner. Based on the 1 most recent grant.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 50%
With Interview: 99% (+100.0%)
Median Time to Grant: 4y 0m
PTA Risk: Low
Based on 2 resolved cases by this examiner. Grant probability derived from career allow rate.
