Machine Learning Classification Models: Exploring the Fairness-Accuracy Trade-Off

August 11, 2023
Authored by
Franklin Cardenoso Fernandez
Researcher at Holistic AI

In the age of artificial intelligence, classification models have become a crucial area of research as their role in decision-making processes has soared.

Although accuracy has been the main objective for most researchers over the years, another group is seeking to address the bias problem present in these models by developing different methods to mitigate the issue.

We now have the tools and the literature that enable us to ask crucial questions – how do we analyse the fairness-accuracy trade-off in machine learning (ML) models? And how do we balance accuracy while ensuring equitable treatment of all groups?

In this blog, we will explore the fairness-accuracy trade-off analysis for classification models, examining its significance with a pertinent case study.

The fairness-accuracy trade-off

Most efforts in this area have been focused on developing methods – typically categorised as pre-, in-, or post-processing algorithms – to mitigate the bias that can infiltrate ML models at different stages. However, this has raised a new issue: the cost of an increase or decrease in fairness relative to other metrics, such as accuracy.

To address this gap, the University of Nebraska's Christian Haas has proposed an interesting framework to explore the trade-off, allowing a systematic comparison of different techniques to increase fairness and determine which is the most suitable for the task at hand.

The key concept of the proposed framework is to combine multi-objective optimisation, the Pareto front approach, and concrete fairness and accuracy metrics computed on the models’ predictions, in order to analyse the trade-off and determine the “best” approach for a given scenario.

To enable this analysis, the framework consists of five separate stages:

  • Dataset and protected attributes preprocessing
  • Metrics and objectives definition
  • Classification models selection
  • Pareto front calculation
  • Selection of best model

We will detail every stage in this blog post.

Case study implementation

To put everything into practice, we will perform a case study by following the different stages of the framework, using them to perform a fairness-accuracy analysis.

For the computational implementation we will use a Python environment with the following packages:

  • Multi-objective optimisation: Distributed Evolutionary Algorithms in Python (DEAP) package
  • Classification model and accuracy metric implementation: Scikit-learn package
  • Fairness metrics and mitigators implementation: Holisticai package

We will start with our implementation. First of all, we will import the required packages:

import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from deap import algorithms, base, creator, tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Dataset and protected attributes preprocessing

For our analysis, we will use the well-known “Adult dataset” from the UCI Machine Learning Repository, a publicly available dataset containing information about the age, education, marital status, race and gender of individuals from the United States. The objective is to predict whether an individual's income will be above or below $50K per year. The protected attribute we will use in this instance is the “Sex” feature.

This dataset can be easily downloaded and imported from the holisticai package by running the following lines:

from holisticai.datasets import load_adult 

data = load_adult() 

Next, we must preprocess and format the data. This can be done by passing the combined dataframe to the preprocess_adult_dataset helper function:

df = pd.concat([data["data"], data["target"]], axis=1)
df_clean, group_a, group_b = preprocess_adult_dataset(df)
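The body of preprocess_adult_dataset is not shown in this post. As a rough sketch of what such a helper might do, the version below assumes the protected attribute is a column named sex with the values Female and Male, and that the label is a column named class whose positive value is >50K. Those column names and values, and every step in the body, are assumptions made for illustration rather than the helper actually used here.

```python
import pandas as pd


def preprocess_adult_dataset(df, protected="sex", target="class"):
    """Hypothetical stand-in for the preprocessing helper (all names assumed)."""
    # Drop rows with missing values so every later step sees complete records.
    df_clean = df.dropna().copy()

    # Binary membership vectors for the two protected groups.
    group_a = (df_clean[protected] == "Female").to_numpy().astype(int)
    group_b = (df_clean[protected] == "Male").to_numpy().astype(int)

    # Encode the label as 0/1 and one-hot encode the categorical features.
    y = (df_clean[target] == ">50K").astype(int)
    features = pd.get_dummies(df_clean.drop(columns=[protected, target]), dtype=float)

    # Keep the label as the last column so that iloc[:, :-1] selects the features.
    df_clean = pd.concat([features, y.rename(target)], axis=1)
    return df_clean, group_a, group_b


# Toy data standing in for the real dataset.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "education": ["Bachelors", "HS-grad", None, "Masters"],
    "sex": ["Male", "Female", "Male", "Female"],
    "class": ["<=50K", ">50K", "<=50K", ">50K"],
})
toy_clean, toy_group_a, toy_group_b = preprocess_adult_dataset(toy)
print(toy_clean.columns.tolist(), toy_group_a, toy_group_b)
```

Placing the label in the last column matches the way the input/output sets are sliced in the next step.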



Now that we have our dataset and protected groups, we will define the input/output sets.

# Features are every column except the last; the label is the last column
X = df_clean.iloc[:, :-1].values
y = df_clean.iloc[:, -1].values

# Split features, labels and group membership vectors with the same shuffle
(X_train, X_test, y_train, y_test,
 group_a_tr, group_a_ts, group_b_tr, group_b_ts) = train_test_split(
    X, y, group_a, group_b, test_size=0.2, random_state=42)

train_data = X_train, y_train, group_a_tr, group_b_tr
test_data = X_test, y_test, group_a_ts, group_b_ts

Metrics and objectives definition

Once we have preprocessed the dataset and the protected groups, we must determine the metrics we will use to define the objective function for the optimisation.

Given that our purpose is to perform a fairness-accuracy analysis, we must select an accuracy metric. Although the simplest decision would be to select the accuracy score, we will instead consider the ROC AUC (area under the receiver operating characteristic curve) metric for this analysis. Why? Because it reflects the relationship between the true positive rate and the false positive rate, and so accounts for performance on both classes, while the accuracy score only indicates the percentage of correct predictions.
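To make the difference concrete, the sketch below computes ROC AUC from its pairwise definition: the probability that a randomly chosen positive is ranked above a randomly chosen negative, with ties counted as one half. The data and the helper name pairwise_auc are made up for illustration. It also shows a detail worth keeping in mind for the evaluate function later in this post, which scores hard class predictions: on 0/1 predictions, ROC AUC reduces to the average of the true positive rate and the true negative rate.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores for an imbalanced toy problem.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
labels = [1 if s >= 0.5 else 0 for s in scores]  # hard predictions at threshold 0.5

accuracy = sum(t == p for t, p in zip(y_true, labels)) / len(y_true)
tpr = sum(t == 1 and p == 1 for t, p in zip(y_true, labels)) / y_true.count(1)
tnr = sum(t == 0 and p == 0 for t, p in zip(y_true, labels)) / y_true.count(0)

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, labels)

print(f"accuracy={accuracy:.3f}")
print(f"AUC on scores={auc_from_scores:.3f}")
print(f"AUC on hard labels={auc_from_labels:.3f}, (TPR + TNR) / 2={(tpr + tnr) / 2:.3f}")
```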

We also need to select a fairness metric. In the literature, we can find different metrics, such as disparate impact, statistical parity, equality of opportunity, and so on. For this case study, we will select the statistical parity metric since it computes the difference in success rates between the protected groups and, for our purposes, it is easier to optimise – as we will see later.

from sklearn import metrics
from holisticai.bias.metrics import statistical_parity

Classification model selection

Model selection varies according to the task at hand. Given that this is a binary classification problem, we will – in the name of simplicity – choose the logistic regression (LR) model for this analysis.

In addition, to observe the effect of the mitigation on the model, we will consider three approaches for the same model: without bias mitigation, a preprocessing technique, and a post-processing technique for bias mitigation.

Specifically, because of their fast processing and good results, we will implement the Correlation Remover and the Calibrated Equalized Odds methods.

from sklearn.linear_model import LogisticRegression
from holisticai.bias.mitigation import CorrelationRemover  # Preprocessing technique
from holisticai.bias.mitigation import CalibratedEqualizedOdds  # Postprocessing technique

Pareto front calculation

With all the components imported, we can start the Pareto front calculation. We will determine these fronts through multi-objective optimisation with genetic algorithms (GA), where the two objectives are to increase accuracy and to increase fairness. GA is a well-suited tool for exploring Pareto dominance, as it evaluates different parameter combinations and propagates efficient solutions.

This powerful tool is based on the basic concept of biological evolution: it applies the natural processes of mutation, crossover, reproduction and selection to increase the fitness of candidate solutions under an objective function. Taking this into consideration, we will apply this algorithm to tune the hyperparameters of the classification model and observe the fairness-accuracy trade-off. This part of the implementation is inspired by an impressive tutorial on how to perform hyperparameter optimisation with GA.

First, we will define the chromosomes of the GA to perform the parameter tuning. The scikit-learn documentation lists the different parameters that can be set for the logistic regression model. To reduce complexity and computational effort, we leave some of the parameters at their defaults – for example, the penalty, as some penalties do not work with some solvers.

Taking this into consideration, we will use the following parameters: inverse of regularisation strength, the solver used for optimisation, and the maximum number of iterations taken for the solvers.

With these elements defined, we can create the chromosome by defining, for each parameter, the method used to choose a value and the range of values.

# Maximise both fitness values (ROC AUC and fairness)
creator.create("FitnessMax", base.Fitness, weights=(1.0, 1.0))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()

# Possible parameter values
c_lower_value, c_upper_value = 1/(2**16), 2**16
lower_max_iter, upper_max_iter = 2, 100
solvers = ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']

# One generator per gene: solver, C and max_iter
toolbox.register("attr_solver", random.choice, solvers)
toolbox.register("attr_c_param", random.uniform, c_lower_value, c_upper_value)
toolbox.register("attr_max_iter", random.randint, lower_max_iter, upper_max_iter)

# Assumed value: one cycle, so each individual holds a single
# (solver, C, max_iter) triple, matching the gene indices used below
N_CYCLES = 1

toolbox.register("individual", tools.initCycle, creator.Individual,
                 (toolbox.attr_solver, toolbox.attr_c_param, toolbox.attr_max_iter),
                 n=N_CYCLES)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
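One thing to keep in mind with the range above: drawing C uniformly between 2^-16 and 2^16 puts almost every sample above 1, so weak regularisation is explored far more often than strong regularisation. A common alternative, shown here as an optional sketch rather than as part of the implementation in this post, is to sample the exponent uniformly instead. The helper name sample_c_log_uniform is hypothetical.

```python
import random

C_LOWER_EXP, C_UPPER_EXP = -16, 16  # C ranges over [2**-16, 2**16]


def sample_c_log_uniform(rng=random):
    """Draw C so that every power-of-two band is equally likely (hypothetical helper)."""
    return 2.0 ** rng.uniform(C_LOWER_EXP, C_UPPER_EXP)


rng = random.Random(0)
n_draws = 1000
uniform_draws = [rng.uniform(2.0 ** C_LOWER_EXP, 2.0 ** C_UPPER_EXP) for _ in range(n_draws)]
log_draws = [sample_c_log_uniform(rng) for _ in range(n_draws)]

# Share of sampled C values that fall below 1 under each scheme.
share_below_one_uniform = sum(c < 1 for c in uniform_draws) / n_draws
share_below_one_log = sum(c < 1 for c in log_draws) / n_draws
print(share_below_one_uniform, share_below_one_log)
```

If this were adopted, the same sampler would need to be used both when registering attr_c_param and inside the mutation function, so that initial and mutated values come from the same distribution.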

Now we must define the mutation function, which picks one gene at random and draws a new value for it according to that gene's type. This can be achieved by running the following code:

def mutate(individual):
    '''This function randomly selects a gene and randomly generates
    a new value for it based on a set of rules'''
    gene = random.randint(0, 2)
    if gene == 0:
        # Pick a different solver from the one currently in use
        solvers_complement = [s for s in solvers if s != individual[gene]]
        individual[gene] = random.choice(solvers_complement)
    elif gene == 1:
        individual[gene] = random.uniform(c_lower_value, c_upper_value)
    elif gene == 2:
        individual[gene] = random.randint(lower_max_iter, upper_max_iter)
    # DEAP expects mutation operators to return a tuple of individuals
    return individual,

Next, we will define our objective function. Since our objective is to maximise the accuracy performance of a model while simultaneously maximising its fairness, we need to use the parameters encoded in an individual to fit the model, evaluate the model with the selected metrics, and use the resulting fitness scores to evolve the population.

def evaluate(individual):
    '''Build and test a model based on the parameters in
    an individual and return the ROC AUC and the fairness value'''
    # Extract the values of the parameters from the individual chromosome
    solver = individual[0]
    c = individual[1]
    max_iter = individual[2]

    # Train the model on standardised training features
    X, y, group_a, group_b = train_data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    model = LogisticRegression(solver=solver, C=c, max_iter=max_iter).fit(X, y)

    # Predict on the test split, reusing the scaler fitted on the training data
    X, y, group_a, group_b = test_data
    X = scaler.transform(X)
    y_pred = model.predict(X)

    # Calculate the metrics
    roc_auc = metrics.roc_auc_score(y, y_pred)
    sp = statistical_parity(group_a, group_b, y_pred)
    fairness = 1 - abs(sp)

    return roc_auc, fairness

Taking this function as a baseline, we implement two similar ones by adding the pre- and post-processing mitigators in the same way.

Notice that we have slightly modified the fairness calculation. Statistical parity measures a difference, so it can take negative values; we therefore take its absolute value. Moreover, to frame the task as a maximisation problem, we subtract this value from 1, so that a value of 1 represents a perfectly “fair” model.

fairness=1−|statistical parity|
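As a small worked example of this score, the sketch below computes the statistical parity difference by hand on made-up predictions and then converts it into the fairness value. It assumes the sign convention "success rate of group_a minus success rate of group_b"; because of the absolute value, the resulting fairness score does not depend on that choice. The function name statistical_parity_difference is illustrative and is not the holisticai API.

```python
def statistical_parity_difference(group_a, group_b, y_pred):
    """Success rate of group_a minus success rate of group_b (assumed convention)."""
    rate_a = sum(p for g, p in zip(group_a, y_pred) if g) / sum(group_a)
    rate_b = sum(p for g, p in zip(group_b, y_pred) if g) / sum(group_b)
    return rate_a - rate_b


# Made-up membership vectors and predictions for eight individuals.
group_a = [1, 1, 1, 1, 0, 0, 0, 0]
group_b = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

sp = statistical_parity_difference(group_a, group_b, y_pred)  # 0.25 - 0.75
fairness = 1 - abs(sp)
print(sp, fairness)  # -0.5 0.5
```

Here one in four members of group_a receives a positive prediction against three in four members of group_b, so the score of 0.5 sits well below the ideal value of 1.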

Once we have defined all the objective functions, it is time to put everything together. To do this, we define the GA parameters and the evolutionary process, running it with an initial set of individuals that will evolve over a number of generations.
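The shape of such an evolutionary process can be sketched without any dependencies. The example below is a simplified stand-in, not the DEAP configuration used for the results in this post: toy_evaluate and toy_mutate are hypothetical placeholders for the evaluate and mutate functions, the single gene is a made-up parameter, and survival uses a plain non-dominated filter rather than a full selection scheme. In DEAP, one common choice for this step is algorithms.eaMuPlusLambda with tools.selNSGA2 registered as the selection operator.

```python
import random


def dominates(a, b):
    """True if fitness a is no worse than b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(scored):
    """Keep the (individual, fitness) pairs that no other pair dominates."""
    return [p for p in scored if not any(dominates(q[1], p[1]) for q in scored)]


def toy_evaluate(individual):
    """Hypothetical stand-in for evaluate(): two objectives that conflict on [0, 1]."""
    x = individual[0]
    return 1.0 - x ** 2, 1.0 - (x - 1.0) ** 2


def toy_mutate(individual, rng):
    """Perturb the single gene with Gaussian noise, clipped to [-1, 2]."""
    return [min(2.0, max(-1.0, individual[0] + rng.gauss(0.0, 0.1)))]


def evolve(pop_size=20, n_generations=15, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 2.0)] for _ in range(pop_size)]
    for _ in range(n_generations):
        offspring = [toy_mutate(ind, rng) for ind in population]
        scored = [(ind, toy_evaluate(ind)) for ind in population + offspring]
        front = non_dominated(scored)
        # Survivors: the current front first, topped up with randomly chosen others.
        rest = [p for p in scored if p not in front]
        rng.shuffle(rest)
        population = [ind for ind, _ in (front + rest)[:pop_size]]
    return non_dominated([(ind, toy_evaluate(ind)) for ind in population])


pareto_front = evolve()
for individual, (objective_1, objective_2) in sorted(pareto_front):
    print(f"x={individual[0]:.3f}  objective 1={objective_1:.3f}  objective 2={objective_2:.3f}")
```

The returned list plays the role of the Pareto front: no member is beaten on both objectives by any other member, which is the property the plotted fronts rely on.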

Selection of best model

After running the evolutionary algorithm for all the models, we can plot the Pareto frontier. The next figure shows the calculated Pareto fronts for the analysed models with the given dataset.

For this particular case, we clearly observe that the model with the Correlation Remover as its mitigation method dominates the other approaches. Consequently, we could infer that this model represents a better fairness-accuracy trade-off.

Interestingly, the models present a negative correlation between fairness and accuracy, indicating that increasing one metric means a decrease in the other for all the methods.

Figure 1. Pareto fronts for the tested models


Finally, after the Pareto fronts calculation, we can select the “best” model according to the cost function proposed by Haas.

This is a cost-based analysis that linearly combines both metrics (accuracy and fairness) using the weights α and β, and can be used to select the model that presents the lowest cost. The equation is as follows:
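The costs reported in the table below are consistent with the following form, in which α scales the fairness shortfall and β scales the ROC AUC shortfall. This form is inferred from the tabulated values rather than quoted from Haas, so it should be read as an assumption:

cost = α · (1 − fairness) + β · (1 − ROC AUC)

A short Python check against the base-model values, where the function name weighted_cost is illustrative:

```python
def weighted_cost(roc_auc, fairness, alpha, beta):
    """Cost form inferred from the table: alpha weights the fairness shortfall,
    beta weights the ROC AUC shortfall (an assumption, not Haas's notation)."""
    return alpha * (1.0 - fairness) + beta * (1.0 - roc_auc)


# (alpha, beta, ROC AUC, fairness, reported cost) for the base model.
base_model_rows = [
    (1, 1, 0.749, 0.826, 0.425),
    (3, 1, 0.727, 0.845, 0.737),
    (1, 3, 0.749, 0.826, 0.926),
]

computed_costs = []
for alpha, beta, roc_auc, fairness, reported in base_model_rows:
    cost = weighted_cost(roc_auc, fairness, alpha, beta)
    computed_costs.append(cost)
    print(f"alpha={alpha}, beta={beta}: computed {cost:.3f}, reported {reported}")
```

The computed and reported values agree to within rounding of the three-decimal inputs.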


The following table summarises the numeric results of the cost calculation for all the models with the previous equation.

As we can see, these results confirm the initial assumption, drawn from the Pareto fronts, that the LR model with the correlation remover mitigator is the best model. Indeed, this superiority is replicated for all the scenarios (equal weights, and each metric in turn weighted more heavily than the other), with this model presenting the lowest cost in comparison with the remaining methods.

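α=1, β=1
Model                               Cost    ROC AUC   Fairness
Base model                          0.425   0.749     0.826
LR with correlation remover         0.365   0.747     0.887
LR with calibrated equalized odds   0.413   0.709     0.877

α=3, β=1
Model                               Cost    ROC AUC   Fairness
Base model                          0.737   0.727     0.845
LR with correlation remover         0.544   0.715     0.913
LR with calibrated equalized odds   0.579   0.651     0.923

α=1, β=3
Model                               Cost    ROC AUC   Fairness
Base model                          0.926   0.749     0.826
LR with correlation remover         0.871   0.749     0.882
LR with calibrated equalized odds   0.994   0.709     0.878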

Achieving balance between accuracy and fairness in machine learning

During this tutorial, we have observed how to apply the framework proposed by Haas to perform a fairness-accuracy trade-off analysis for classification models.

Given a certain scenario (binary classification, in our case), the proposed framework uses a multi-objective optimisation (accuracy and fairness) through an evolutionary algorithm to calculate the Pareto fronts for the selected models, applying a cost function to determine which model presents a better trade-off.

This framework was applied to the Adult dataset, using the Logistic Regression model and the Correlation Remover and Calibrated Equalized Odds mitigators to perform the analysis. After all the calculations, we observed that the best model was the LR with correlation remover, which presented the lowest cost for all the scenarios, even when the relative importance of the analysed metrics changed.

For a comprehensive understanding of the framework discussed, it is advisable to read Haas's original paper. The author notes that the framework is versatile and can be adapted to various scenarios, so there is scope to experiment with different models and metrics to provide insights into the versatility and performance of the proposed approach in diverse contexts.

A complete implementation of this code can be found here.

Holistic AI – guiding organisations towards fair and accurate AI

As seen in this article, there can often be a trade-off between accuracy and fairness in machine learning models. While accuracy has traditionally been the priority, considerations around bias and fairness are becoming increasingly important.

At Holistic AI, we help organisations navigate this complex issue. Our bespoke auditing services analyse your models to uncover biases and suggest techniques to improve fairness without sacrificing predictive performance. We find the right balance for your specific needs.

Schedule a call to learn more about our model auditing and how we can help guide you towards equitable AI.

DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.
