In recent years, the fairness of automated employment decision tools (AEDTs) has received increasing attention. In November 2021, the New York City Council passed Local Law 144, which mandates bias audits of these systems. From 5 July 2023 (enforcement was initially scheduled for 1 January 2023 and then postponed to 15 April 2023), companies that use AEDTs to screen candidates for employment or employees for promotion in the city must have these systems audited by an independent entity. Rules proposed by the Department of Consumer and Worker Protection (DCWP) specify that bias must be determined using impact ratios that compare the outcomes for subgroups based on sex/gender and race/ethnicity categories at a minimum.
In this post, we evaluate how the metrics proposed by the DCWP to calculate impact ratios for regression systems can be fooled by unexpected distributions and by tweaking the data.
A common application of Artificial Intelligence (AI) in recruitment is to determine whether a candidate should be hired or progress to the next stage in the hiring process. This type of task is called binary classification, where the aim is to classify the data (candidates, in our example) into one of two categories: “unsuccessful” (0) and “successful” (1).
According to Local Law 144, the fairness of this algorithm can be measured by comparing the proportion of successful candidates across different categories (e.g. male/female or black/white).
Mathematically, this means that we will need to calculate the selection rate of a group by computing the following ratio:
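In symbols, for a group g:

```latex
\mathrm{selection\ rate}_g
  = \frac{\text{number of successful candidates in group } g}
         {\text{total number of candidates in group } g}
```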
We can then calculate the impact ratio for the binary classification case as the ratio of selection rates between two groups (call them group a and group b):
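Taking group b to be the one with the larger selection rate:

```latex
\mathrm{impact\ ratio}
  = \frac{\mathrm{selection\ rate}_a}{\mathrm{selection\ rate}_b},
\qquad \mathrm{selection\ rate}_b \ge \mathrm{selection\ rate}_a
```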
Note that we always take the larger selection rate as the denominator, so the metric is never greater than 1. It is worth noting that Local Law 144 does not explicitly indicate a threshold for fairness. However, it is common in the field to use the Equal Employment Opportunity Commission’s four-fifths rule, which indicates that the impact ratio should be larger than 4/5 for a selection procedure to be considered fair.
The previous metric is a common way of measuring fairness with binary data. However, what happens when the data is not binary (0 or 1)? Imagine, for example, the case where the CVs of candidates are scored on a range from 0 to 100. This type of task is called regression, where the AEDT aims to predict a number with a continuous value.
In this case, the proposed rules provide a different formula for estimating the impact ratio that is not based on selection rates. Under the first version of the proposed rules, the metric compared the average score across categories (e.g. male/female or black/white).
Mathematically, the average score of a group is computed as the sum of all scores assigned to candidates from the group divided by the total number of candidates from the group, which is represented by the following equation:
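In symbols, writing x_i for the scores of the N_g candidates in group g:

```latex
\bar{x}_g = \frac{1}{N_g} \sum_{i=1}^{N_g} x_i
```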
The metric then uses the ratio of the average scores of the two groups to determine whether bias is occurring, once again taking the largest average score as the denominator.
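That is, with group b the one with the larger average score:

```latex
\mathrm{impact\ ratio} = \frac{\bar{x}_a}{\bar{x}_b},
\qquad \bar{x}_b \ge \bar{x}_a
```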
Again, this metric always results in a value between 0 and 1, with 1 being maximally fair and indicating that both subgroups of the population have the same average score.
At first, this metric may look like a good generalisation of the impact ratio for binary data. Indeed, if we use the regression version of the metric on binary data (0 or 1), the average value for each group is equal to the selection rate:
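Since each score is 0 or 1, the sum of the scores simply counts the successful candidates:

```latex
\bar{x}_g = \frac{1}{N_g} \sum_{i=1}^{N_g} x_i
          = \frac{\#\{\, i : x_i = 1 \,\}}{N_g}
          = \mathrm{selection\ rate}_g
```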
However, regression data is more complex than binary data, and considering only the average brings some important limitations. Indeed, a major limitation is that the data can easily be tweaked to push the computed metric into the fairness range. In the following sections, we will examine how this metric can be fooled and provide evidence for why the initially proposed impact ratio metric is not always adequate for detecting bias in regression data.
Suppose we have recruitment data, along with protected attribute information for two groups (e.g., males and females). If the average score for female candidates is 20 and for male candidates 26, the impact ratio (20/26 ≈ 0.77) is below the 0.80 (4/5) threshold, so the outcome would not be considered fair. However, a simple tweak will push this into the fair range: we can take the highest scoring female and increase their score until the average for female candidates falls within a fair range, as shown in the figures below.
By increasing the score of a single female candidate from 36 to 100, the impact ratio increases from 0.77 to 0.84. Arguably, this has not made the data any less biased, since most candidates remain completely unaffected by the change, yet the metric now falls within the fair range.
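The original figures are not reproduced here, so the sketch below uses hypothetical scores chosen to match the quoted group averages (female mean 20, male mean 26); the `impact_ratio` helper is our own, not part of the DCWP rules.

```python
# Hypothetical scores matching the quoted averages: female mean 20, male mean 26.
female = [12, 15, 18, 19, 20, 36]
male = [20, 24, 25, 27, 28, 32]

def impact_ratio(group_a, group_b):
    """Ratio of group averages, with the larger average as the denominator."""
    avg_a = sum(group_a) / len(group_a)
    avg_b = sum(group_b) / len(group_b)
    return min(avg_a, avg_b) / max(avg_a, avg_b)

print(round(impact_ratio(female, male), 2))  # 0.77 -- below the 4/5 threshold

# Tweak: push the highest-scoring female candidate's score up to 100.
female[-1] = 100
print(round(impact_ratio(female, male), 2))  # 0.85 -- now within the fair range
```

With these invented scores the tweaked ratio comes out at 0.85 rather than the post's 0.84, since the exact dataset differs, but the effect is the same: one outlier flips the verdict.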
This method is not robust, since it would not work with larger sample sizes, and it would be illegal under the Civil Rights Act of 1991, which prohibits adjusting scores based on protected characteristics. However, the example illustrates that the average score is not always a reliable indicator of fairness.
Let's consider another example, where the original data has an impact ratio of 0.66. Since this falls outside the fair range, we can shift all scores by some amount T such that the ratio now falls within the fair range.
The order and distribution of the scores have not changed, as can be seen in the figures below, where we shifted the dataset by a constant value of 30, but now our data is considered fair.
Observe how the impact ratio has changed from 0.66 to 0.83, which is in the suggested fair range. In contrast to the previous method that relies on one candidate, this is a harder transformation to spot.
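Because adding the same constant T to every score adds T to each group's average, the ratio of averages drifts towards 1. A minimal sketch with hypothetical group averages (20 and 30, chosen to reproduce the quoted ratios):

```python
# Hypothetical group averages chosen to reproduce the quoted ratios.
female_avg, male_avg = 20.0, 30.0
print(round(female_avg / male_avg, 2))  # 0.67 -- outside the fair range

# Shifting every score by a constant T shifts each group's average by T.
T = 30
print(round((female_avg + T) / (male_avg + T), 2))  # 0.83 -- now "fair"
```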
Another problem with using the regression disparate impact metric is that the average does not account for the whole dataset. In the binary case, the average value is more informative, in that we can deduce the number of 0s and 1s from the average. In a regression setting, the average carries limited information. For example, we could have vastly different distributions with the same average. Some of these distributions will yield fairer results than others, yet the metric cannot see past the fact that the averages are equal.
Let’s consider again the case where male and female candidates are scored in a range from 0 to 100. Whereas male candidates consistently get scores around 50, female candidates seem to be scored either approximately 25 or 75, as seen in the figure below.
However, these vastly different distributions have equal mean (50). Is this data fair? The answer to this question depends on what success means in this case. If success means scoring above 50, then the data will be perfectly fair. However, if success means being in the top 20% of candidates (which is not an unreasonable assumption in recruitment), then almost all chosen candidates will be female.
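A sketch with synthetic data (normal clusters standing in for the distributions in the figure, which is not reproduced here): both groups average 50, yet a top-20% cut selects almost only female candidates.

```python
import random

random.seed(0)
# Synthetic stand-ins for the figure: males cluster around 50,
# females split into clusters around 25 and 75.
male = [random.gauss(50, 3) for _ in range(200)]
female = ([random.gauss(25, 3) for _ in range(100)]
          + [random.gauss(75, 3) for _ in range(100)])

print(round(sum(male) / len(male)))      # ~50
print(round(sum(female) / len(female)))  # ~50 -- same average

# Hire the top 20% of the combined pool of 400 candidates (80 hires).
cutoff = sorted(male + female, reverse=True)[79]
top_female = sum(s >= cutoff for s in female)
top_male = sum(s >= cutoff for s in male)
print(top_female, top_male)  # nearly all 80 hires are female
```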
Due to concerns raised during the DCWP’s public hearing about the suitability of using the average score to calculate impact ratios, the DCWP’s updated rules now propose a new metric based on the median score for the sample. The median of a dataset is the score lying at the midpoint of the data when it is sorted in ascending order: half of the values fall above the median and the other half below it.
Getting back to our example of CV scores, we can sort our data in increasing order x1, …, xN, as in the figure above. If the number of candidates is odd, as for the male candidates in our example (orange), then the median will be the middle score.
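In symbols, for an odd number of candidates N:

```latex
\mathrm{median} = x_{(N+1)/2}
```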
If instead the number of candidates is even, as for the female candidates in our example (yellow), then the median will fall in between the two central data points.
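In symbols, for an even number of candidates N:

```latex
\mathrm{median} = \frac{x_{N/2} + x_{N/2+1}}{2}
```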
Mathematically, the scoring rate of a group is computed as the number of candidates from a group that have scores above the median score divided by the total number of candidates from that group:
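Writing m for the median score of the full dataset:

```latex
\mathrm{scoring\ rate}_g = \frac{\#\{\, i \in g : x_i > m \,\}}{N_g}
```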
The proposed metric then uses the ratio of the number of people in each group scoring above the median (the scoring rate) to calculate the impact ratio:
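With group b the one with the larger scoring rate:

```latex
\mathrm{impact\ ratio}
  = \frac{\mathrm{scoring\ rate}_a}{\mathrm{scoring\ rate}_b},
\qquad \mathrm{scoring\ rate}_b \ge \mathrm{scoring\ rate}_a
```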
Once again, the denominator is always the group with the highest scoring rate and the outcome is a value between 0 and 1, with 1 being maximally fair and indicating that both subgroups of the population have the same proportion of scores above the median score.
While this new metric may seem like a more natural way to binarise regression data, since it splits the dataset in half and is more robust to outliers and data shifts, it still suffers from shortcomings and may not work for some applications or distributions, as we will investigate in the following sections.
Let’s consider the case where male and female candidates are scored from 0 to 100. Whereas male candidates consistently get scores around 50 (unimodal distribution), female candidates seem to be scored either approximately 25 or 75 (bimodal distribution). The median of the full dataset (across males and females) is 50, because half the data falls below it and the other half falls above it.
Is this data fair? The answer to the question depends on what success means in this case. If all candidates scoring above the median value of 50 are hired, then the data will be perfectly fair. However, if only the top 20% of candidates are hired, then almost all chosen candidates will be female.
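On synthetic data shaped like this example (the actual dataset is not reproduced in the post), the median-based metric reports perfect fairness: both groups have exactly half their scores above the overall median, even though a top-20% cut would select almost only female candidates.

```python
import random

random.seed(0)
# Synthetic stand-ins: males around 50, females split around 25 and 75.
male = [random.gauss(50, 3) for _ in range(200)]
female = ([random.gauss(25, 3) for _ in range(100)]
          + [random.gauss(75, 3) for _ in range(100)])

scores = sorted(male + female)
median = (scores[199] + scores[200]) / 2  # even N: average the two middle values

rate_m = sum(s > median for s in male) / len(male)
rate_f = sum(s > median for s in female) / len(female)
ratio = min(rate_m, rate_f) / max(rate_m, rate_f)
print(ratio)  # 1.0 -- the median-based metric sees no bias at all
```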
Let’s once again consider the case where male and female candidates are scored from 0 to 100. In this example, male candidates are scored either approximately 30 or 70, whereas female candidates are scored either approximately 30 or 80. There are two peaks for each group: the lower one is the same for males and females, while the higher one differs slightly. As in the previous example, the median of the full dataset is approximately 50, because half the data falls below it and the other half falls above it.
Again, deciding whether the data is fair will depend on our definition of success. If all candidates scoring above 50 are hired, then the data will be perfectly fair. This is because we are essentially reducing our data to a binary pass/fail classification, and the difference in the higher scores for males and females will be masked. However, as soon as our notion of success changes, big differences are revealed. To better observe this phenomenon, we can calculate how the impact ratio varies when the candidates hired are respectively the top 50% (which is equivalent to the median), 40%, 30%, 20% and 10%.
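This sweep can be sketched on synthetic data (clusters around 30/70 for males and 30/80 for females, standing in for the figure; `impact_ratio_at` is our own helper):

```python
import random

random.seed(1)
# Synthetic stand-ins: both groups have a low cluster near 30, but the
# high cluster sits near 70 for males and near 80 for females.
male = ([random.gauss(30, 3) for _ in range(100)]
        + [random.gauss(70, 3) for _ in range(100)])
female = ([random.gauss(30, 3) for _ in range(100)]
          + [random.gauss(80, 3) for _ in range(100)])

def impact_ratio_at(top_fraction):
    """Binary impact ratio when the top `top_fraction` of all candidates are hired."""
    pool = sorted(male + female, reverse=True)
    cutoff = pool[int(len(pool) * top_fraction) - 1]
    rate_m = sum(s >= cutoff for s in male) / len(male)
    rate_f = sum(s >= cutoff for s in female) / len(female)
    return min(rate_m, rate_f) / max(rate_m, rate_f)

# Perfectly fair at the median cut, increasingly unfair for stricter cuts.
for frac in (0.5, 0.4, 0.3, 0.2, 0.1):
    print(f"top {int(frac * 100)}%: {impact_ratio_at(frac):.2f}")
```

At the 50% cut the ratio is exactly 1.00, because both high clusters sit entirely above the median; for stricter cuts the hires come increasingly from the female cluster near 80, and the ratio collapses.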
As is seen from the figure above, the computed binary disparate impact will greatly depend on the threshold we use. At the median value, we obtain perfect fairness. For any values above the median, the fairness rapidly decreases due to the distribution of the data.
In conclusion, both proposed impact ratio metrics for regression systems have important limitations, although the revised metric based on the median is a notable improvement. Even so, it still relies on the implicit assumption that splitting the data in half is a meaningful description of the data. Our examples show that this is not always true and depends very much on the specific application and prior knowledge about the data.
As such, we strongly encourage the use of different metrics that are better suited to measuring bias in regression systems. Alternative metrics that consider fairness over the whole distribution, statistical tests that compare distributions, or metrics that compare the ranking of candidates rather than the scores themselves would all be more suitable.
Holistic AI’s open-source library contains a variety of metrics for both binary and regression systems, as well as bias mitigation strategies. To find out more about how we can help you audit your AEDTs for bias, get in touch at email@example.com.
Authored by Dr Sara Zannone, Head of Research at Holistic AI, and Giulio Filippi, Applied Researcher at Holistic AI.