Overcoming Small Sample Sizes When Identifying Bias

January 19, 2023

Amid concerns about the use of artificial intelligence and algorithms in high-stakes decision-making, such as employment-related decisions, the New York City Council passed a landmark piece of legislation that mandates bias audits of automated employment decision tools (AEDTs) used to evaluate candidates for employment or employees for promotion within New York City.

Under Local Law 144, employers and employment agencies are required to commission independent, impartial bias audits of their tools. Under the latest version of the Department of Consumer and Worker Protection’s (DCWP) proposed rules, bias should be determined using impact ratios based on outcomes for different subgroups. In this blog post, we outline the metrics required to conduct the bias audit, how small sample sizes can pose issues, and how they can be dealt with when carrying out audits.

Key takeaways:

  • For systems that result in a binary output, the selection rate of each subgroup is used to calculate the impact ratio by comparing it to the group with the highest rate.
  • For systems that result in a continuous output, the scoring rate for each subgroup is calculated by assigning scores to pass/fail depending on whether they are above or below the median score for the sample.
  • Scoring rates are then used to calculate the impact ratio by comparing the rate of each subgroup to the group with the highest rate.
  • These metrics are similar to the Equal Employment Opportunity Commission’s four-fifths rule.
  • The four-fifths rule can give results that conflict with other metrics and can produce false positives when sample sizes are small.
  • Local Law 144 does not provide guidance on small sample sizes but the delayed enforcement date gives more time to collect additional data.

Impact ratio metrics

In line with the requirements of the New York City bias audit law, impact ratios should be calculated for subgroups based on sex/gender (male, female, and other) and race/ethnicity (Hispanic or Latino, White, Black or African American, Native Hawaiian or Pacific Islander, Asian, Native American or Alaska Native, and two or more races). For AEDTs that result in a binary outcome such as being selected for the position, the impact ratio is calculated using the selection rate for each subgroup, or percentage of applicants from each being allocated to the positive condition:

Impact ratio = (selection rate for a subgroup) / (selection rate of the subgroup with the highest selection rate)
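The impact ratio for a binary outcome can be sketched in a few lines of Python. This is a minimal illustration rather than an official implementation: the function names, the record format, and the sample data are all assumptions made for the example.

```python
from collections import Counter

def selection_rates(records):
    """records: iterable of (subgroup, selected) pairs, where selected is a bool."""
    totals, selected = Counter(), Counter()
    for group, was_selected in records:
        totals[group] += 1
        selected[group] += int(was_selected)
    return {g: selected[g] / totals[g] for g in totals}

def impact_ratios(rates):
    """Divide each subgroup's rate by the highest subgroup rate."""
    best = max(rates.values())
    if best == 0:
        raise ValueError("no subgroup has a positive rate")
    return {g: r / best for g, r in rates.items()}

# Hypothetical data: 40 of 100 male and 30 of 100 female applicants selected.
records = (
    [("male", True)] * 40 + [("male", False)] * 60
    + [("female", True)] * 30 + [("female", False)] * 70
)
rates = selection_rates(records)
ratios = impact_ratios(rates)
print(ratios)
```

The subgroup with the highest selection rate always receives an impact ratio of 1, and every other subgroup is expressed as a fraction of it.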

On the other hand, the impact ratio calculation for systems that result in a continuous outcome, such as a score, is based on the so-called scoring rate, which is calculated by binarising the scores into a pass/fail categorisation. According to this metric, individuals scoring above the median score for the sample are allocated to the pass category, while those scoring below are allocated to the fail category. Once the scoring rate has been calculated, it is used in the same way as the selection rate to calculate the impact ratio for each subgroup:

Impact ratio = (scoring rate for a subgroup) / (scoring rate of the subgroup with the highest scoring rate)
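The scoring rate calculation can be sketched as follows. The function name and the scores are hypothetical, and the sketch assumes that a score exactly equal to the median counts as a fail, a case the description above leaves open.

```python
from statistics import median

def scoring_rates(scores_by_group):
    """scores_by_group: dict mapping subgroup -> list of numeric scores."""
    all_scores = [s for scores in scores_by_group.values() for s in scores]
    cutoff = median(all_scores)  # median of the whole sample, not per subgroup
    # Assumption: "pass" means strictly above the sample median.
    return {
        g: sum(s > cutoff for s in scores) / len(scores)
        for g, scores in scores_by_group.items()
    }

# Hypothetical scores for two subgroups.
scores = {
    "group_a": [55, 60, 72, 80, 91],
    "group_b": [40, 52, 58, 66, 75],
}
rates = scoring_rates(scores)
best = max(rates.values())
ratios = {g: r / best for g, r in rates.items()}
print(rates, ratios)
```

Because the cutoff is the median of the pooled sample, roughly half of all individuals pass overall, and the impact ratio reflects how unevenly those passes are spread across subgroups.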

The four-fifths rule

Although not explicitly referenced in the NYC legislation, the above metrics are similar to the Equal Employment Opportunity Commission’s (EEOC) four-fifths rule, which is used to determine whether a hiring procedure results in adverse or disparate impact. According to the rule, the hiring rate of one group should not be less than four-fifths (.80) of the hiring rate of the group with the highest hiring rate. The ratio is calculated in the same way as the first impact ratio provided above. While the .80 threshold is not endorsed by Local Law 144, the metric is the same and is therefore subject to the same limitations as the four-fifths rule.

Decades of research into the metric highlight how it can be problematic when the calculation is made using small sample sizes. Indeed, research shows that the four-fifths rule can result in false positives when sample sizes are small, and that the z-test, an alternative metric used to calculate disparate impact that is more suited to smaller sample sizes, can produce findings that conflict with the four-fifths rule. Even the Uniform Guidelines on Employee Selection Procedures, which introduced the four-fifths rule, note that violations of the four-fifths rule may not constitute disparate impact if the calculation is based on small sample sizes.
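To illustrate how the two approaches can disagree, the sketch below computes both the four-fifths ratio and a pooled two-proportion z-test for the same hypothetical counts. The function names and the numbers are assumptions for the example, and the z-test shown is the standard textbook form rather than a procedure prescribed by the EEOC or the DCWP.

```python
import math

def four_fifths_ratio(sel_a, n_a, sel_b, n_b):
    """Ratio of the lower selection rate to the higher one."""
    rate_a, rate_b = sel_a / n_a, sel_b / n_b
    high = max(rate_a, rate_b)
    if high == 0:
        raise ValueError("no one was selected in either group")
    return min(rate_a, rate_b) / high

def two_proportion_z_test(sel_a, n_a, sel_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = sel_a / n_a, sel_b / n_b
    pooled = (sel_a + sel_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 50 of 100 selected in group A, 3 of 10 in group B.
ratio = four_fifths_ratio(50, 100, 3, 10)
z, p_value = two_proportion_z_test(50, 100, 3, 10)
print(f"impact ratio={ratio:.2f}, z={z:.2f}, p={p_value:.3f}")
```

With these counts the impact ratio falls below .80, yet the z-test does not reach the conventional .05 significance level, which is the kind of discrepancy described above.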

Although there is no hard and fast rule for the minimum sample size that is required for the four-fifths rule, the EEOC’s clarifications on the Uniform Guidelines specify that adverse impact analysis should only be carried out for groups who represent at least 2% of the labor force. To support this, power analysis can be used to determine the statistical power for different sample sizes, and it is recommended that subgroups should represent at least 5-10% of the sample for the analysis to be meaningful.
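A rough power analysis along these lines can be sketched with the normal approximation for a two-proportion z-test. Everything specific here is an assumption for illustration: the function name, the selection rates of 50% and 40% (an impact ratio of .80), the total sample of 1,000, and the 5% significance level.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_ref, p_sub, n_ref, n_sub, alpha=0.05):
    """Approximate power of a two-sided, pooled two-proportion z-test.

    Uses the normal approximation and ignores the negligible chance of
    rejecting in the wrong direction.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    pooled = (p_ref * n_ref + p_sub * n_sub) / (n_ref + n_sub)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_sub))
    se_alt = sqrt(p_ref * (1 - p_ref) / n_ref + p_sub * (1 - p_sub) / n_sub)
    return norm.cdf((abs(p_ref - p_sub) - z_crit * se_null) / se_alt)

# Hypothetical scenario: 1,000 applicants, true selection rates of 50% vs 40%.
total = 1000
powers = {}
for share in (0.01, 0.02, 0.05, 0.10):
    n_sub = round(total * share)
    powers[share] = approx_power(0.50, 0.40, total - n_sub, n_sub)
    print(f"subgroup share {share:.0%}: n={n_sub}, power={powers[share]:.2f}")
```

Under these assumptions, power rises as the subgroup makes up a larger share of the sample, which is the reasoning behind recommending a minimum share before treating the analysis as meaningful.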

Calculating impact ratios with small samples

The DCWP’s rules specify that the ethnicity/race categories that should be examined are Hispanic or Latino, White, Black or African American, Native Hawaiian or Pacific Islander, Asian, Native American or Alaska Native, and two or more races. There are likely multiple categories with small samples, particularly for the Native Hawaiian or Pacific Islander, Native American or Alaska Native, and Two or more races categories. However, the DCWP does not provide any clarification on what is considered an adequate sample size for analysis to be meaningful.

Given the issues with identifying bias, or disparate impact, based on small sample sizes, one approach is to combine the Native Hawaiian or Pacific Islander and Native American or Alaska Native categories into one broader “Other” category. However, there is no guarantee that this will increase the sample to a sufficient size for a robust analysis, and it could make it harder to identify and mitigate bias for particular subgroups if they do not have their own category. It is worth noting that the examples provided by the DCWP in the new proposed rules keep the categories separate, with one category listed representing less than 1.5% of the workforce.
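Combining categories before computing rates might look like the sketch below. The function name, the data layout, and the counts are hypothetical; which categories to merge is an analyst's choice, not something the DCWP rules prescribe.

```python
# Categories to fold into a single broader group (an illustrative choice).
SMALL_CATEGORIES = {
    "Native Hawaiian or Pacific Islander",
    "Native American or Alaska Native",
}

def merge_small_categories(counts, to_merge=SMALL_CATEGORIES, label="Other"):
    """counts: dict mapping category -> (selected, total)."""
    merged = {}
    for category, (selected, total) in counts.items():
        key = label if category in to_merge else category
        prev_selected, prev_total = merged.get(key, (0, 0))
        merged[key] = (prev_selected + selected, prev_total + total)
    return merged

# Hypothetical counts of (selected, total) per category.
counts = {
    "White": (120, 300),
    "Asian": (45, 100),
    "Native Hawaiian or Pacific Islander": (2, 6),
    "Native American or Alaska Native": (1, 4),
}
merged = merge_small_categories(counts)
print(merged)
```

In this example the merged category still contains only 10 people, which illustrates why combining categories does not guarantee a sample large enough for a robust analysis.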

Alternatively, analysis can be conducted on individual groups regardless of their sample size, with results based on small sample sizes indicated using an asterisk in the summary of results. This approach is more closely aligned with the rules, particularly since they do not specify how to handle small sample sizes and the examples provided by the DCWP include small samples and analyze the categories separately.
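One way to produce such a summary is sketched below. The function name and the counts are hypothetical, and the threshold of 30 is purely illustrative, since neither Local Law 144 nor the DCWP's proposed rules define what counts as a small sample.

```python
def summarize(counts, min_n=30):
    """counts: dict mapping category -> (selected, total).

    Returns rows of (label, sample size, impact ratio); labels of categories
    with fewer than min_n people are marked with an asterisk.
    min_n is an illustrative threshold, not a value taken from the rules.
    """
    rates = {c: s / n for c, (s, n) in counts.items() if n > 0}
    best = max(rates.values())
    if best == 0:
        raise ValueError("no category has a positive selection rate")
    rows = []
    for category, rate in rates.items():
        n = counts[category][1]
        flag = "*" if n < min_n else ""
        rows.append((category + flag, n, round(rate / best, 2)))
    return rows

# Hypothetical counts of (selected, total) per category.
counts = {
    "White": (120, 300),
    "Asian": (45, 100),
    "Native Hawaiian or Pacific Islander": (2, 6),
}
rows = summarize(counts)
for label, n, ratio in rows:
    print(f"{label:40s} n={n:4d} impact ratio={ratio:.2f}")
print("* based on a small sample; interpret with caution")
```

Every category keeps its own row, so no subgroup is hidden, while the asterisk signals to the reader which ratios rest on limited data.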

More time for compliance

Since the enforcement date of Local Law 144 has been delayed from 1 January 2023 to 15 April 2023, employers, employment agencies, and vendors have time to collect additional data to increase their sample sizes and make the analysis more robust. In the coming weeks, the DCWP may also release guidance on calculating impact ratios when sample sizes are small or may propose an alternative metric that is more suited to smaller samples. Our open-source library has a range of metrics that can be used to identify bias, as well as mitigation strategies for when bias is identified.

Schedule a demo to find out more about how Holistic AI can help you become compliant with the NYC Bias Audit Law.

Written by Airlie Hilliard, Senior Researcher at Holistic AI and Lindsay Carignan, Head of Customer Success at Holistic AI
