Amid concerns about the use of artificial intelligence and algorithms in high-stakes decision-making, such as for making employment-related decisions, the New York City Council passed a landmark piece of legislation that mandates bias audits of automated employment decision tools (AEDTs) being used to evaluate candidates for employment or employees for promotion within New York City.
Under Local Law 144, employers and employment agencies are required to commission independent, impartial bias audits of their tools, where, under the latest version of the Department of Consumer and Worker Protection’s (DCWP) proposed rules, bias should be determined using impact ratios based on outcomes for different subgroups. In this blog post, we outline the metrics required to conduct the bias audit, how small sample sizes can pose issues, and how they can be dealt with when carrying out audits.
In line with the requirements of the New York City bias audit law, impact ratios should be calculated for subgroups based on sex/gender (male, female, and other) and race/ethnicity (Hispanic or Latino, White, Black or African American, Native Hawaiian or Pacific Islander, Asian, Native American or Alaska Native, and two or more races). For AEDTs that result in a binary outcome such as being selected for the position, the impact ratio is calculated using the selection rate for each subgroup, that is, the proportion of applicants in each subgroup who are allocated to the positive condition:
Impact ratio = selection rate for a category / selection rate of the most selected category
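As a rough illustration of the selection-rate calculation, the sketch below computes selection rates and impact ratios from a list of (category, selected) pairs. The function name, the data layout, and the applicant counts are hypothetical and chosen only for the example; they are not taken from the DCWP rules or from any particular tool.

```python
from collections import Counter

def selection_impact_ratios(records):
    """records: iterable of (category, selected) pairs, where selected is a bool.

    Returns {category: (selection_rate, impact_ratio)}.
    """
    applicants = Counter()
    selected = Counter()
    for category, was_selected in records:
        applicants[category] += 1
        if was_selected:
            selected[category] += 1
    # Selection rate: share of each category's applicants who were selected.
    rates = {c: selected[c] / applicants[c] for c in applicants}
    best = max(rates.values())
    # Impact ratio: each category's rate relative to the highest selection rate.
    return {
        c: (rate, rate / best if best > 0 else float("nan"))
        for c, rate in rates.items()
    }

# Hypothetical data: 40 of 100 male and 30 of 100 female applicants selected.
records = (
    [("Male", True)] * 40 + [("Male", False)] * 60
    + [("Female", True)] * 30 + [("Female", False)] * 70
)
result = selection_impact_ratios(records)
for category, (rate, ratio) in result.items():
    print(f"{category}: selection rate={rate:.2f}, impact ratio={ratio:.2f}")
```

In this made-up sample, the female impact ratio is roughly 0.75, since the female selection rate (0.30) is divided by the highest rate (0.40).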
On the other hand, the impact ratio calculation for systems that result in a continuous outcome, such as a score, is based on the so-called scoring rate, which is calculated by binarising the scores to a pass/fail categorisation. According to this metric, individuals scoring above the median score for the sample are allocated to the pass category, while those scoring below are allocated to the fail category. Once the scoring rate has been calculated, it is then used in a similar way to the selection rate to calculate the impact ratio for each subgroup:
Impact ratio = scoring rate for a category / scoring rate of the highest scoring category
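The scoring-rate calculation can be sketched in the same way. The example below binarises scores at the median of the whole sample and then computes each category's scoring rate and impact ratio. The function name and the scores are hypothetical, and treating a score exactly equal to the median as a fail is an assumption made for this sketch, since the text only describes scores above and below the median.

```python
from statistics import median

def scoring_impact_ratios(scores_by_category):
    """scores_by_category: {category: list of numeric scores}.

    Returns {category: (scoring_rate, impact_ratio)}.
    """
    # The median is taken over the full sample, not within each category.
    all_scores = [s for scores in scores_by_category.values() for s in scores]
    cutoff = median(all_scores)
    # Scoring rate: share of each category scoring above the sample median.
    # Assumption: a score equal to the median counts as a fail.
    rates = {
        c: sum(s > cutoff for s in scores) / len(scores)
        for c, scores in scores_by_category.items()
    }
    best = max(rates.values())
    return {
        c: (rate, rate / best if best > 0 else float("nan"))
        for c, rate in rates.items()
    }

# Hypothetical scores for two categories.
scores = {
    "Group A": [55, 60, 72, 80, 91],
    "Group B": [40, 52, 58, 66, 75],
}
scoring_result = scoring_impact_ratios(scores)
for category, (rate, ratio) in scoring_result.items():
    print(f"{category}: scoring rate={rate:.2f}, impact ratio={ratio:.2f}")
```

Here the sample median is 63, so three of five Group A scores and two of five Group B scores fall in the pass category.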
Although not explicitly referenced in the NYC legislation, the above metrics are similar to the Equal Employment Opportunity Commission’s (EEOC) four-fifths rule, which is used to determine whether a hiring procedure results in adverse or disparate impact. According to the rule, the hiring rate of one group should not be less than four-fifths (.80) of the hiring rate of the group with the highest hiring rate; this ratio is calculated in the same way as the first impact ratio provided above. While the .80 threshold is not endorsed by Local Law 144, since the metric is the same, it is still subject to the same limitations as the four-fifths rule.
Decades of research into the metric highlight how it can be problematic when the calculation is made using small sample sizes. Indeed, research shows that the four-fifths rule can result in false positives when sample sizes are small, and that the z-test, an alternative metric used to calculate disparate impact that is more suited to smaller sample sizes, can produce findings that are discrepant with the four-fifths rule. Even the Uniform Guidelines on Employee Selection Procedures, which introduced the four-fifths rule, note that violations of the four-fifths rule may not constitute disparate impact if the calculation is based on small sample sizes.
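To make the potential for discrepant findings concrete, the sketch below applies both the four-fifths threshold and a pooled two-proportion z-test to the same small sample. The function name, the counts, and the 0.05 significance level are assumptions for illustration; the z-test here is the standard normal-approximation form, which may differ in detail from the variants used in the research literature. The sketch also assumes that at least one applicant is selected and that not everyone is selected, so no division by zero occurs.

```python
from math import sqrt
from statistics import NormalDist

def four_fifths_and_z_test(sel_a, n_a, sel_b, n_b, alpha=0.05):
    """Compare the four-fifths rule with a pooled two-proportion z-test.

    sel_a, sel_b: number selected in each group; n_a, n_b: group sizes.
    """
    rate_a, rate_b = sel_a / n_a, sel_b / n_b
    # Four-fifths rule: lower rate divided by the higher rate.
    impact_ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
    # Pooled standard error under the null hypothesis of equal rates.
    pooled = (sel_a + sel_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "impact_ratio": impact_ratio,
        "violates_four_fifths": impact_ratio < 0.8,
        "z": z,
        "p_value": p_value,
        "significant": p_value < alpha,
    }

# Hypothetical small sample: 5 of 10 selected in one group, 3 of 10 in the other.
comparison = four_fifths_and_z_test(5, 10, 3, 10)
print(comparison)
```

With these made-up counts the impact ratio falls below .80, yet the z-test does not reach significance at the assumed 0.05 level, which is the kind of disagreement described above.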
Although there is no hard and fast rule for the minimum sample size that is required for the four-fifths rule, the EEOC’s clarifications on the Uniform Guidelines specify that adverse impact analysis should only be carried out for groups who represent at least 2% of the labor force. To support this, power analysis can be used to determine the statistical power for different sample sizes, and it is recommended that subgroups should represent at least 5-10% of the sample for the analysis to be meaningful.
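One way to explore how subgroup size affects statistical power is sketched below, using a common normal approximation for the power of a two-sided two-proportion z-test. The function name, the total sample of 1,000, the selection rates of 0.50 and 0.40 (an impact ratio of .80), and the 0.05 significance level are all assumptions chosen for illustration, not values from the EEOC or the DCWP. The approximation ignores the negligible opposite tail.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_ref, p_sub, n_ref, n_sub, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    p_ref, p_sub: assumed true selection rates of the reference group
    and the subgroup; n_ref, n_sub: their sample sizes.
    """
    pooled = (p_ref * n_ref + p_sub * n_sub) / (n_ref + n_sub)
    # Standard error under the null (equal rates) and under the alternative.
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_sub))
    se_alt = sqrt(p_ref * (1 - p_ref) / n_ref + p_sub * (1 - p_sub) / n_sub)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf((abs(p_ref - p_sub) - z_crit * se_null) / se_alt)

# Hypothetical scenario: 1,000 applicants in total, with the subgroup
# making up 2%, 5%, or 10% of the sample.
total = 1000
powers = {}
for share in (0.02, 0.05, 0.10):
    n_sub = round(total * share)
    powers[share] = approx_power(0.5, 0.4, total - n_sub, n_sub)
    print(f"subgroup share {share:.0%}: n={n_sub}, approximate power={powers[share]:.2f}")
```

Under these assumptions the approximate power rises as the subgroup's share of the sample grows, which is consistent with the recommendation that very small subgroups yield less meaningful analyses.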
The DCWP’s rules specify that the ethnicity/race categories that should be examined are Hispanic or Latino, White, Black or African American, Native Hawaiian or Pacific Islander, Asian, Native American or Alaska Native, and two or more races. There are likely multiple categories with small samples, particularly for the Native Hawaiian or Pacific Islander, Native American or Alaska Native, and Two or more races categories. However, the DCWP does not provide any clarification on what is considered an adequate sample size for analysis to be meaningful.
Given the issues with identifying bias, or disparate impact, based on small sample sizes, one approach is to combine the Native Hawaiian or Pacific Islander and Native American or Alaska Native categories into one broader “Other” category. However, there is no guarantee that this will increase the sample to a sufficient size for a robust analysis, and it could make it harder to identify and mitigate bias for particular subgroups if they do not have their own category. It is worth noting that the examples provided by the DCWP in the new proposed rules keep the categories separate, with one category listed representing less than 1.5% of the workforce. However, the adopted version of the rules does allow auditors to exclude groups representing less than 2% of the sample, provided that the number of applicants in that category and the scoring rate or selection rate of that category are included in the summary of results.
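The exclusion option can be sketched as follows: categories below 2% of the sample are left out of the impact ratio calculation, but their applicant counts and selection rates are still reported. The function name, the data layout, and the counts are hypothetical, and taking the highest selection rate only from the included categories is an assumption of this sketch rather than something stated in the text above. The sketch assumes every category has at least one applicant and at least one included category has a non-zero selection rate.

```python
def summarize_with_exclusions(counts, min_share=0.02):
    """counts: {category: (applicants, selected)}.

    Categories below min_share of the sample get no impact ratio,
    but their applicant count and selection rate are still reported.
    """
    total = sum(n for n, _ in counts.values())
    rates = {c: sel / n for c, (n, sel) in counts.items()}
    included = {c for c, (n, _) in counts.items() if n / total >= min_share}
    # Assumption: the reference rate comes from included categories only.
    best = max(rates[c] for c in included)
    summary = {}
    for c, (n, _) in counts.items():
        summary[c] = {
            "applicants": n,
            "selection_rate": rates[c],
            "impact_ratio": rates[c] / best if c in included else None,
        }
    return summary

# Hypothetical counts: (applicants, selected) per category.
counts = {
    "White": (500, 200),
    "Asian": (300, 105),
    "Native Hawaiian or Pacific Islander": (10, 2),
}
summary = summarize_with_exclusions(counts)
for category, row in summary.items():
    print(category, row)
```

In this made-up sample the smallest category accounts for about 1.2% of applicants, so it receives no impact ratio while its count and selection rate remain in the summary.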
Alternatively, analysis can be conducted on individual groups regardless of their sample size, with results based on small sample sizes indicated using an asterisk in the summary of results. This approach will enable greater compliance with the rules, particularly since they do not specify how to handle small sample sizes and the examples provided by the DCWP include small samples and analyze the categories separately.
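A minimal sketch of the asterisk approach is shown below: every category is analyzed, and those falling under a chosen size threshold are marked in the summary. The function name and counts are hypothetical, and the 2% cut-off used to decide which results get an asterisk is an assumption borrowed from the exclusion threshold, since no threshold for flagging is specified.

```python
def format_summary(counts, small_share=0.02):
    """counts: {category: (applicants, selected)}.

    Returns one summary line per category, with an asterisk after the
    name of any category below small_share of the total sample.
    """
    total = sum(n for n, _ in counts.values())
    rates = {c: sel / n for c, (n, sel) in counts.items()}
    # All categories are analyzed, so the highest rate is taken over all of them.
    best = max(rates.values())
    lines = []
    for c, (n, _) in counts.items():
        flag = "*" if n / total < small_share else ""
        lines.append(
            f"{c}{flag}: n={n}, selection rate={rates[c]:.2f}, "
            f"impact ratio={rates[c] / best:.2f}"
        )
    return lines

# Hypothetical counts: (applicants, selected) per category.
flag_counts = {
    "White": (500, 200),
    "Asian": (300, 105),
    "Native Hawaiian or Pacific Islander": (10, 2),
}
summary_lines = format_summary(flag_counts)
for line in summary_lines:
    print(line)
print("* result based on a small sample")
```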
Since the enforcement date of Local Law 144 has been delayed from 1 January 2023 to 5 July 2023, employers, employment agencies, and vendors have time to collect additional data to increase their sample sizes and make the analysis more robust. In the coming weeks, the DCWP may also release guidance on calculating impact ratios when sample sizes are small or may propose an alternative metric that is more suited to smaller samples. Our open-source library has a range of metrics that can be used to identify bias, as well as mitigation strategies for when bias is identified.
DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.