With artificial intelligence (AI) being increasingly used in high-stakes applications, such as in the military, for recruitment, and for insurance, there are growing concerns about the risks this can bring. Algorithms can introduce novel sources of harm, with issues such as bias being amplified and perpetuated by the use of AI. As such, recent years have seen a number of controversies around the misuse of AI across a range of sectors. In this blog post, we explore some of the harms we have seen from the use of AI, where things went wrong, and how they could have been prevented or mitigated by adopting an ethical approach to AI and risk management practices.
Perhaps one of the most high-profile cases of AI harm is Northpointe’s COMPAS tool for predicting recidivism, or reoffending. The company, now known as Equivant, offers intelligent Case Management Solutions (CMS) for use in the legal process. COMPAS based its predictions on scores derived from answers to a set of 137 questions, with answers either provided by the defendants themselves or pulled from criminal records.
An investigation into the tool by ProPublica, using data on 11,757 people evaluated pre-trial by the tool in 2013 and 2014 in Broward County, Florida, had some concerning findings. Each defendant was given three scores – Risk of Recidivism, Risk of Violence, and Risk of Failure to Appear – and assigned to a low-risk, medium-risk, or high-risk category. Matching individuals to their criminal records using their name and date of birth, and using racial categorisations from the Broward County Sheriff’s Office, the investigation found that black defendants who did not reoffend over a two-year period were almost twice as likely as white defendants to be incorrectly predicted to be at a higher risk of reoffending (45% vs 23%, respectively). The reverse was true for white defendants, who were more likely to be incorrectly predicted to have a lower risk score. Further, black defendants were 45% more likely than white defendants to be assigned a higher risk category, even when prior crimes, future recidivism, age, and gender were controlled for.
Although there is a lack of transparency about how the questionnaire answers are converted into scores by the algorithm, the algorithm was likely trained, at least in part, on prior human judgements. Given that there is a long history of racial bias in the legal system, some of the judgements used to train the models were likely biased, and that bias was then echoed by the software. One way to combat this is to ensure that the training data for the model is as unbiased as possible. However, there are also more technical approaches to mitigating bias, such as reweighing the training data so that protected characteristics and outcomes are statistically independent, or removing features associated with protected characteristics entirely.
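To make the reweighing idea concrete, here is a minimal sketch of the classic approach due to Kamiran and Calders: each training example is weighted so that group membership and outcome become statistically independent. The data below is a toy example, not drawn from COMPAS or any real system.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Per-example weights making group and label independent
    (Kamiran & Calders' reweighing):
    weight = P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] / n) * (label_counts[y] / n) / (joint_counts[(g, y)] / n)
        for g, y in zip(groups, labels)
    ]

# Toy data: group "a" is mostly labelled 1, group "b" mostly labelled 0.
groups = ["a", "a", "a", "b", "b", "b"]
labels = [1, 1, 0, 0, 0, 1]
weights = reweighing_weights(groups, labels)
# Overrepresented combinations (e.g. group "a" with label 1) get weights
# below 1; underrepresented ones (e.g. group "a" with label 0) get
# weights above 1, counteracting the historical skew.
```

These weights would then be passed to a learner that supports sample weights, so that the model no longer learns the historical association between group and outcome.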
Another infamous example of AI harm is Amazon’s scrapped resume screening tool. Trained on the resumes of people who had applied for technical positions at the company over the previous 10 years, the algorithm penalised resumes containing the word “women’s” (e.g., if the applicant attended a women’s college), making it biased against female applicants.
This is because the data used to train the tool came mostly from male applicants, reflecting the gender imbalance in the tech industry. Since the male resumes did not include the word “women’s”, any resumes containing the word were penalised for not echoing the resumes of previously successful applicants. Fortunately, the tool was retired before it was ever deployed, meaning the harm remained potential rather than actual. Nevertheless, this highlights what can go wrong when certain subgroups are underrepresented in the training data.
Indeed, had the training data been examined for representativeness, it may have been foreseeable that the algorithm would perform less well for female applicants. The issue could have been mitigated by collecting additional data, or taking other steps, to ensure that the training data represented as many different groups as possible – particularly in relation to sex/gender and race/ethnicity, although other characteristics such as age and disability could also be considered. The more representative the training data, the better the chances that the model will perform well across different groups.
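A representativeness check like this can be automated before training begins. The sketch below, using entirely hypothetical applicant records and a threshold chosen for illustration, flags subgroups whose share of the data falls below a chosen minimum:

```python
from collections import Counter

def subgroup_shares(records, key):
    """Share of each subgroup in a dataset of dict-like records."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

def flag_underrepresented(records, key, threshold=0.2):
    """List subgroups whose share falls below the threshold."""
    return [g for g, share in subgroup_shares(records, key).items()
            if share < threshold]

# Hypothetical applicant data: 9 male records, 1 female record.
applicants = [{"gender": "male"}] * 9 + [{"gender": "female"}] * 1
flagged = flag_underrepresented(applicants, "gender")
# → ["female"] — a signal to collect more data before training.
```

A flagged subgroup would prompt further data collection, resampling, or at minimum disaggregated evaluation before the model is deployed.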
On a similar note, the well-known Gender Shades project examined the accuracy of commercial AI-driven gender classification tools, also finding worrying results. Investigating tools offered by IBM, Microsoft, and Face++, the researchers found that all of the systems had higher error rates for females than males, with differences in error rates ranging from 8.1% to 20.6%. Similarly, the tools were more accurate for lighter-skinned subjects than darker-skinned subjects, with differences in error rates ranging from 11.8% to 19.2%. The largest disparity, however, was between lighter-skinned males and darker-skinned females, with error rates of up to 0.8% and 34.7%, respectively.
Again, the poor performance of the tools for females and those with darker skin is unsurprising considering that white males were overwhelmingly represented in the training data. The systems also failed to report accuracy separately for different demographics, meaning that a system’s overall performance can be skewed if it performs particularly well for certain subgroups, especially if those subgroups are overrepresented in the test data.
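Disaggregating accuracy is straightforward in practice. The sketch below, using fabricated toy predictions rather than any real system’s outputs, shows how an aggregate figure can hide a large gap between subgroups:

```python
from collections import defaultdict

def error_rates_by_group(y_true, y_pred, groups):
    """Error rate (1 - accuracy) per demographic subgroup, so overall
    accuracy cannot mask poor performance on a minority group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        errors[g] += int(yt != yp)
    return {g: errors[g] / totals[g] for g in totals}

# Toy data: the classifier is perfect on group "light", poor on "dark".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
groups = ["light"] * 4 + ["dark"] * 4
rates = error_rates_by_group(y_true, y_pred, groups)
# → {'light': 0.0, 'dark': 0.5}, even though overall accuracy is 75%.
```

Reporting metrics like these per subgroup, rather than a single headline figure, is exactly what Gender Shades argued vendors should do.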
Considering the quality and representativeness of training data is a key step towards ensuring that systems perform acceptably across different groups. Comprehensive documentation can also support this: it can be used to justify why certain steps were taken and decisions were made, and it is a useful resource to consult in the event that systems do not perform as expected.
While many of the harms surrounding the use of AI concern bias, the failure of Knight Capital's trading algorithm highlights the importance of ensuring that algorithms are efficacious and work as expected. A leading financial services firm with a market share of around 17% on the New York Stock Exchange, the company suffered a $440 million loss in 2012 after a flaw in its new trading algorithm saw it purchase around 150 stocks at a cost of roughly $7 billion – all within the first hour of trading.
Obliged to pay for the shares the algorithm had acquired three days later, Knight Capital tried to get the trades cancelled but was refused by the Securities and Exchange Commission (SEC), forcing the company to sell off all of the erroneously purchased shares. Goldman Sachs then stepped in to buy the shares, at a $440 million cost to Knight.
The error was caused by an engineer failing to copy the new software code to one of the company’s servers, and it had major consequences: Knight later required a $400 million cash infusion from investors and was acquired by rival company Getco LLC the following year. The lack of procedures to verify that the new software had been deployed correctly meant that the error went unnoticed until it was too late, highlighting the importance of adequate oversight mechanisms to prevent financially and reputationally expensive mistakes when using algorithms.
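While the specifics of Knight’s deployment process are not public, one simple safeguard against this class of failure is to verify that every server reports an identical fingerprint of the deployed code before it goes live. A minimal sketch, with hypothetical file paths and server names:

```python
import hashlib

def file_digest(path):
    """SHA-256 digest of a deployed artifact on disk."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def deployments_consistent(digests):
    """True only if every server reports the identical artifact digest.

    `digests` maps a server name to the digest that server reported,
    e.g. {"server-1": "ab12...", "server-2": "ab12..."}.
    """
    return len(set(digests.values())) == 1

# In a real rollout, each server would compute file_digest() on its own
# copy of the code and report the result; trading would only be enabled
# once deployments_consistent() returns True.
```

Had a check along these lines gated the start of trading, the one server running stale code would have been caught before the market opened.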
Finally, Apple’s credit card, the Apple Card, offered jointly by Apple and Goldman Sachs, highlights the need for explainable algorithms. The card was called out on social media following claims that a man was given a higher credit limit than his wife, despite her higher credit score. Since this could potentially violate New York law, the state’s Department of Financial Services (DFS) launched an investigation into the algorithm used to determine the credit limits.
The investigation found that the card did not discriminate based on sex, and Goldman Sachs was cleared of illegal activity. Supporting this outcome was the investment banking giant’s ability to explain how the algorithm came to each decision, highlighting the importance of ensuring the explainability of algorithms. The case also highlights the need to examine whether algorithms use proxy variables for protected characteristics, even when the protected characteristics themselves are not included in the model.
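One simple first-pass check for proxy variables is to measure how strongly each candidate feature correlates with a protected characteristic. The sketch below uses a Pearson correlation computed from scratch on fabricated data; the feature name and values are purely illustrative, not taken from the Apple Card case.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: a "shopping_pattern" feature vs. a binary sex
# attribute (0/1). A high correlation suggests the feature may act as
# a proxy even if sex itself is excluded from the model.
sex = [0, 0, 0, 0, 1, 1, 1, 1]
shopping_pattern = [0.10, 0.20, 0.15, 0.10, 0.90, 0.80, 0.85, 0.95]
r = pearson_r(shopping_pattern, sex)
```

A strong correlation would not prove discrimination on its own, but it flags features that warrant closer examination with more rigorous fairness testing.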
Spanning multiple risk verticals and multiple applications of AI, the above examples highlight the importance of comprehensively examining the performance of algorithms and taking steps towards implementing a risk management framework.
A number of upcoming laws, such as the European Union AI Act and laws targeted at the human resources (HR) sector (e.g., NYC Local Law 144 – the Bias Audit law) will soon require companies to take steps to ensure that they minimise the risks of their AI and that it is used safely. Taking steps early is the best way to prepare to be compliant with upcoming legal requirements. Schedule a demo to find out more about how Holistic AI can help you with AI governance, risk, and compliance.
Written by Airlie Hilliard, Senior Researcher at Holistic AI