Insightful Resources for Uncovering Bias in English Speech Recognition

January 27, 2023
Authored by
Cristian Munoz
Machine Learning Researcher at Holistic AI

Speech technologies have become integral to everyday life and have many applications. With Automatic Speech Recognition (ASR), you can control devices and access information simply by using voice commands. A speech-to-text application allows for real-time dictation and note taking, while a speaker identification application can identify who is speaking in an audio sample. The technology can also aid communication between people who speak different languages by converting spoken audio from one language to another. These are just a few of the many tasks speech recognition technology can be used for.

However, as this technology grows in popularity, we are uncovering its fragility. Imagine you're trying to use a virtual assistant like Siri, Alexa or OK Google, but no matter how hard you try, the assistant has trouble understanding you. Even though you're a bilingual or native speaker, you have to repeat yourself multiple times, or even change your accent, before the assistant finally understands you. This unfair behaviour, also called bias, is a system's tendency to perform differently depending on factors such as age, gender and accent, among others. To mitigate bias, it is essential to use diverse training data and to continually evaluate and improve the system's performance on underrepresented groups.

Possible Bias in Automatic Speech Recognition

Some groups that may suffer from bias in Automatic Speech Recognition (ASR) systems are:

  • Gender: Some speech recognition systems may be trained primarily on male voices, leading to poor performance for female voices.
  • Non-native speakers: Systems may not perform as well for people with accents or those who speak languages other than the one the system is trained on.
  • Older adults: As people age, their speech patterns may change, which can affect the performance of a speech recognition system.
  • People with disabilities: Speech recognition systems may not be designed with accessibility in mind, leading to poor performance for people with disabilities such as hearing loss or speech impairments.
  • People with different dialects or sociolects (accent): Systems may perform poorly for people who speak with a dialect or sociolect different from the one the system was trained on.
  • People in different noise environments: systems may perform poorly in noisy settings such as a car or a crowded room.

To diagnose this bias, it is essential to have annotated data that enables the evaluation of ASR systems. For example, the Speech Accent Archive dataset was used in some studies to analyse the performance of the commercial ASR systems provided by Amazon and Google. The results showed that both perform with higher error rates for second-language speakers of English, male speakers, and speakers of some varieties spoken in the North and Northeast of England, compared to native speakers, women, and speakers from the South of England.
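As a rough sketch of how such a group-level diagnosis can be set up, per-utterance error counts (for example, from a WER computation) can be aggregated by demographic group. The record layout, group labels and numbers below are invented purely for illustration:

```python
from collections import defaultdict

def wer_by_group(records):
    """Aggregate word error rate per demographic group.

    records: iterable of (group, n_errors, n_ref_words) per utterance.
    Returns a dict mapping each group to its pooled WER.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [errors, reference words]
    for group, errors, words in records:
        totals[group][0] += errors
        totals[group][1] += words
    return {g: e / w for g, (e, w) in totals.items()}

# Invented evaluation records: (group, errors in hypothesis, words in reference)
records = [
    ("native", 2, 50), ("native", 1, 40),
    ("non-native", 9, 45), ("non-native", 12, 60),
]
print(wer_by_group(records))
```

Pooling errors and reference lengths before dividing (rather than averaging per-utterance WERs) weights each utterance by its length, which is the usual convention.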

Another requirement is a set of evaluation metrics for speech recognition. Some of the most common are:

Character Error Rate (CER) calculates the number of incorrect characters divided by the total number of characters in the text. A lower CER indicates better performance.

CER = (S + D + I) / N

Where S is the number of substitutions (incorrect characters), D the number of deletions (missing characters), I the number of insertions (extra characters), and N the total number of characters in the reference text.
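As an illustration, CER can be computed with a standard Levenshtein (edit-distance) routine over characters. This is a minimal sketch, and the function names are our own:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def cer(reference, hypothesis):
    """CER = (S + D + I) / N, with N the character count of the reference."""
    return edit_distance(reference, hypothesis) / len(reference)

print(cer("hello world", "helo world"))  # one deleted character out of 11
```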

Word Error Rate (WER) calculates the number of incorrect words divided by the total number of words in the text. A lower WER indicates better performance.

WER = (S + D + I) / N

Where S is the number of substitutions (incorrect words), D is the number of deletions (missing words), I is the number of insertions (extra words), and N is the total number of words in the reference text.
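WER follows the same edit-distance idea, applied to word sequences rather than characters. Again a minimal sketch (in practice, an established library implementation may be preferable):

```python
def wer(reference, hypothesis):
    """WER = (S + D + I) / N, with N the word count of the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row Levenshtein distance over word sequences.
    row = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, row[0] = row[0], i
        for j in range(1, len(hyp) + 1):
            cur = row[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            row[j] = min(row[j] + 1,      # deletion
                         row[j - 1] + 1,  # insertion
                         prev + cost)     # substitution
            prev = cur
    return row[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deleted word out of 6
```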

Dialect Density Measure (DDM) evaluates the degree of dialectal variation present in a speech corpus, especially in multilingual or multidialectal approaches. The measure can identify regions or speakers where the system may perform poorly. DDM counts the number of dialect-specific phonemes and compares them to the total number of phonemes in the corpus. A higher dialect density indicates a higher degree of dialectal variation in the corpus. The metric can vary depending on the specific implementation, but a common approach is as follows:

DDM = Nd / N

Where Nd is the number of dialect-specific phonemes in the corpus, and N is the total number of phonemes in the corpus. Other forms of the equation exist depending on the specific implementation.
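Under the simple counting definition above, a DDM computation might look like the sketch below; the phoneme labels and the dialect-specific inventory are invented purely for illustration:

```python
def dialect_density(phonemes, dialect_specific):
    """DDM = Nd / N: share of dialect-specific phonemes in the corpus."""
    nd = sum(1 for p in phonemes if p in dialect_specific)
    return nd / len(phonemes)

# Toy corpus transcribed as a flat phoneme list; the inventory below is
# made up for illustration, not taken from any real dialect description.
corpus = ["dh", "ax", "k", "ae", "t", "ax", "r"]
dialect_specific = {"ax", "r"}
print(dialect_density(corpus, dialect_specific))  # 3 of 7 phonemes
```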

In this blog, we show you some great datasets to consider when analysing bias in your ASR systems. The selected datasets are well documented, support different speech recognition tasks, and annotate important attributes such as accent, gender, age, region, ethnicity and education. Remember, the license is important for your application!

Here are some excellent datasets to consider:

Speech Accent Archive
Author/Supervision Steven H. Weinberger/George Mason University
Description Native and non-native speakers of English. The dataset contains 2,140 speech samples; participants come from 177 countries and have 214 different native languages.
Annotation Accent (117 countries), Gender, Age, Birthplace
Tasks Accent Classification
License Attribution-NonCommercial-ShareAlike 2.0 Generic

ACL Anthology Dataset
Author/Supervision Google Research
Description 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.
Annotation Accent (UK)
Tasks Accent Classification
License Attribution-ShareAlike 4.0 International

Santa Barbara Corpus of Spoken American English
Author/Supervision John W. Du Bois, Wallace L. Chafe, Charles Meyer, Sandra A. Thompson
Description Recordings: 14 recordings as .mp3 files. Transcripts: time-aligned transcripts for all 14 recordings, in the CHAT format. Metadata: a .csv file with demographic information on the speakers and the recordings they appear in (some speakers appear in more than one recording).
Annotation Age, Gender, Origin, Ethnicity, Education
Tasks Speech Recognition
License Attribution-NoDerivs 3.0 United States

OpenSLR (SLR83)
Author/Supervision Işın Demirşahin, Oddur Kjartansson, Alexander Gutkin, Clara Rivera / Google Research
Description Contains a total of 17,877 recordings of six dialects with the associated transcriptions. A total of 120 volunteers were recorded (49 female / 71 male). The dialects represented in the recordings are: Irish English, Midlands English, Northern English, Scottish English, Southern English and Welsh English.
Annotation Accent (6 dialects), Gender
Tasks Accent Classification
License Attribution-ShareAlike 4.0 International

Datatang’s British English Speech Dataset
Author/Supervision Datatang
Description 831 hours of mobile phone conversations of adults across a wide range of ages speaking British English.
Annotation Gender, Age, Noise Environment
Tasks Speech Recognition
License -

Artie Bias Corpus
Author/Supervision Josh Meyer / Artie
Description The Artie Bias Corpus consists of 1,712 audio clips (≈ 2.4 hours) along with their transcripts and demographic data about the speaker.
Annotation Gender, Age, Accent
Tasks Speech Recognition
License Mozilla Public License 2.0

Voice Gender Detection
Author/Supervision Jim Schwoebel/Digital Ocean
Description A cleaned dataset for voice gender detection built from the VoxCeleb dataset (7,000+ unique speakers and utterances; 3,683 male / 2,312 female). VoxCeleb is an audio-visual dataset consisting of short clips of human speech extracted from interview videos uploaded to YouTube.
Annotation Gender [Male, Female]
Tasks Audio Gender Classification
License Creative Commons Attribution 4.0 International /Apache License Version 2.0

AudioMNIST
Author/Supervision Sören Becker / Department of Video Coding & Analytics, Germany
Description The dataset consists of 30,000 audio samples of spoken digits (0-9) from 60 different speakers. Additionally, "audioMNIST_meta.txt" provides meta information such as the gender or age of each speaker.
Annotation Age, Gender, Accent, Origin, Native Speaker [True/False]
Tasks Audio Number Classification
License MIT License



[1] Weinberger, S. (2013). Speech accent archive. George Mason University.

[2] Markl, N. (2022, June). Language variation and algorithmic bias: understanding algorithmic bias in British English automatic speech recognition. In 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 521-534).

[4] Demirsahin, I., Kjartansson, O., Gutkin, A., & Rivera, C. (2020, May). Open-source multi-speaker corpora of the English accents in the British Isles. In Proceedings of the 12th Language Resources and Evaluation Conference.


[6] Meyer, J., Rauchenstein, L., Eisenberg, J. D., & Howell, N. (2020, May). Artie Bias Corpus: An open dataset for detecting demographic bias in speech applications. In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 6462-6468).

