Artificial intelligence is increasingly being applied in critical applications such as healthcare, recruitment, and even self-driving cars. Given that the decisions and recommendations made by these systems can have a significant impact on an individual’s life — recruitment algorithms on someone’s career, healthcare algorithms on diagnosis and treatment, and self-driving cars on the safety of the driver, passengers, and others — it is more important than ever that the algorithms behind these systems are efficacious. In other words, algorithms must do what they are designed to do to an acceptable level to avoid harm, whether internal or external. Indeed, when a process is fully automated, the system’s failure can be catastrophic, as was seen with the failure of Knight Capital’s trading algorithm, which cost the company over $440 million.

Typically, the ground truth or actual values are compared to the values predicted by the model. For example, in the case of recruitment, this could be a hiring manager’s judgements on the hireability of a candidate. The way that efficacy is measured depends on the type of system and its output, with different approaches being suitable for regression and classification systems. In this blog post, we give an overview of some methods for measuring algorithm efficacy.

Classification systems typically result in a binary output, often assigning outcomes to a positive or negative condition. In the case of an AI tool used to detect cancer from scans, the output of the system would either be cancer being present (1 – positive condition) or absent (0 – negative condition). In the recruitment context, a CV scanning tool might allocate applicants to conditions based on whether or not they meet the required qualifications in the job description (1 – qualified, 0 – unqualified).

With classification systems, many metrics for measuring accuracy rely on true and false positives and negatives:

- **True positive** – the system correctly assigns the decision to the positive condition, e.g., cancer is correctly identified on a scan.
- **False positive** – the system incorrectly assigns the decision to the positive condition, e.g., cancer is incorrectly flagged on a cancer-free scan.
- **True negative** – the system correctly assigns the decision to the negative condition, e.g., a cancer-free scan is correctly identified as such.
- **False negative** – the system incorrectly assigns the decision to the negative condition, e.g., the system misses a cancerous mass in a scan.

These are often represented using a confusion matrix. Let's consider that a medical scanning tool has made predictions on an already annotated dataset of 140 individuals. The annotations made by medical experts are represented by the columns, and the predictions are represented by the rows.
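The matrix itself appeared as a figure. As a stand-in, the sketch below uses hypothetical counts — not the original figure's values — chosen only so that they total 140 and are consistent with the precision (.86) and recall (.75) worked through later in the post:

```python
# Hypothetical confusion-matrix counts for 140 scans (illustrative only;
# rows = model predictions, columns = expert annotations).
tp = 60  # cancer present, correctly flagged
fp = 10  # cancer-free, incorrectly flagged
fn = 20  # cancer present, missed
tn = 50  # cancer-free, correctly cleared

confusion_matrix = [[tp, fp],   # predicted positive
                    [fn, tn]]   # predicted negative

total = tp + fp + fn + tn
print(total)  # 140
```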

From this, a number of metrics can be calculated:

Aside from true and false positive and negative rates, confusion matrices can also be used to calculate more specific metrics. For example, accuracy measures the proportion of correct predictions across the positive and negative classifications. It is calculated using the following equation, which results in values ranging from 0 to 1, where the greater the value, the more accurate the system:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Using the confusion matrix above, the accuracy of this model can be calculated by dividing the number of correct predictions by the total number of cases.
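As a minimal sketch, accuracy can be computed from hypothetical counts (TP=60, TN=50, FP=10, FN=20; illustrative values consistent with the precision and recall figures in the text):

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
tp, tn, fp, fn = 60, 50, 10, 20
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 2))  # 0.79
```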

Precision measures the proportion of positive classifications predicted by the model that were correct. As with accuracy, precision is measured on a range of 0 to 1, where higher values indicate a more precise model, and is calculated using the following equation:

Precision = TP / (TP + FP)

Using the confusion matrix above, the precision of this model is .86. A value of .86 is on the higher end of the scale, meaning the model has a good level of precision.
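This works out as follows, again using the hypothetical counts TP=60, FP=10 (illustrative values consistent with the figures in the text):

```python
# Precision = TP / (TP + FP)
tp, fp = 60, 10
precision = tp / (tp + fp)
print(round(precision, 2))  # 0.86
```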

Recall measures the proportion of actual positives that were correctly identified. Again on a scale of 0 to 1, it is calculated using the following equation:

Recall = TP / (TP + FN)

In the example above, recall is .75. This is lower than the value for precision, so the model performs less well according to this metric. However, a value of .75 still indicates that the model performs well.
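Using the same hypothetical counts (TP=60, FN=20; illustrative values consistent with the figures in the text), recall can be computed as:

```python
# Recall = TP / (TP + FN)
tp, fn = 60, 20
recall = tp / (tp + fn)
print(recall)  # 0.75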

Precision and recall are typically in competition, with an increase in precision usually decreasing recall. To get a better understanding of the performance of the model, the two metrics can be combined into an F1 score, their harmonic mean, which balances the two. It is calculated by:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

In the example above, the F1 score is .80. Like precision and recall, F1 scores range between 0 and 1, meaning that a score of .80 indicates that the model correctly predicts outcomes more often than not, although not all predictions will be correct.
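Combining the two metrics under the same hypothetical counts (TP=60, FP=10, FN=20; illustrative values consistent with the figures in the text):

```python
# F1 = 2 * (precision * recall) / (precision + recall)
tp, fp, fn = 60, 10, 20
precision = tp / (tp + fp)  # ≈ 0.86
recall = tp / (tp + fn)     # 0.75
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 2))  # 0.8
```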

A receiver operating characteristic (ROC) curve plots the performance of a classification system at different thresholds by plotting the true positive rate against the false positive rate. The area under the curve (AUC) summarises the performance of the model across classification thresholds and can be interpreted as the probability that a randomly chosen positive example will be ranked more highly than a randomly chosen negative example. An AUC of 0 represents the model getting all classifications incorrect, a score of .5 represents a 50% chance of getting the classification correct, and a score of 1 indicates that the model is always correct.
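The rank interpretation of AUC can be computed directly by comparing every positive–negative pair of scores. A small sketch with illustrative scores and labels (in practice a library routine such as scikit-learn's `roc_auc_score` would be used):

```python
# AUC as the probability that a randomly chosen positive example scores
# higher than a randomly chosen negative example (ties count as 0.5).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1,   1,   1,    0,   0,   0]   # 1 = positive, 0 = negative

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)
```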

For regression systems, such as those used in recruitment to predict personality scores or cognitive ability, scores are continuous, so it is not as easy to identify whether the algorithm has produced a correct label, and true and false positives cannot be used to measure accuracy. Instead, the outputs of the model must be compared with the ground truth in terms of how close the values are.

One way to do this is to correlate the outputs of the model with the ground truth scores used to train it. Both the correlation coefficient and the significance value are useful to consider: the higher the correlation coefficient, the stronger the relationship and therefore the more accurate and efficacious the model, while a significant correlation coefficient indicates that the relationship is unlikely to have occurred by chance.
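A minimal sketch of the correlation approach, computing Pearson's r by hand over illustrative ground-truth and predicted scores (in practice, `scipy.stats.pearsonr` would also return the p-value for significance):

```python
import math

# Illustrative ground-truth scores and model predictions.
truth = [3.1, 4.0, 2.5, 3.8, 4.5, 2.9]
preds = [3.0, 4.2, 2.7, 3.5, 4.4, 3.1]

n = len(truth)
mx, my = sum(truth) / n, sum(preds) / n
cov = sum((x - mx) * (y - my) for x, y in zip(truth, preds))
r = cov / math.sqrt(sum((x - mx) ** 2 for x in truth)
                    * sum((y - my) ** 2 for y in preds))
print(round(r, 3))  # strong positive correlation
```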

Another way to assess the accuracy of regression models is to look at the model's error. Regression systems aim to create an equation representing the relationship between two or more variables that can be used to make predictions; the error of the regression line can therefore be measured by comparing the predicted values with the observed, or actual, values, as shown in the figure below.

One way to quantify this difference is the root mean square error (RMSE), the standard deviation of the prediction errors, calculated using the following equation:

RMSE = √( Σᵢ (𝑥ᵢ − 𝑥̂ᵢ)² / 𝑁 )

Where 𝑁 is the number of non-missing data points, 𝑥ᵢ is the actual value for observation i, and 𝑥̂ᵢ is the corresponding predicted value. The larger the RMSE, the less well the line fits the data, and therefore the less accurate the model is for this data.
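RMSE can be computed directly from this definition; the actual and predicted values below are illustrative:

```python
import math

# RMSE: square each prediction error, average, then take the square root.
actual    = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 11.0, 9.5, 14.0]

rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                 / len(actual))
print(round(rmse, 3))  # 0.901
```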

The above is a non-exhaustive overview of some different metrics that can be used to measure the efficacy of a model. The metric used is dependent on the type of model and the context it is being used in. Nevertheless, it is important that efficacy is measured on at least one dimension to maximize the utility of the model and minimize potential risks associated with poor performance or failure.

Holistic AI’s open-source library has built-in metrics for measuring the performance of models, including accuracy, precision, recall, and F1 scores — which can be calculated simultaneously using only one line of code. Check it out!

Authored by Airlie Hilliard, Senior Researcher at Holistic AI

Data Science

March 28, 2023

**DISCLAIMER:** *This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.*
