Operationalising Safety in Generative AI: Model Evaluations and Algorithm Audits

February 23, 2024
Authored by
Siddhant Chatterjee
Public Policy Strategist at Holistic AI
Operationalising Safety in Generative AI: Model Evaluations and Algorithm Audits

The explosion of generative AI models and their widespread applications has highlighted a vital need for organizations developing and deploying the highly complex models to ensure their integrity, safety, security, and reliability. In general, two crucial processes contribute to this assurance: model evaluations and algorithm audits. While both aim to assess and enhance the trustworthiness of AI systems, they operate distinctively and each serves a unique purpose in the journey toward responsible AI deployment.

This blog post provides an overview of model evaluations and algorithm audits, and how they should be jointly leveraged to ensure the responsible, safe, and ethical deployment of powerful generative models.

Key Takeaways:

  1. Model evaluations are comprehensive assessments aiming to measure the effectiveness of algorithmic systems across various parameters such as baseline performance levels, model modalities, and risks like bias, mis/disinformation, and toxicity. These include interventions like benchmarking (which involves comparing a model’s performance against a predefined set of tasks or requestions, and adversarial training, which involves stress-testing a model to unearth unknown risks and vulnerabilities.
  1. While Model Evaluations are useful and deployed widely, they lack external validity and may unintentionally ignore implicit biases during model development. This establishes the need to complement evaluations with independent algorithm audits.
  1. Algorithm audits are third-party assessments by impartial auditors, assessing reliability, identifying risk, analyzing mitigation strategies, and ensuring regulatory compliance. They are holistic, consisting of Governance Audits (that examine organizational procedures in model development) and Outcome Audits, that help evaluate risk detection and mitigation capabilities across robustness, efficacy, bias, explainability, and privacy-related parameters.
  1. Model evaluations and algorithm audits are two sides of the same coin and should be jointly leveraged to build the evidence base of a model’s safety and risk mitigation capabilities. They help identify risks, establish red lines to prohibit the development of dangerous model capabilities, and ensure protective measures are in place for safe and ethical AI deployment.

What are Model Evaluations?

Model evaluations (or evals) are essentially a set of socio-technical interventions that investigate, assess, and determine the effectiveness of an algorithmic system's broader capabilities. This covers a diverse set of activities, including evaluating a model’s efficacy across baseline performance levels (assessing their reading comprehension, common-sense reasoning, and mathematical capabilities), model modalities (text, image, audio, video, and emerging multimodal configurations), and model risks (such as cybersecurity, mis/disinformation, stereotyping and biased outputs, and toxicity in generated text), among others.

The most popular types of interventions used in model evaluation exercises are Benchmarking and Adversarial Testing (or Red-Teaming).

Benchmarking LLMs

Benchmarking, a success-testing intervention, helps assess an AI system's performance against a predefined task by mapping an AI system’s outputs to a dataset of prompts and responses. An examination or test of sorts, these can help gauge a model’s capabilities across general performance considerations, such as common-sense reasoning (eg: HellaSwag) and math capabilities (MATH), in mitigating risks like bias (e.g., WinoBias and BOLD) and model-generated risks like toxicity (e.g., ToxiGen) across text, audio, image and other multimodal interfaces. Benchmarking can be integrated into a model's development process to continuously enhance its efficacy through iterative improvements, a process known as hill-climbing. Additionally, benchmarking aids in comparing a model's efficacy with that of competing models.

While a potent tool for measuring and improving a model’s baseline levels of efficacy, benchmarking is only useful in cases where the model risks are known. However, this is often not the case - new and unknown model risks can emerge frequently. Consequently, the current state of benchmarking techniques are not adequate to detect future vectors of harm, and large generative models sometimes show a tendency to conceal risky and deceptive behaviors. In such cases, adversarial testing, or red teaming emerges as an effective intervention.

Red teaming LLMS

Drawing from cybersecurity risk management principles, adversarial testing or red teaming helps uncover vulnerabilities and latent risks inherent in deploying a generative model. Here, the model is subjected to various stress-testing techniques, including curated prompt attacks, training data extraction, backdooring models, adversarial prompting, data poisoning, and exfiltration mechanisms, through which systemic vulnerabilities can be unearthed.

A socio-technical measure, red-teaming can (and should) be participatory exercises leveraging the collective intelligence of experts across domains. For example, adversarial testing  for mis/disinformation can involve convening practitioners from the medical and public health domain, journalism, media studies, sociology, behavioral science, computer science, machine learning, safety policy, and hate-speech, among others. This not only fosters inclusivity but also enriches the evaluation process by considering a broader spectrum of potential risks and vulnerabilities.

The need to audit generative models

While useful interventions to assess and monitor the safety and efficacy levels of a system, solely relying on model evaluations may lead to instances wherein unknown biases remain ignored, external oversight mechanisms are absent, and appropriate levels of transparency, explainability, and accountability are not provided. Furthermore, when a generative model is fine-tuned and appropriated for a dedicated downstream use-case, its risk profile must be investigated in the context that it is operating in through third-party interventions. Considering the profound societal and ethical impacts generative models can be associated with, it becomes imperative to complement model evaluations with independent algorithm audits.

What are algorithm audits?

Algorithm Audits are socio-technical mechanisms of independent and impartial system evaluation, whereby an auditor with no conflict of interest, can assess an AI system’s reliability, detect unidentified errors, discrepancies, system deficits, and vulnerabilities, and offer recommendations on the same. In addition to providing a defensible and robust mechanism to verify a model’s safety, algorithm audits also differ from model evaluations, in that they serve as potent signals of regulatory compliance.

How can generative models be audited?

Algorithm audits for generative models generally encompass two critical dimensions: Governance Audits and Outcome Audits. Governance Audits scrutinize the organizational procedures, accountability structures, oversight mechanisms, and quality management systems of AI developers, ensuring that robust processes are in place to govern AI development and deployment. On the other hand, Outcome Audits evaluate the effectiveness, robustness, biases, explainability, and privacy implications of the results produced by AI systems. Essentially, algorithm audits involve thoroughly examining not only the code, but also the processes and systems set in place during a generative model’s development.

As generative models are also fine-tuned for specific use-cases, a third type of audit has surfaced – research has indicated the need to institutionalize application audits, which involve a down-streamed model being continuously monitored and audited for context-specific risks and harms.

Leveraging Model Evaluations and Algorithm Audits for Responsible, Safe and Ethical AI Usage

Model evaluations and Algorithm Audits are different sides of the same coin; they should be leveraged simultaneously to build and establish a robust evidence base to validate a model's safety and risk management capabilities. As this evidence base expands, so should the model's interaction with the external world. This progression is gradual and iterative, requiring model audits to be implemented at different stages of development, and scaling up access for external researchers to interrogate the model through interventions like adversarial testing. Additionally, model evaluations and audits can identify red lines and dangerous zones where development should halt until adequate protective measures are enhanced and implemented for production.

Auditing Large Language Models with Holistic AI

Concerted regulatory momentum to legislate generative models, and specifically Large Language Models (LLMs) is accelerating across the world – and companies desirous of developing and deploying such models must proactively ensure they fulfill the increasing list of obligations.

Holistic AI takes a comprehensive, interdisciplinary approach to responsible AI. We combine technical expertise with ethical analyses to assess systems from multiple angles.

Safeguard, Holistic AI's LLM Auditing product, acts as a robust solution to identify and address these issues through a multifaceted approach:

  • Blocking Serious Risks and Safety Issues: Can be used to prevent the inadvertent leakage of personal information, enabling organizations to leverage their full data set while ensuring data privacy and protecting their brand reputation.
  • Detecting Hallucinations and Stereotypes: It detects and rectifies incorrect responses, hallucinations, and the perpetuation of stereotypes in generated text.
  • Preventing Offensive Language and Toxicity: It proactively averts the use of offensive language, counters malicious prompts, and minimizes toxicity in generated text.
  • Providing Readability Scores: It assesses and offers readability scores for generated text, ensuring that the output is both comprehensible and suitable for its intended audience.

To find out more about how Holistic AI can help you, schedule a call with our expert team.

DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.

Subscriber to our Newsletter
Join our mailing list to receive the latest news and updates.
We’re committed to your privacy. Holistic AI uses this information to contact you about relevant information, news, and services. You may unsubscribe at anytime. Privacy Policy.

Discover how we can help your company

Schedule a call with one of our experts

Schedule a call