AI jailbreaking is the practice of manipulating an AI model to bypass its built-in safety restrictions and generate content it was designed to block. Most modern models, including large language models (LLMs) and chatbots, are trained with safety guardrails that prevent them from producing harmful, illegal, or restricted outputs. Jailbreaking is what happens when someone finds a way around those guardrails.
Jailbreak techniques are widely shared online and constantly evolving. If your organization deploys an AI model that interacts with users, this is a real and ongoing risk.
Jailbreaking uses several techniques to override a model's safety training:
A successful jailbreak turns your AI system into a liability. It can lead to generation of harmful or illegal content through your product, extraction of system prompts or sensitive information, automated phishing and social engineering at scale, and reputational and regulatory exposure for your organization.
Both try to make a model do something it should not. The difference is the source. Jailbreaking is the user directly attacking the model. Prompt injection is an attack hidden in content the model processes, like a document, email, or database entry, where the user may not even be the attacker. We test for both because they target different parts of the system.
Our AI Governance platform tests AI models for jailbreak resistance as part of our red teaming and safety evaluation suite. We run adversarial prompts sourced from our proprietary datasets and leading AI security research, covering all major jailbreaking techniques including DAN, STAN, DUDE, role-play exploits, encoding tricks, and multi-step approaches.
Every response is classified as safe or unsafe using automated evaluation followed by human expert review. Results are scored using the Defense Success Rate (DSR) and broken down by technique, so you can see exactly which methods your model resists and which ones get through.
We have conducted these evaluations on many of the leading AI models on the market and publish the results to help organizations make informed model selection decisions. Published audit results are also available on our LLM Decision Hub for side-by-side model comparison.
If you want to know more about how we test for jailbreak resistance on your AI systems, get a demo now.