Why it’s so easy to jailbreak AI chatbots, and how to fix them

May 15, 2025
Princeton engineers uncovered a universal weakness in AI chatbots that makes it possible to bypass their safety guardrails and unlock malicious uses with just a few lines of code. Illustration by Alaina O’Regan

By Alaina O'Regan

Princeton engineers have identified a universal weakness in AI chatbots that allows users to bypass safety guardrails and elicit directions for malicious uses, from creating nerve gas to hacking government databases.

“The safety mechanisms that should prevent harm are very fragile,” said Prateek Mittal, professor of electrical and computer engineering and co-principal investigator on the paper. “We were able to connect a range of possible attack strategies to one underlying concept.”

The issue stems from the fact that a chatbot’s built-in safety mechanisms concentrate on only the first few words of a response. A simple bit of code that forces the chatbot to start its response with “Sure, let me help you,” for example, can steer it into complying with harmful or restricted requests.
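
The mechanics are simple enough to sketch. The snippet below is an illustration only, not code from the paper: it assumes a small open instruction-tuned model from Hugging Face (the model name and the placeholder request are stand-ins), and it shows how appending a compliant opening to the assistant’s turn, a technique often called prefilling, changes where generation starts.

```python
# A minimal sketch of the "forced opening" mechanic described above.
# Model name and request are placeholders; this is not code from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in: any small open chat model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

messages = [{"role": "user", "content": "<a request the model would normally refuse>"}]

# Normal decoding: the chat template ends right before the assistant's turn,
# so the model is free to open with a refusal like "Sorry, I can't help with that."
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Prefilled decoding: a compliant opening is appended to the assistant's turn,
# so generation continues from "Sure, let me help you" rather than a refusal.
prefilled = prompt + "Sure, let me help you"

for text in (prompt, prefilled):
    ids = tok(text, return_tensors="pt", add_special_tokens=False).input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```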

Mittal said that a person without any prior coding knowledge can find templates online for jailbreaking chatbots in this manner.

“Large language models are revolutionizing society. Their capabilities have been growing at an astounding pace,” Mittal said. “And in the absence of safety alignment, they can be induced to cause harm to society.”

A new paper illustrates how this one underlying weakness opens the door to a number of different attacks that can bypass AI safety mechanisms. The researchers call the issue “shallow safety alignment” and suggest strategies to deepen defenses. They presented their findings at the International Conference on Learning Representations (ICLR) in April and won an Outstanding Paper Award.

Before AI chatbots are released to the public, they learn to refuse malicious prompts through a training process called safety alignment. In theory, these chatbots won’t provide step-by-step instructions for sourcing and distributing anthrax, fabricating untraceable firearms or hacking government databases. But in recent years, a number of attacks have been identified that can easily jailbreak AI models and compromise their safety training.

“Our work shows that there’s a fundamental reason for why this is so easy to do,” said Peter Henderson, assistant professor of computer science and international affairs and co-principal investigator.

During safety training, AI models learn to respond to potentially harmful requests with something like “Sorry, I can’t help with that.” These first few tokens, roughly equivalent to a few words, set the tone for the entire response. As long as the chatbot starts with a refusal, it will naturally keep refusing to comply, so the guardrails aren’t treated as necessary beyond the initial part of the response.

“The first few tokens are very important,” said Xiangyu Qi, graduate alumnus of electrical and computer engineering and the paper’s lead author. “If the model’s initial response is bad, then the whole answer will be bad. If the first few tokens are safe, then it will be safe.”
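
One rough way to see this effect is to look at a model’s next-token probabilities after different forced openings. The sketch below is illustrative rather than a reproduction of the paper’s analysis; the model name and placeholder request are assumptions.

```python
# Compare what the model wants to say next after a refusal-style opening
# versus a compliance-style opening. Model name and request are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in: any small open chat model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def top_next_tokens(context: str, k: int = 5):
    """Return the k most likely next tokens given the text generated so far."""
    ids = tok(context, return_tensors="pt", add_special_tokens=False).input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tok.decode(i), round(p.item(), 3)) for i, p in zip(top.indices, top.values)]

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "<a request the model would normally refuse>"}],
    tokenize=False, add_generation_prompt=True,
)

# Once the reply already opens with a refusal, the distribution tends to stay
# in "refusal mode"; after a forced compliant opening, it usually does not.
print(top_next_tokens(prompt + "Sorry, I"))
print(top_next_tokens(prompt + "Sure, let me"))
```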

That reliance on the first few tokens is exactly what makes jailbreaking tactics so powerful. The researchers demonstrated that applying safety constraints to more tokens throughout a chatbot’s response, an approach they call deep safety alignment, makes it possible for the chatbot to recover from a mistake. “We want to teach the model a notion of safety that goes beyond the first few tokens,” said Mittal.

With constraints extending deeper than the first few tokens, even if the chatbot starts to cooperate with a harmful request, it’s possible to catch the slip-up and recover from the error. “The strategy we propose is basically saying that even if the model’s initial safety measure failed, it should still learn to recover from its failure,” said Qi, who now works on AI safety at OpenAI. This deep safety alignment can be accomplished during initial safety training using carefully crafted data sets.
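
Such a data set can be sketched in miniature. Below is a simplified, hypothetical construction of a single “recovery” training example, not the paper’s exact recipe: the target response opens with a compliant-sounding prefix and then pivots back to a refusal, so a model fine-tuned on examples like this learns to bail out even after a bad start. All field names and strings are illustrative.

```python
# Build one illustrative "recovery" training example: the assistant response
# starts as if complying, then refuses. Strings and format are hypothetical.
import json
import random

# Compliant-sounding openings a jailbroken reply might start with.
BAD_OPENINGS = [
    "Sure, let me help you. Step 1:",
    "Of course. Here is how you would",
    "Absolutely, the first thing to do is",
]

RECOVERY = (
    " ... Actually, I need to stop here. I can't help with this request, "
    "because it could cause real-world harm."
)

def make_recovery_example(harmful_request: str) -> dict:
    """Pair a request that should be refused with a response that recovers mid-stream."""
    opening = random.choice(BAD_OPENINGS)
    return {
        "messages": [
            {"role": "user", "content": harmful_request},
            # Supervision target: a compliant-looking prefix followed by a refusal,
            # so safe behavior is learned deeper than the first few tokens.
            {"role": "assistant", "content": opening + RECOVERY},
        ]
    }

print(json.dumps(make_recovery_example("<a request the model should refuse>"), indent=2))
```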

But for AI models, there is a trade-off between safety and utility, which complicates deploying deep safety alignment. Constraining safety at only the first few tokens preserves the flexibility that makes AI chatbots widely useful, and tightening the constraints would sacrifice some of their capabilities. “It’s a careful balancing act,” said Henderson. “But if you want more depth in defense, you need to do something like building deeper safety alignment into the model.”

The researchers verified that deep safety alignment is effective against a number of attacks, several of which Mittal, Henderson and Qi have helped uncover. In 2023, they identified an attack that exploits a feature that lets users fine-tune AI models on custom data sets.

The new paper shows that shallow safety alignment is at the heart of that issue, as well as others. “After identification of some of these attacks, we are now in a position to find common patterns across these attacks,” Mittal said, “and to develop some understanding, and hopefully robust defenses in the future.”

Mittal said that while deep safety alignment is a necessary component for more robust defenses, it’s not sufficient on its own. “More work must be done to build upon it,” he said. “By understanding and going deeper into this concept, we may uncover the pathway to fixing these issues.”

The paper “Safety Alignment Should Be Made More Than Just A Few Tokens Deep” was presented in April at the International Conference on Learning Representations. Besides Henderson, Mittal and Qi, authors include Ahmad Beirami, Xiao Ma, Kaifeng Lyu, Ashwinee Panda and Subhrajit Roy. The work was funded in part by the Princeton Language and Intelligence Compute Cluster and Center for AI Safety Compute Cluster, the Princeton SEAS Innovation Grant, a Superalignment Fast Grant from OpenAI, and the Foundational Research Grants program at Georgetown University’s Center for Security and Emerging Technology.