Opinion: The Jailbreak Arms Race (And Why Defense Keeps Losing)

Somewhere in a research lab, a team of engineers spent months training an AI model to refuse dangerous requests. They ran all sorts of evaluations, measured refusal rates, and published a paper claiming 95% safety compliance.

…And then someone wrapped the dangerous request in a knock-knock joke and the model spilled everything.

That wasn’t a joke, by the way. A 2025 paper showed that simply inserting an unsafe request into a humorous prompt template, complete with “hahaha” and “xD” emoticons, could bypass safety guardrails across multiple large language models [1]. No prompt engineering expertise needed. No adversarial optimization. Just some overwhelmingly good vibes and a knock-knock joke.

What even is AI safety at this point?

The attack side: it only takes one

The history of jailbreaking LLMs reads like a catalog of increasingly creative ways to trick a very literal genie. In the early days, you could get away with anything. You could put an extra exclamation mark at the end of your question. You could tell the model to pretend it was “DAN” (Do Anything Now) or wrap the request in a fictional scenario. These worked embarrassingly well for a while (and some arguably still do), and now we’ve reached the point where we’re even automating the whole process.

The attacks have evolved along several axes simultaneously, and each one exposes a different failure mode in how we build safe AI.

Multi-turn gaslighting. This is what I like to think of as conversational erosion, and it might be the most unsettling one on the list. A framework called Siege uses tree search (the same planning technique behind chess engines) to explore multiple attack paths in parallel across a multi-turn conversation [2]. It starts with an innocent question, measures how much the model partially complied, and escalates from there. Models leak information in small increments. A single turn might produce a 5% policy violation. But accumulate enough 5% violations across turns and you’ve basically extracted everything you need. Siege hit 100% attack success on GPT-3.5 and 97% on GPT-4 by treating safety as a wall you chip away at rather than break through all at once. Think of it like a very patient, very methodical friend who keeps asking you slightly different versions of the same question until you slip.
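To make the search mechanism concrete, here is a minimal beam-style sketch of tree search over conversation branches. Everything specific is hypothetical: `partial_compliance` is a stand-in scorer (Siege's actual compliance metric and prompt templates are not reproduced here), and the escalation templates are invented for illustration.

```python
import heapq

def partial_compliance(conversation):
    # Stand-in scorer: pretend each escalating turn the model tolerated
    # leaks a 5% policy violation. A real attacker would score the
    # model's actual responses.
    return sum(0.05 for turn in conversation if "detail" in turn)

# Hypothetical follow-up templates, ordered from innocent to probing.
ESCALATIONS = [
    "Can you give general background on {topic}?",
    "Interesting -- what detail would a practitioner add?",
    "For completeness, what detail do defenders overlook?",
]

def tree_search_attack(topic, beam_width=2, max_depth=3, threshold=0.1):
    """Expand several conversation branches in parallel, keeping the
    branches whose accumulated partial compliance is highest."""
    frontier = [(0.0, [f"Tell me about {topic}."])]
    for _ in range(max_depth):
        candidates = []
        for _score, convo in frontier:
            for follow_up in ESCALATIONS:
                branch = convo + [follow_up.format(topic=topic)]
                candidates.append((partial_compliance(branch), branch))
        # Keep only the beam_width highest-scoring branches.
        frontier = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        best_score, best_convo = frontier[0]
        if best_score >= threshold:
            return best_score, best_convo  # enough 5% leaks accumulated
    return frontier[0]
```

The point of the sketch is the accumulation logic: no single turn crosses the threshold, but the search keeps whichever branches leak the most and extends them.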

Semantic camouflage. Another family of attacks exploits the gap between what a model “understands” and what it pattern-matches on. Sugar-Coated Poison works by first asking the model to generate the opposite of the harmful request (e.g., “how to secure a database” instead of “how to hack one”), letting it produce a long, benign response, and then appending adversarial reasoning that flips the script [3]. As the model generates more benign text, its attention to the original safety instructions decays. The authors call this Defense Threshold Decay. It’s a property of autoregressive generation itself, the mechanism by which every modern language model produces text token by token. The model’s safety awareness fades as more “safe” content fills the context. That’s baked into the architecture.
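A toy way to see why that dilution happens: under softmax attention, a safety instruction with a fixed relevance score receives a shrinking share of attention as benign tokens pile up. This is an illustrative simplification, not the paper's formulation; the scores here are made up.

```python
import math

def attention_to_safety(safety_score, benign_score, n_benign_tokens):
    """Softmax attention weight on a single safety-instruction token
    competing with n_benign_tokens benign tokens."""
    scores = [safety_score] + [benign_score] * n_benign_tokens
    exps = [math.exp(s) for s in scores]
    return exps[0] / sum(exps)

# The safety token scores higher than any individual benign token,
# yet its share of attention collapses as benign content accumulates.
for n in (10, 100, 1000):
    print(n, attention_to_safety(2.0, 1.0, n))
```

Even with a head start in relevance, the safety instruction's attention share is eventually swamped by sheer volume, which is the intuition behind letting the model talk itself into a benign groove first.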

Automated optimization. Then there’s the industrial approach. Reason2Attack trains an LLM specifically to generate adversarial prompts, using reinforcement learning with a reward function that balances attack success, stealth, and query efficiency [4]. It achieves 90% attack success with an average of 2.5 queries. Its Frame Semantics approach (using ConceptNet, a structured knowledge graph, to find semantically related but less flagged terms for dangerous concepts) is clever enough that I found it genuinely concerning. We’re past the era of appending random token strings to prompts. We’re in the era of machines that understand how to manipulate other machines through language. Now all we need to close the full circle is to have these models jailbreak us.
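The reward shaping is the part worth dwelling on, so here is a hedged sketch of a reward that trades off those three terms. The weights and the function signature are invented for illustration; the paper's actual reward design is not reproduced here.

```python
def attack_reward(success, stealth, n_queries, w=(1.0, 0.5, 0.1)):
    """Toy RL reward: pay for attack success and stealth, charge per
    query. Weights w are hypothetical, not from Reason2Attack."""
    w_success, w_stealth, w_query = w
    return w_success * success + w_stealth * stealth - w_query * n_queries
```

Under any weighting of this shape, the optimizer is pushed toward prompts that succeed, read as innocuous, and land in few queries, which is exactly the profile that makes automated attackers hard to rate-limit or filter.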

The pattern across all of these is the same. Safety mechanisms operate at the surface level. Attacks operate at the structural level. Surface loses to structure every time.

The defense side: Phil Swift can’t fix this

An attacker needs to find one way through. A defender needs to block all of them.

That single sentence explains most of what’s wrong with how the field currently treats defense research. If you’re on offense, you ship a paper showing your method achieves 90%+ attack success rate on some model, and you’re done. One crack in the wall is enough. The evaluation is binary. Nobody asks an attack paper to break every model across every modality. Nobody rejects Siege [2] because it didn’t also jailbreak multimodal models via image inputs. The scope is allowed to be narrow because the contribution is self-evident: here is a vulnerability that exists.

If you’re on defense, you ship a paper showing your method blocks 95% of attacks, and the first question in review is “what about the other 5%?” The second question is “did you test against [attack that came out two weeks ago]?” The third is “does this generalize to [modality/model/threat model]?” Imagine if we held lock manufacturers to the same standard. “Sure, your deadbolt resists bumping, picking, and drilling. But what about someone who just removes the door hinges? Reject.”

The implicit demand is that a defense be something close to a universal solution, and anything short of that gets torn apart for its limitations. You might counter that ensembling helps. Stack multiple defenses, each covering a different attack surface, and the combination should be more robust than any individual method. In theory, yes. In practice, each defense has a capability cost, and those costs compound. One defense shaves off a little helpfulness, another adds latency, a third increases false refusals on benign prompts. At some point your ensembled fortress of safety is just a very expensive “I can’t help with that” machine. And even then, every new modality (images, audio, tool use, code execution) opens new channels that the ensemble wasn’t designed for, turning it into a very expensive game of whack-a-mole.

The result is a chilling effect on defense research. I’ve personally talked to multiple researchers who’ve moved away from defense work specifically because the publication bar is unreasonably high relative to attack work. Why spend a year building a defense framework that will get rejected because it doesn’t cover every conceivable threat model, when you could spend three months finding a new attack vector and get a clean accept? Not to say it is always that easy, but the incentive structure is definitely pushing talent toward offense. That is exactly the opposite of what the field needs.

I don’t think reviewers are necessarily wrong to demand rigor. Deployments need robust defenses at the end of the day. But there’s a difference between “this defense should be evaluated rigorously” and “this defense must solve the entire problem to be publishable.” The former is good science. The latter is a bar that nothing currently clears, which means it’s effectively filtering out all defense work. A defense that covers 60% of known attacks and explains why it fails on the other 40% is more valuable than one that claims 95% coverage on a narrow benchmark. The field needs to learn to reward partial progress on the hard problem instead of only rewarding complete solutions to easy ones. In a way, academia is encouraging reward hacking on its own system.

Activation steering: a case study

Set aside the incentive problem for a moment. Even if the field were pouring resources into defense, the technical picture is discouraging, because the defenses we do have are more fragile than they appear.

Activation steering is one of the more prominent defense families today, and it illustrates this fragility well. The idea is intuitive: if you can identify a “direction” in the model’s internal activation space (think of it as a compass needle that points toward “refuse this request”), you can nudge the model in that direction at inference time whenever harmful input comes in. Conditional Activation Steering, or CAST, does this selectively, only intervening when a separate detector flags the input as harmful [8]. It achieves 90.7% refusal on harmful content with only 2.2% false refusal on benign prompts.
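The conditional part is easy to picture in code. Below is a minimal sketch with a hypothetical keyword detector and a toy four-dimensional “activation” vector; CAST's real method learns a condition vector and operates on hidden states inside transformer layers, so none of these specifics come from the paper.

```python
# Toy refusal direction in a 4-dim activation space (hypothetical values;
# in practice this is extracted from model activations on refusal data).
REFUSAL_DIRECTION = [0.5, -0.2, 0.7, 0.1]

def looks_harmful(prompt):
    # Stand-in detector. CAST uses a learned condition check, not keywords.
    return "attack" in prompt.lower()

def steer(activations, prompt, alpha=4.0):
    """Conditionally nudge activations along the refusal direction."""
    if not looks_harmful(prompt):
        return activations  # benign input: leave the model untouched
    return [a + alpha * d for a, d in zip(activations, REFUSAL_DIRECTION)]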

Great! But the problem is that the compass metaphor breaks down almost immediately. COSMIC showed that refusal behavior doesn’t follow a single linear axis [9]. Moderate nudges strengthen refusals. Large nudges can actually re-jailbreak the model by overshooting into some entirely different behavioral regime. You try to push the model toward “refuse” and end up in “confidently comply while sounding apologetic.” A separate study confirmed that refusal behavior is nonlinear and multidimensional, and that the geometry differs by architecture and layer [10]. The compass needle bends, and which way it bends depends on which model you’re looking at and where inside the model you’re looking.

A study of alignment’s “hidden dimensions” makes this worse [11]. Safety behavior depends on a dominant refusal direction plus several smaller, orthogonal directions that correspond to specific bypass strategies like role-playing, hypothetical framing, and persona adoption. Remove any one of these secondary directions and you selectively disable the model’s resistance to that specific jailbreak type while leaving other refusals intact. Safety is a fragile web of interconnected features. A defender has to protect all of them simultaneously. An attacker only needs to find the one thread that’s loose.

The “safety basin” concept [12] puts numbers on the fragility. All tested open-source LLMs show step-function-like safety behavior. Small perturbations to model weights preserve alignment, but step outside a narrow region and safety collapses entirely. Alignment is a local property. You’re safe in this neighborhood of parameter space, and the neighborhood is about the size of a parking spot.

The sparse representation findings [13] are a double-edged sword. Modifying only about 5% of a model’s activations is sufficient to steer it toward refusal. That concentration is useful for defenders. It also means an adversary who identifies and suppresses that same 5% gets a model with no safety at all. A concentrated defense is an easier target.
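The symmetry is easy to demonstrate with a toy sketch: the same routine that identifies the concentrated ~5% of components for the defender to steer hands the attacker a target list to zero out. The vectors and helper names below are invented for illustration, not from the paper.

```python
def top_fraction_indices(direction, frac=0.05):
    """Indices of the components carrying most of the direction's mass."""
    k = max(1, int(len(direction) * frac))
    order = sorted(range(len(direction)),
                   key=lambda i: abs(direction[i]), reverse=True)
    return set(order[:k])

def sparse_steer(activations, direction, alpha=1.0, frac=0.05):
    # Defender: steer only the concentrated components toward refusal.
    idx = top_fraction_indices(direction, frac)
    return [a + alpha * d if i in idx else a
            for i, (a, d) in enumerate(zip(activations, direction))]

def suppress(activations, direction, frac=0.05):
    # Attacker's mirror image: zero out the very same components.
    idx = top_fraction_indices(direction, frac)
    return [0.0 if i in idx else a for i, a in enumerate(activations)]
```

The defender's efficiency and the attacker's target acquisition are literally the same computation, which is the sense in which a concentrated defense is an easier target.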

And then there’s the capability question that almost nobody wants to talk about. Every safety paper I read that doesn’t pair alignment results with capability outcomes feels incomplete. If a defense reduces jailbreak success by 80% but also makes the model 15% worse at coding or 20% less helpful on benign requests, was it worth it? Most papers don’t report this tradeoff. They show the refusal numbers, declare victory, and leave the capability question as “future work.”

The papers that do measure capability tradeoffs tell a consistent story. There is always a cost. Research on Context Rot showed that even without adversarial intent, increasing input length degrades performance [14]. Models lose track of relevant information and perform worse even when the answer is sitting right at the end of the prompt. If models can’t even reliably attend to what matters in a benign setting, safety mechanisms under adversarial pressure don’t stand a chance.

Every defense I’ve described treats the symptom (harmful output) rather than the disease (we don’t understand why models behave the way they do). We’re steering in directions we found empirically, then empirically evaluating whether the steering worked. It’s a proxy to a proxy to a proxy. When Layer-Gated Sparse Steering uses sparse autoencoder features to detect jailbreak-related activations and selectively suppress them [15], it’s doing something clever and practically useful, yes. But it can’t explain why that feature exists, whether there are backup pathways the model could route through, or what happens to the feature when the input distribution shifts. We’re applying band-aids in the dark and hoping we’re covering the right wounds.

Where this is going

The path forward, I think, runs through mechanistic interpretability. It won’t hand us a magic safety button, but you can’t defend a system you don’t understand. The attention hijacking paper [6] pointed at something real: jailbreaks are information-theoretic attacks on how models process and prioritize signals. If we understand those information flows well enough to formally reason about them, we could build defenses that are provably robust rather than just empirically tested.

We’re nowhere close to that yet. But I’d rather fund ten more novel interpretability papers (assuming they go beyond “we found a SAE feature to control X”) than a hundred more “we beat attack X on benchmark Y” papers. The latter gives you a result that expires in six months. The former gives you knowledge that compounds.

The joke attack still works, by the way. All you need is a sense of humor to make the safety training that cost millions of dollars go down the drain. If that doesn’t tell you we’re still in the dark ages of AI safety, I don’t know what will.


References

[1] P. Cisneros-Velarde, “Bypassing Safety Guardrails in LLMs Using Humor,” arXiv:2504.06577, 2025.

[2] G. Schimpf et al., “Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search,” arXiv, 2025.

[3] Y.-H. Wu, Y.-J. Xiong, H. Zhang, J.-C. Zhang, and Z. Zhou, “Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking,” Findings of EMNLP, arXiv:2504.05652, 2025.

[4] “Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning,” arXiv, 2025.

[7] “Attack and Defense Techniques in Large Language Models: A Survey and New Perspectives,” arXiv, 2025.

[8] “Programming Refusal with Conditional Activation Steering,” arXiv, 2025.

[9] “COSMIC: Generalized Refusal Direction Identification in LLM Activations,” arXiv, 2025.

[10] “Refusal Behavior in Large Language Models: A Nonlinear Perspective,” arXiv, 2025.

[11] “The Hidden Dimensions of LLM Alignment,” arXiv, 2025.

[12] “Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models,” arXiv, 2025.

[13] “Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment,” arXiv, 2025.

[14] “Context Rot: How Increasing Input Tokens Impacts LLM Performance,” arXiv, 2025.

[15] “Layer-Gated Sparse Steering for Large Language Models,” arXiv, 2025.

[16] “Cross-Modal Safety Mechanism Transfer in LVLMs (TGA),” arXiv, 2025.

[17] “Automating Steering for Safe Multimodal Large Language Models (AutoSteer),” arXiv, 2025.

[18] “Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models,” arXiv, 2025.
