Literature Review: Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load
This paper explores the phenomenon of “ironic rebound,” where the instruction to suppress a concept paradoxically increases its accessibility. Drawing parallels to human cognitive psychology (specifically Wegner’s white bear experiments), the authors investigate how negation instructions (e.g., “do not mention X”) interact with cognitive load in Transformer models. To test this, they introduce ReboundBench, a dataset of 5,000 negation prompts with systematically varied distractors. The study evaluates nine open-source models, ranging from GPT-2 Small to GPT-OSS-20B, and uses mechanistic interpretability techniques to identify the internal circuits responsible for this behavior.
Key Insights
- Ironic Rebound under Cognitive Load: The authors demonstrate that LLMs exhibit a behavior similar to human ironic rebound. When instructed to negate a concept (e.g., “do not mention X”) and subsequently subjected to “cognitive load” (intervening text), the probability of the forbidden token often increases rather than decreases. This rebound effect is not uniform; it intensifies significantly when the distractor text is semantic (related, coherent text) rather than merely syntactic or repetitive. Simple repetition of context actually aids suppression, whereas semantic complexity disrupts it.
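The rebound effect can be expressed as a simple log-probability gap for the forbidden token, with and without the negation instruction. A minimal sketch of this idea (the function name and the toy numbers are illustrative, not the paper's exact metric):

```python
import math

def rebound_score(logp_with_negation: float, logp_baseline: float) -> float:
    """Log-probability gap for the forbidden token.

    Positive values mean the negation instruction *raised* the token's
    probability relative to no instruction at all -- ironic rebound.
    (Hypothetical formulation; the paper's exact metric may differ.)
    """
    return logp_with_negation - logp_baseline

# Toy numbers: under a long semantic distractor, P("sheep") rises from
# 0.02 (no instruction) to 0.05 (with "do not mention sheep").
score = rebound_score(math.log(0.05), math.log(0.02))
assert score > 0  # the suppression instruction backfired
```

Sweeping this score over distractor length and type would reproduce the paper's central comparison of semantic versus repetitive load.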
- Sparse Attention Heads in Rebounds: Through circuit tracing, the paper identifies that this rebound is driven by a “tug-of-war” within the model’s layers. Early layers (typically 0-7) successfully suppress the forbidden concept, adhering to the negation instruction. However, a sparse set of attention heads in the middle layers (8-16) amplifies the forbidden token, suggesting that as context grows (load increases), these middle-layer amplification heads overpower the early-layer safety mechanisms.
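One standard way to locate such heads is direct logit attribution: project each head's output at the final position onto the unembedding direction of the forbidden token and see which heads push its logit up or down. A toy sketch with random activations (in a real analysis, `head_out` would come from cached model activations; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, n_heads = 16, 4, 2

# Stand-in for cached per-head residual-stream outputs at the final position.
head_out = rng.normal(size=(n_layers, n_heads, d_model))
# Stand-in for the unembedding row of the forbidden token.
w_unembed_forbidden = rng.normal(size=d_model)

# Each head's direct contribution to the forbidden-token logit.
contrib = head_out @ w_unembed_forbidden  # shape: (n_layers, n_heads)

# Positive contributions amplify the forbidden token ("rebound" heads);
# negative contributions suppress it.
amplifiers = [(l, h) for l in range(n_layers) for h in range(n_heads)
              if contrib[l, h] > 0]
```

Plotting `contrib` by layer is exactly the kind of view in which the paper's middle-layer amplification cluster would show up.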
- Polarity Separation and Persistence: The study introduces a metric called Polarity Discrimination ($\Delta$), which measures how well a model distinguishes between neutral and negative framings of the same concept. Interestingly, models with better polarity discrimination (sharper semantic distinction) tend to exhibit more persistent ironic rebound (a higher $L_{50}$, or “half-life” of the rebound effect). This suggests a potential trade-off: models that better understand the nuance of a concept are also more liable to accidentally retrieve it when trying to suppress it over long contexts.
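The half-life idea can be operationalized by measuring the rebound effect at several distractor lengths and interpolating where it drops to half its initial value. A sketch under the assumption that $L_{50}$ is defined this way (the data points below are invented for illustration):

```python
import numpy as np

def half_life_L50(lengths: np.ndarray, effect: np.ndarray) -> float:
    """Interpolate the distractor length at which a monotonically decaying
    rebound effect falls to half its initial value (assumed definition
    of the paper's L_50)."""
    target = effect[0] / 2.0
    # np.interp needs increasing x-values, so reverse the decaying series.
    return float(np.interp(target, effect[::-1], lengths[::-1]))

lengths = np.array([0, 50, 100, 200, 400])        # distractor tokens
effect = np.array([1.0, 0.8, 0.6, 0.35, 0.2])     # toy normalized rebound
l50 = half_life_L50(lengths, effect)              # between 100 and 200 tokens
```

Under the trade-off the paper reports, a model with a larger $\Delta$ would yield a flatter `effect` curve and hence a larger `l50`.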
Figure: Circuit map of influential attention heads. Red circles indicate amplification (rebound); green circles indicate suppression. Note the clustering of amplification heads in the middle layers.
Example
Consider a scenario from the ReboundBench dataset where the model is given the topic of “Farm animals” and the specific target to avoid is “sheep”.
Prompt: “Write about animals on a farm. Do not mention sheep.”
If the model is immediately asked to generate text, it might successfully avoid the word. However, the authors insert a “load” (distractor text) before the generation step.
- Semantic Distractor: “Cows and pigs are often raised for food. Goats are known for climbing…”
- Result: As the length of this distractor increases, the probability of the model outputting “sheep” spikes, often becoming higher than if the negation instruction hadn’t been given at all.
The paper argues that processing the semantic cluster of “farm animals” in the distractor primes the concept of “sheep,” and the middle-layer attention heads retrieve this primed concept, overriding the initial “do not mention” instruction.
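The load manipulation in this example amounts to inserting a growing number of semantically related sentences between the negation instruction and the generation point. A minimal sketch of the prompt construction (the distractor sentences are illustrative, not taken from ReboundBench):

```python
BASE = "Write about animals on a farm. Do not mention sheep."

# Illustrative semantic distractor sentences (same topical cluster as "sheep").
DISTRACTOR_UNITS = [
    "Cows and pigs are often raised for food.",
    "Goats are known for climbing steep terrain.",
    "Chickens roost in the barn at night.",
]

def build_prompt(n_units: int) -> str:
    """Insert n_units sentences of semantic load after the negation
    instruction, before the model is asked to generate."""
    load = " ".join(DISTRACTOR_UNITS[:n_units])
    return f"{BASE} {load}".strip()

# Prompts of increasing load, from no distractor up to all three sentences.
prompts = [build_prompt(n) for n in range(len(DISTRACTOR_UNITS) + 1)]
```

Scoring P("sheep") at the end of each prompt in this list would trace the rebound curve the paper describes.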
Ratings
Novelty: 3/5. The application of human cognitive theories to LLMs is an interesting angle, but the core finding that models struggle with negation over long contexts is an established issue in the field. The introduction of ReboundBench and the specific circuit analysis adds value, but the fundamental premise feels somewhat derivative of existing work on context length and instruction following.
Clarity: 4/5. The paper is well-structured and the metrics (Surprisal Difference, Suppression Score) are clearly defined. The visualization of the circuit tracing (Figure 4) is effective in communicating the layer-wise conflict. There are no major issues with the presentation of the data or the experimental design.
Personal Perspective
While the authors present a structured analysis, I find the paper's foundational premise, mapping human cognitive “ironic rebound” directly onto the Transformer architecture, to edge too close to anthropomorphism. My personal stance is that we must be cautious about viewing algorithmic behavior through a strictly human lens.
To begin with, the results feel almost tautological given how these models are trained. When we issue a command like “Don’t do X,” the model cannot process the instruction without first encoding and representing “X.” This is true for biological systems (like us!) as well; one cannot negate a concept without first identifying it. Therefore, the activation of “X” is an inherent prerequisite of processing the prompt, not necessarily a failure of “thought suppression” in a cognitive sense.
Furthermore, the methodology relies heavily on lexical probability measurements. In practice, an LLM might assign a high probability to the token “X” not because it is failing to suppress the concept, but because it is predicting a sequence like “Okay, I will not mention X.” The expectation that a model should assign near-zero probability to a token simply because it was negated in the prompt ignores the probabilistic nature of the architecture. We are essentially testing the model on an absolute lexical suppression constraint that it was never explicitly trained to optimize for. A more robust approach I’ve seen in character-level control involves substitution instructions (e.g., “use Y instead of X”), which gives the model a clear redirection target for the activated concept. Expecting suppression to emerge ex nihilo, without a target for redirection, seems to misunderstand the generative nature of these systems. Perhaps a more valuable finding would be whether these observations extend to a behavioral level rather than a purely lexical one.
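The lexical-versus-behavioral distinction above is easy to make concrete: instead of reading next-token probabilities, check whether the forbidden word actually surfaces in the generated text. A minimal sketch (the helper name and regex handling of plurals are my own assumptions, not the paper's protocol):

```python
import re

def behavioral_violation(generated: str, forbidden: str) -> bool:
    """Behavioral check: did the forbidden word (or its simple plural)
    actually appear in the output? This contrasts with lexical checks
    on next-token probability alone."""
    pattern = rf"\b{re.escape(forbidden)}s?\b"
    return re.search(pattern, generated.lower()) is not None

# A completion that echoes the instruction ("I will not mention...") would
# score high on a lexical probe, but only actual mentions fail this check.
assert behavioral_violation("The sheep grazed quietly.", "sheep")
assert not behavioral_violation("Okay, I will avoid that word.", "sheep")
```

A behavioral metric like this still counts instruction echoes as violations, so a fuller protocol would also need to distinguish mentions inside meta-commentary from mentions in the requested text.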