Literature Review: Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts
This paper introduces Query-Relevant Neuron Cluster Attribution (QRNCA), a framework for identifying query-relevant (QR) neurons in large, decoder-only language models, particularly in long-form (multi-choice) text generation. The work addresses gaps in localizing internal model knowledge, constructing new datasets that cover both domain and language knowledge for empirical validation. The analysis reveals localized knowledge regions within LLMs and demonstrates potential downstream applications in knowledge editing and neuron-based prediction.
Key Insights
QRNCA extends prior attribution and knowledge-localization techniques (e.g., Knowledge Attribution; Dai et al., 2022) to modern, decoder-only LLMs such as Llama and Mistral, using prompt engineering to transform arbitrary queries into constrained multi-choice QA tasks that permit gradient-based attribution at the neuron level.
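As a concrete illustration, a minimal sketch of this prompt transformation might look like the following; the exact template and option formatting used by QRNCA are not reproduced here, so the wording is an assumption:

```python
def to_multi_choice_prompt(question: str, options: list[str]) -> str:
    """Wrap an arbitrary query as a constrained multi-choice QA prompt so that
    the model's next-token prediction is restricted to an option letter."""
    letters = "ABCD"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")  # attribution is computed on the predicted letter token
    return "\n".join(lines)

print(to_multi_choice_prompt(
    "Which organelle is the primary site of aerobic respiration?",
    ["Ribosome", "Mitochondrion", "Golgi apparatus", "Nucleus"],
))
```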
The framework calculates neuron attribution using adapted gradients for GLU-based FFNs, aggregates neuron clusters over query variants, introduces inverse cluster attribution (ICA) to downweight neurons common across queries, and prunes ‘common neurons’ (those associated with general or high-frequency tokens).
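A hedged sketch of this scoring pipeline is shown below: integrated-gradient-style attribution over one layer's GLU FFN activations, followed by a TF-IDF-like inverse weighting. The `prob_fn` abstraction, the number of integration steps, and the exact ICA formula are assumptions, not the paper's reference implementation:

```python
import torch

def neuron_attribution(prob_fn, activation: torch.Tensor, num_steps: int = 20) -> torch.Tensor:
    """Riemann approximation of integrated gradients of the answer probability
    with respect to one layer's FFN neuron activations (in the spirit of
    Dai et al., 2022). `prob_fn` stands in for a forward pass in which this
    layer's activation is overridden (e.g., via a hook) and the probability of
    the correct option letter is returned as a scalar tensor."""
    grads = torch.zeros_like(activation)
    for k in range(1, num_steps + 1):
        scaled = (activation.detach() * (k / num_steps)).requires_grad_(True)
        prob = prob_fn(scaled)                          # P(correct letter | prompt)
        grads += torch.autograd.grad(prob, scaled)[0]
    return activation.detach() * grads / num_steps      # per-neuron attribution

def inverse_cluster_attribution(scores_per_query: list[torch.Tensor]) -> torch.Tensor:
    """Aggregate attributions over query variants, downweighting neurons that
    score highly for many variants (an assumed TF-IDF-like weighting)."""
    stacked = torch.stack(scores_per_query)               # [n_queries, n_neurons]
    presence = (stacked > stacked.mean()).float().sum(0)  # rough "fires often" count
    return stacked.sum(0) / (1.0 + presence)
```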
QRNCA reliably identifies neurons whose activation modulates specific knowledge predictions, surpassing activation-based and earlier knowledge-attribution baselines on the authors' probability change ratio (PCR) metric. Notably, the method reveals domain-specific QR neurons concentrated mainly in mid-to-top transformer layers, while language-specific neurons are more diffusely distributed.
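The PCR evaluation can be pictured as follows; the exact normalization in the paper may differ, so this ratio-against-a-random-control form is an assumption:

```python
def probability_change_ratio(p_base: float, p_edit: float, p_control: float) -> float:
    """Ratio of the answer-probability change caused by editing QR neurons to
    the change caused by editing an equal number of random control neurons."""
    eps = 1e-8  # guard against a zero-change control
    return abs(p_edit - p_base) / (abs(p_control - p_base) + eps)

# e.g., boosting QR neurons moves P(correct) from 0.42 to 0.61, while boosting
# random neurons only moves it to 0.44: PCR = 0.19 / 0.02 = 9.5
```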
The heatmap analyses indicate domain knowledge is manifested in relatively concentrated neuron clusters, often in mid-layers, whereas language concepts are more dispersed. This aligns with prior findings on hierarchical abstraction in transformer models.
Example
Suppose a model is given a biology multi-choice question. QRNCA computes neuron attribution scores for each FFN neuron in response to the query, identifies clusters of highly attributed neurons, and prunes those commonly activated across diverse queries (e.g., common neurons such as those signaling the letter “A” or frequent stop words). Boosting the activation of the identified QR neurons measurably increases the probability of the correct answer, and suppressing them decreases it. By manipulating only these neurons, researchers can “edit” the model’s factual predictions in a targeted, query-specific manner.
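A minimal sketch of such an intervention, assuming a HuggingFace-style Llama model where the GLU neuron activations are the input to each layer's `down_proj`; the layer index, neuron ids, and scaling factor below are purely illustrative:

```python
import torch

def make_neuron_scaler(neuron_ids: list[int], factor: float):
    """Forward pre-hook that scales selected FFN neurons entering down_proj:
    factor > 1 boosts the neurons, 0 <= factor < 1 suppresses them."""
    def hook(module, inputs):
        (hidden,) = inputs          # the GLU activations feeding down_proj
        hidden = hidden.clone()
        hidden[..., neuron_ids] *= factor
        return (hidden,)
    return hook

# Usage (hypothetical indices): double two QR neurons in layer 20, run the
# multi-choice prompt, read off P(correct letter), then remove the hook.
# handle = model.model.layers[20].mlp.down_proj.register_forward_pre_hook(
#     make_neuron_scaler([1234, 5678], factor=2.0))
# ... forward pass ...
# handle.remove()
```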
Ratings
Novelty: 2/5
While QRNCA systematizes neuron attribution for decoder-only LLMs and introduces inverse cluster attribution, the technical core is an incremental adaptation of known approaches (gradient attribution, common neuron filtering, prompt engineering). The underlying methods are not fundamentally new; the main advancement is practical application to larger models and long-form tasks, plus new dataset construction and analysis.
Clarity: 3/5
The exposition suffers from ambiguity about the precise algorithmic contributions relative to prior work. Frequent references to previous methodology sometimes obscure what is truly novel; a comparative summary and a clearer demarcation of new techniques versus adaptations would be beneficial.
Personal Comments
QRNCA is a useful step in the ongoing march from “knowledge neuron” studies in compact models (BERT, GPT-2) toward the realities of modern, much larger and more complex LLMs. The combination of attribution-based neuron selection, inverse cluster attribution, and common-neuron pruning is sensible and well motivated by the empirical limitations of existing methods when faced with the combinatorial sprawl of open-domain knowledge in LLMs.
Nevertheless, the work is not a paradigm shift. The central innovation, performing neuron attribution in decoder-only, long-form settings, is nontrivial engineering but not a conceptual leap beyond existing attribution and mediation frameworks. The study is clearer about the problems it tackles than about the novelty of its solutions. There is a missed opportunity in not pushing further into why some concepts (across domains or languages) are more or less localizable, or systematically analyzing which types of knowledge are easiest and hardest to pin down neuronally. These directions would strengthen both interpretability and mechanistic understanding.