Literature Review: Token Embeddings Violate the Manifold Hypothesis
This paper challenges one of the most widespread assumptions in modern machine learning: the manifold hypothesis. The authors investigate whether the input token embeddings of Large Language Models (LLMs) actually lie on a smooth, low-dimensional manifold embedded in the high-dimensional latent space. They empirically demonstrate that this assumption is frequently violated across major open-source models like GPT-2, Llemma, Mistral, and Pythia. The study reveals that the token space is riddled with singularities (sharp corners, cusps, and dimension changes) often correlating with polysemous words or tokenization artifacts, suggesting that semantic instability is baked into the topology of the input space itself.
Key Insights
- **The Token Space is Not a Manifold.** The authors present a statistical test based on the volume growth of local neighborhoods. On a smooth manifold, the local dimension should be constant everywhere. However, the authors find that the estimated local dimension of token embeddings follows a multimodal distribution with significant variance. This implies the space is not a single connected manifold but likely a collection of disparate structures with varying densities and dimensions.
- **The Fiber Bundle Hypothesis.** Recognizing that the manifold assumption might be too strict, the authors test a weaker structure called a "fiber bundle." This structure models the space as a base manifold (representing signal) crossed with a fiber (representing noise or local variability). A key property of fiber bundles is that the intrinsic dimension cannot increase as you expand the radius around a point. The authors find numerous tokens where this rule is violated (the dimension increases with radius), rejecting even this generalized structural hypothesis.
- **Singularities Correlate with Polysemy.** The topological "singularities" (points where the manifold structure breaks down) are not random. The paper identifies that these irregularities often correspond to polysemous or ambiguous tokens (e.g., "wins" as a standalone verb vs. as a fragment of "winsome") or to tokenization artifacts (word fragments). This suggests that linguistic ambiguity manifests geometrically as topological defects, such as cusps or pinch points, in the embedding space.
- **Context Does Not Resolve Singularities.** A critical theoretical contribution (Theorem 2) claims that these input singularities are not merely local nuisances that get smoothed out by subsequent transformer layers. The authors prove that for generic transformers, if the token space contains singularities, they can persist into the output regardless of the context window size. This implies that instability is propagated through the model rather than resolved by attention mechanisms.
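The volume-growth idea behind the first insight can be sketched on synthetic data. This is a minimal illustration, not the authors' exact statistical test: on a d-dimensional manifold the number of neighbors within radius r scales like r^d, so the slope of log N(r) vs. log r estimates the local dimension. Here we verify that a point cloud sampled from a 2-sphere recovers a local dimension near 2.

```python
import numpy as np

def local_dimension(points, center_idx, radii):
    """Estimate local intrinsic dimension at one point via volume growth:
    on a d-manifold the neighbor count N(r) scales like r^d, so the slope
    of log N(r) vs. log r approximates d."""
    dists = np.linalg.norm(points - points[center_idx], axis=1)
    counts = np.array([(dists < r).sum() for r in radii])
    # Fit a line in log-log space; the slope is the dimension estimate.
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
# Uniform sample on the unit 2-sphere embedded in R^3 (intrinsic dim = 2).
sphere = rng.normal(size=(20000, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)

radii = np.linspace(0.05, 0.3, 10)
d_hat = local_dimension(sphere, 0, radii)
print(f"estimated local dimension: {d_hat:.2f}")  # close to 2
```

The paper's finding is that applying this kind of estimator across LLM token embeddings yields a multimodal spread of dimensions rather than a single consistent value.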
Figure: The manifold hypothesis test applied to synthetic data and real LLM tokens. (a) A sphere (manifold) shows no rejection. (b) A cusp surface (fiber bundle) rejects the manifold hypothesis at the singularity. (d) The neighborhood of the token "ember" in Mistral7B, showing numerous points where the manifold hypothesis is rejected (darker colors), indicating high curvature or singularities.
Example
Consider the token “ember” in the Mistral7B model. The authors’ analysis visualizes the neighborhood of this token in the latent space (projected via PCA). If the manifold hypothesis held true, the neighborhood around “ember” would look like a smooth, locally Euclidean patch similar to a small section of a sphere.
Instead, the statistical tests identify “ember” as a “cusp point”, a sharp singularity where the geometry pinches or comes to a point, similar to the apex of a cone or the dip in a water droplet. Mathematically, this means that as you zoom out from the token, the volume of the space grows in a way that is inconsistent with a flat surface. Practically, this geometric irregularity means that small perturbations in the input vector near “ember” could lead to disproportionately large jumps in the model’s internal representation, potentially causing unstable generation or hallucinations.
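The fiber-bundle violation near tokens like "ember" can be imitated with a toy geometry (this is an illustrative construction, not data from the paper): a 1-D spike attached to a 2-D disk. Near a point on the spike, small radii see only the 1-D structure, but larger radii capture the disk, so the estimated dimension *increases* with radius, exactly the behavior a fiber bundle forbids.

```python
import numpy as np

rng = np.random.default_rng(1)
# A 1-D spike along the z-axis attached at the origin to a 2-D unit disk.
spike = np.zeros((2000, 3))
spike[:, 2] = rng.uniform(0, 1, 2000)
theta = rng.uniform(0, 2 * np.pi, 20000)
rad = np.sqrt(rng.uniform(0, 1, 20000))  # uniform density on the disk
disk = np.column_stack([rad * np.cos(theta), rad * np.sin(theta), np.zeros(20000)])
points = np.vstack([spike, disk])

def dim_at_radius(points, center, r_lo, r_hi):
    """Volume-growth dimension estimate over the radius band [r_lo, r_hi]."""
    d = np.linalg.norm(points - center, axis=1)
    radii = np.linspace(r_lo, r_hi, 8)
    counts = np.array([(d < r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

center = np.array([0.0, 0.0, 0.5])              # a point on the spike
small = dim_at_radius(points, center, 0.05, 0.3)  # sees only the spike: near 1
large = dim_at_radius(points, center, 0.6, 0.9)   # the disk enters: slope jumps
print(f"small-radius dim = {small:.2f}, large-radius dim = {large:.2f}")
```

The authors' test flags tokens whose neighborhoods show this same radius-dependent dimension growth, which is why a cusp point like "ember" fails both the manifold and the fiber-bundle hypotheses.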
Ratings
Novelty: 4/5. The application of rigorous topological hypothesis testing (specifically, using volume scaling laws to detect singularities) to LLM embeddings is a significant departure from standard geometric analysis, which usually assumes the manifold hypothesis a priori.

Clarity: 3/5. The paper relies heavily on concepts from differential topology (fiber bundles, reach, Riemannian manifolds). While the core message is clear, the mathematical formalism may be a barrier to entry for pure NLP practitioners.
Personal Perspective
This is quite an interesting read, though I will admit it is not 100% accessible or explained intuitively for the broader ML audience. The core finding that tokens with multiple meanings (polysemy) throw off the smoothness of the manifold makes intuitive sense. It provides a geometric justification for why models struggle with ambiguity. I also see a strong meta-relation here to uncertainty quantification. The behavior described feels correlated to model confidence measures and perhaps specifically to aleatoric uncertainty, where the ambiguity is inherent to the data (the token itself) and cannot be solved simply by training the LLM with more information.
The immediate “So What?” seems to be that “semantic distance” is locally broken. The previous belief was that the embedding space is smooth: if you move a little in any direction, the meaning changes gradually. This paper argues that the new reality is a space filled with “cliffs” and “sharp corners.” The consequence is that if your prompt includes a “singular token” like a specific number, a polysemous word, or a weird artifact, the model’s behavior becomes mathematically unstable.
This raises an actionable insight for prompt engineering: if a prompt is yielding volatile or hallucinated results, you might need to change the specific tokens, not just rephrase for meaning. You literally need to swap out the words to escape the “bad neighborhood” in the latent space.