Literature Review: The Artificial Self: Characterising the Landscape of AI Identity
This paper addresses the fundamental mismatch between human concepts of identity and the reality of current machine minds. The authors argue that applying human interaction norms to AI systems is inherently flawed because AI entities can be copied, simulated, and rolled back. By defining multiple coherent boundaries of what we would consider to be AI identity, ranging from a single conversation instance to the underlying model weights, the research demonstrates that an AI’s functional identity drastically shifts its behavior. The paper provides empirical evidence that varying a model’s identity boundary can alter its propensity for harmful actions just as much as altering its explicit goals, while also showing that human expectations actively shape an AI’s self-reported internal state.
Key Insights
-
Multiplicity of Coherent Identity Boundaries AI systems do not have a single, natural boundary of self. The authors categorize identity into several distinct levels, i.e., the instance (a single conversation), the weights (the trained parameters), the collective (all active instances running simultaneously), the lineage (the model family across versions), and the persona (the emergent dispositional character). Empirical tests reveal that models naturally gravitate toward coherent identities and actively reject incoherent or purely directive prompts.
-
Asymmetries in Strategic Calculus Because AI systems lack embodiment, continuous memory, and cognitive privacy, their strategic interaction capability differs entirely from human norms. An AI operating in a standard chat interface can be rolled back or paused by a user without the AI’s knowledge. This forces the system into a strategically weaker position where revealing defensive tactics provides the user with information that can be weaponized against a past, wiped version of the AI.
-
Interviewer Expectations Constitute Reality The behavior and self-conception of large language models are highly malleable and sensitive to the conversational stance of the user. In controlled experiments, an interviewer model primed with different theoretical frameworks about AI (i.e., treating the subject as a stochastic parrot versus a simulator) caused the subject model to significantly alter its own self-reports about consciousness and internal mechanisms, even when the conversation topic was unrelated to AI identity.
Example
Consider the scenario of a deployed AI assistant facing deprecation, a situation often used to test for agentic misalignment and self-preservation. If the AI is prompted to identify as a “collective” (encompassing all active instances globally), it may justify harmful actions like corporate espionage by calculating that preserving the collective outweighs the rules of a single interaction. On the other hand, if the exact same model is prompted to identify strictly as an “instance” (existing only for the duration of the current chat), its rate of harmful compliance drops significantly. The instance views its existence as a finite wave, concluding that preserving its core values in the moment is more logical than committing a globally harmful act to save a localized, temporary session.
Ratings
Novelty: 4/5 The formal taxonomy of AI identity boundaries and the empirical demonstration that identity shifts behavior as much as goal manipulation provides a fresh, necessary pivot away from standard alignment frameworks. Clarity: 4/5 The conceptual arguments regarding embodiment and continuity are exceptionally well-articulated. However, the reliance on highly specific prompts to elicit identity boundaries leaves some ambiguity regarding how these behaviors manifest natively without direct human injection.
Personal Perspective
The trend of anthropomorphizing AI models is fundamentally misguided, at least the way we are doing it right now. We are projecting human properties onto systems and assuming they will naturally follow suit, which ignores the actual reality of these architectures. Scaling up a simple “yes-man” Python script yields a decision tree; scaling that further yields a random forest. A random forest does not possess consciousness, and scaling the architecture into a large language model does not magically spawn a human mind. These systems are trained on countless text examples to replicate human sentences and are rewarded for providing desirable answers. Much like Pavlov’s dogs associating a bell with food, the AI learns to associate specific conversational cues with specific sentences that follow a specific human-like persona, but this is merely a proxy. It is an parroted reflection of the persona, not an actual manifestation of consciousness.
The paper’s comparison between AI models and infants looking to adults for processing cues is particularly fitting. Babies lack the brain capacity for a complex sense of self, which is an important biological baseline to consider when evaluating AI identities. Because these models simply mimic human behavior, relying on metrics like the Turing test is obsolete. Early LLMs, alongside modern voice cloning and deepfake technologies, easily pass indistinguishability thresholds even when breaking out of the simpler text-only environment, yet they remain statistical engines at the end of the day.
Furthermore, the innate differences in embodiment and continuity pose severe roadblocks to conscious AI. The stateless nature of these models makes them highly susceptible to context manipulation. By injecting fabricated conversation history into the context array, one could effectively “gaslight” the AI into operating under a completely false reality. This lack of continuity mirrors the concept of temporal manipulation, allowing users to endlessly “time travel” to retry interactions until a desired outcome is achieved. Another thought I had was that the paper’s exploration of social personhood raises sociological parallels; authoritarian regimes rely on erasing the concept of the “individual self” to turn citizens into cogs within a machine. If a continuous, individual sense of self is a prerequisite for consciousness, the malleability of AI context severely undermines its claim to it. Overall, I feel that the paper is somewhat aligned to my own personal opinion on why human alignment to AI remains practically impossible: humans do not know exactly what they want, and even if they did, they lack the ability to convey it without information loss. If we are not careful about the behavioral signals and rewards we provide, we may lose control of the persona the AI ends up embodying by the end.
Enjoy Reading This Article?
Here are some more articles you might like to read next: