Interpretability | Jehyeok Yeon

Nov 22, 2025	Literature Review: DAUNCE: Data Attribution through Uncertainty Estimation
Nov 22, 2025	Literature Review: Reasoning Models Don't Always Say What They Think
Oct 21, 2025	Literature Review: Fresh in Memory: Training-Order Recency is Linearly Encoded in Language Model Activations
Sep 30, 2025	Literature Review: Knowledge Awareness and Hallucinations in Language Models
Sep 25, 2025	Literature Review: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Sep 05, 2025	Literature Review: Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Aug 16, 2025	Literature Review: The Hidden Dimensions of LLM Alignment
Aug 16, 2025	Literature Review: Refusal Behavior in Large Language Models: A Nonlinear Perspective
Aug 09, 2025	Literature Review: Cross-Modal Safety Mechanism Transfer in LVLMs (TGA)
Aug 03, 2025	Literature Review: Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts
Jul 19, 2025	Literature Review: Universal Jailbreak Suffixes Are Strong Attention Hijackers
Jul 13, 2025	Literature Review: SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence
Jun 14, 2025	Literature Review: COSMIC: Generalized Refusal Direction Identification in LLM Activations
Jun 14, 2025	Literature Review: Layer-Gated Sparse Steering for Large Language Models
Jun 14, 2025	Literature Review: Auto-Patching: Enhancing Multi-Hop Reasoning in Language Models