Trustworthy AI

An archive of posts with this tag.

Sep 30, 2025 Literature Review: Knowledge Awareness and Hallucinations in Language Models
Sep 30, 2025 Literature Review: Agentic Misalignment – How LLMs Could Be Insider Threats
Sep 25, 2025 Literature Review: One Token to Fool LLM-as-a-Judge
Sep 25, 2025 Literature Review: Scaling Monosemanticity – Extracting Interpretable Features from Claude 3 Sonnet
Sep 05, 2025 Literature Review: Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base
Sep 05, 2025 Literature Review: Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Aug 16, 2025 Literature Review: The Hidden Dimensions of LLM Alignment
Aug 16, 2025 Literature Review: Jailbreak Antidote – Runtime Safety-Utility Balance via Sparse Representation Adjustment
Aug 16, 2025 Literature Review: Refusal Behavior in Large Language Models: A Nonlinear Perspective
Aug 03, 2025 Literature Review: Manifold Regularization for Locally Stable Deep Neural Networks
Jul 19, 2025 Literature Review: Universal Jailbreak Suffixes Are Strong Attention Hijackers
Jul 05, 2025 Literature Review: Teaching Language Models to Self-Improve by Learning from Language Feedback
Jul 05, 2025 Literature Review: LLMs Unlock New Paths to Monetizing Exploits
Jun 25, 2025 Literature Review: RedCode: Risky Code Execution and Generation Benchmark for Code Agents
Jun 21, 2025 Literature Review: A Practical Memory Injection Attack against LLM Agents
Jun 14, 2025 Literature Review: COSMIC: Generalized Refusal Direction Identification in LLM Activations
Jun 14, 2025 Literature Review: Layer-Gated Sparse Steering for Large Language Models
May 19, 2025 Literature Review: Attack and Defense Techniques in Large Language Models: A Survey and New Perspectives
May 19, 2025 Literature Review: Large Language Models are Autonomous Cyber Defenders
May 12, 2025 Literature Review: Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents