AI Safety

An archive of posts with this tag

Feb 10, 2026 Literature Review: Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load
Sep 30, 2025 Literature Review: Agentic Misalignment – How LLMs Could Be Insider Threats
Sep 05, 2025 Literature Review: Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Jun 28, 2025 Literature Review: Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning
Jun 25, 2025 Literature Review: RedCode: Risky Code Execution and Generation Benchmark for Code Agents
Jun 14, 2025 Literature Review: COSMIC: Generalized Refusal Direction Identification in LLM Activations
Jun 09, 2025 Literature Review: Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models
May 28, 2025 Literature Review: Programming Refusal with Conditional Activation Steering
May 19, 2025 Literature Review: REVEAL – Multi-turn Evaluation of Image-Input Harms for Vision LLMs
Apr 29, 2025 Literature Review: Bypassing Safety Guardrails in LLMs Using Humor
Apr 29, 2025 Literature Review: Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search
Apr 29, 2025 Literature Review: Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking