Adversarial AI

An archive of posts with this tag.

Sep 30, 2025 Literature Review: Effective Red-Teaming of Policy-Adherent Agents
Sep 05, 2025 Literature Review: Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Aug 16, 2025 Literature Review: The Hidden Dimensions of LLM Alignment
Aug 16, 2025 Literature Review: Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment
Aug 09, 2025 Literature Review: Cross-Modal Safety Mechanism Transfer in LVLMs (TGA)
Jul 19, 2025 Literature Review: Universal Jailbreak Suffixes Are Strong Attention Hijackers
Jul 05, 2025 Literature Review: LLMs Unlock New Paths to Monetizing Exploits
Jun 28, 2025 Literature Review: Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning
Jun 25, 2025 Literature Review: RedCode: Risky Code Execution and Generation Benchmark for Code Agents
Jun 21, 2025 Literature Review: Prompt Injection Attack to Tool Selection in LLM Agents
Jun 21, 2025 Literature Review: A Practical Memory Injection Attack against LLM Agents
Jun 14, 2025 Literature Review: COSMIC: Generalized Refusal Direction Identification in LLM Activations
Jun 09, 2025 Literature Review: Gaming Tool Preferences in Agentic LLMs
Jun 09, 2025 Literature Review: Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models
May 19, 2025 Literature Review: Attack and Defense Techniques in Large Language Models: A Survey and New Perspectives
May 19, 2025 Literature Review: REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLMs