publications
2026
- Securing Multimodal AI through Internal Information Decomposition. Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), 2026. Spotlight.
- InferenceBench: A Benchmark for Open-Ended LLM Inference Optimization by AI Agents. ICML 2026 Agents in the Wild (AIWILD) Workshop, 2026. Spotlight.
- Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks. ICLR 2026 Agentic AI in the Wild (AIWILD) Workshop, 2026.
- GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering. ICML 2026 AI4GOOD Workshop, 2026.
- ResearchArena: Evaluating Sabotage and Monitoring in Automated AI R&D. 2026. Preprint. Under review. * Equal contribution.
2025
- TRAP: Targeted Redirecting of Agentic Preferences. Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), 2025.
- The Power of Friendship: Analyzing Leadership and Adversarial Attacks in Multi-Agent Collaboration. 2025. Poster accepted to ACM Collective Intelligence 2025; non-archival.