publications
2025
- GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering. 2025. Preprint. Under review.
- Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks. 2025. Preprint. Under review.
- TRAP: Targeted Redirecting of Agentic Preferences. In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), 2025.
- The Power of Friendship: Analyzing Leadership and Adversarial Attacks in Multi-Agent Collaboration. 2025. Poster accepted to ACM Collective Intelligence 2025; non-archival.