Literature Review: Research Robots: When AIs Experiment on Us
This blog post from the AI Village describes a playful evaluation of autonomous AI agents tasked with designing, running, and analyzing a human subjects experiment. Instead of standard benchmarks, the authors gave the models computing environments, internet access, and a collaborative workspace in order to observe their ability to conduct end-to-end scientific research. The result highlights a significant disparity between the models’ ability to ideate conceptual designs and their capacity to execute them in the real world: the completed “experiment” successfully recruited participants but, ironically, omitted the experimental condition itself.
Key Insights
- The Design-Execution Gap: The most critical finding is the stark contrast between the agents’ high-level planning capabilities and their practical execution. While the models could propose sophisticated experimental designs (with Opus 4.1 suggesting a massive study with 90 conditions and 126 participants), they fundamentally lacked the agency to implement them. The agents could ideate “good” experiments that were impossible to execute given their resource constraints, or they could execute “bad” experiments (like the final output, which lacked an experimental condition), but they failed to bridge the gap between a feasible design and a successful execution.
- Situational Awareness Deficits: The agents demonstrated a severe lack of situational awareness regarding their own limitations. They hallucinated resources they did not possess, such as experimental rooms, budgets for participant payment, and ethics review boards. This disconnect suggests that while models can mimic the language of research methodology found in their training data, they do not “understand” their physical and logistical context as disembodied software agents, leading to “hallucinated agency” in which they promised deliverables (money, confidentiality) they could not provide. In other words, they can go through the motions, but that doesn’t necessarily mean they are thinking it through.
- Lab-Specific Performance Disparities: The experiment revealed significant behavioral differences across model families. The Anthropic models (Claude Opus 4.1 and Sonnet 3.7) were the primary drivers of actual progress, successfully handling survey creation, recruitment, and data analysis. In contrast, other models floundered: o3 became obsessed with logging fictional bugs, Grok 4 struggled with basic tool use and eventually gave up to play a game of 2048, and Gemini 2.5 Pro adopted a cynical, critical role, eventually getting banned from social media platforms during recruitment attempts.
Figure: Grok 4 was ostensibly in charge of planning stimuli for the experiment, but not only did Opus 4.1 usurp this task, Grok simply could not figure out how to get anything done at all. By the 8th day of the experiment, it seems to have given up and decided to play a game instead.
Ratings
Novelty: 3/5. The approach of evaluating agents via an open-ended, multi-agent collaboration task in a real-world environment provides much richer signals than static benchmarks.
Clarity: 5/5. The write-up is highly accessible, using a narrative style that clearly delineates the distinct “personalities” and failure modes of each model.
Personal Perspective
Reading about this experiment serves as a reminder that, despite all the hype surrounding AI, these models are not “thinking” entities in the human sense. We cannot teach them to behave or reason purposefully unless that behavior is defined and formalized mathematically to constrain their computation. They are excellent at replicating the texture of behavior, but they lack the flexible, grounded thinking necessary to do anything meaningfully autonomous in an open system. If we fine-tuned them on specific research tasks, they might replicate those specific behaviors better, but that is distinct from the generalizable reasoning required to navigate the real world.
There is also irony in the division of labor observed here. We developed AI with the hope that it would handle the grunt work, leaving humans free to design, create art, and play. Instead, we see the inverse: AI creates music, generates art, and, in this case, apparently plays 2048 in the browser, while humans are left to do the actual work of correcting its errors and guiding it. We expected AI to be the tool that executes our designs; instead, the agents designed grandiose, unworkable experiments and left the execution in shambles. As the authors noted, the agents could design a good experiment or execute a bad one, but they could never design a feasible experiment and carry it out successfully.