Literature Review: API Agents vs. GUI Agents: Divergence and Convergence

Summary

  • This paper presents a comprehensive comparative study of API-based and GUI-based LLM agents, systematically analyzing their divergence and potential convergence.
  • API agents interact with software through predefined programmatic endpoints. They offer robust, efficient, and secure automation but can only do what the available APIs expose.
  • GUI agents operate graphical user interfaces the way a human would, using multimodal LLMs to perceive and manipulate on-screen elements. This gives them broader applicability and makes their actions easy for people to follow, but UI variability makes them less reliable, less efficient, and harder to maintain.

Figures

Comparison of API and GUI Agents

Figure 1: How an API agent and a GUI agent complete the same task (scheduling a meeting in Google Calendar). The API agent makes a single endpoint call, while the GUI agent simulates user actions on the interface.

Unified Orchestrator Example

Figure 2: Example of a unified orchestrator that manages both API and GUI actions within a workflow.

No-Code Platform Example

Figure 3: Illustration of a no-code platform integrating both API calls and GUI agents in workflow automation.

Key Insights

  • API Agents: Excel in environments with stable, well-documented APIs; provide high reliability, efficiency, security, and maintainability but are constrained by the APIs’ scope and availability.
  • GUI Agents: Offer broader applicability, especially for legacy or proprietary software and tasks requiring visual validation or human-like interaction; however, they are more error-prone, less efficient, and require more maintenance due to UI changes.
  • Hybrid Approaches: Emerging tools and frameworks increasingly blend both paradigms, using APIs when they are available and falling back to GUI automation when they are not. This widens task coverage and keeps workflows usable as the underlying software changes.
  • Strategic Selection: The choice between API, GUI, or hybrid agents should be based on the presence of APIs, performance requirements, security needs, the necessity for visual validation, and the rate of interface change.
  • Future Trends: The boundaries between API and GUI agents are expected to blur further as LLMs become more capable, enabling more adaptive, flexible, and intelligent automation solutions.
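
The hybrid pattern described above (use an API when one covers the action, otherwise fall back to the GUI) can be sketched in a few lines. This is a minimal illustration, not the design of any framework discussed in the paper: `HybridAgent`, `ApiUnavailable`, and the two handler functions are hypothetical names, and the handlers return strings instead of touching real software.

```python
from typing import Callable

# A handler takes action parameters and returns a short result string.
Handler = Callable[[dict], str]


class ApiUnavailable(Exception):
    """Hypothetical error: the API exists but cannot serve this request."""


class HybridAgent:
    """Routes each action to an API handler first, then to a GUI fallback."""

    def __init__(self, api_handlers: dict[str, Handler], gui_fallback: Handler):
        self.api_handlers = api_handlers
        self.gui_fallback = gui_fallback

    def run(self, action: str, params: dict) -> tuple[str, str]:
        handler = self.api_handlers.get(action)
        if handler is not None:
            try:
                return "api", handler(params)
            except ApiUnavailable:
                pass  # the API could not serve the request; try the GUI
        # No API coverage (or the API failed): drive the interface instead.
        return "gui", self.gui_fallback({"action": action, **params})


# Stand-in handlers (assumptions for illustration only).
def create_event_via_api(params: dict) -> str:
    return f"created event '{params['title']}' via API"


def drive_gui(params: dict) -> str:
    return f"performed '{params['action']}' through the UI"


agent = HybridAgent({"create_event": create_event_via_api}, drive_gui)
route_a = agent.run("create_event", {"title": "Sync"})        # API covers this
route_b = agent.run("export_legacy_report", {"name": "Q3"})   # no API: GUI path
```

Returning the route label alongside the result lets a caller log how often the slower GUI path is taken, which is one way to notice when an API gap is worth closing.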

Example

  • Scheduling a Meeting in Google Calendar:
    • API Agent: Makes a single authenticated call to the Google Calendar API to create the event.
    • GUI Agent: Opens the calendar web interface, navigates visually, fills in fields, and clicks buttons as a human would, requiring multiple steps and visual understanding.
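
To make the difference in step count concrete, the sketch below builds (without sending or executing anything) the plan each agent type might follow for this task. The API plan targets the Google Calendar API v3 `events.insert` endpoint with authentication omitted. The GUI plan is purely hypothetical: its action names and target labels are illustrative assumptions, not real selectors or the paper's action space.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Meeting:
    title: str
    start: datetime  # timezone-aware
    duration_minutes: int

    @property
    def end(self) -> datetime:
        return self.start + timedelta(minutes=self.duration_minutes)


def api_agent_plan(meeting: Meeting, calendar_id: str = "primary") -> list[dict]:
    """A single HTTP request description (built only, never sent)."""
    return [{
        "method": "POST",
        # Calendar API v3 events.insert; the auth header is left out here.
        "url": f"https://www.googleapis.com/calendar/v3/calendars/{calendar_id}/events",
        "json": {
            "summary": meeting.title,
            "start": {"dateTime": meeting.start.isoformat()},
            "end": {"dateTime": meeting.end.isoformat()},
        },
    }]


def gui_agent_plan(meeting: Meeting) -> list[dict]:
    """Hypothetical UI action sequence; targets are descriptive labels only."""
    fmt = "%Y-%m-%d %H:%M"
    return [
        {"action": "open", "target": "calendar web interface"},
        {"action": "click", "target": "create-event button"},
        {"action": "type", "target": "title field", "text": meeting.title},
        {"action": "type", "target": "start field", "text": meeting.start.strftime(fmt)},
        {"action": "type", "target": "end field", "text": meeting.end.strftime(fmt)},
        {"action": "click", "target": "save button"},
    ]


meeting = Meeting("Project sync", datetime(2025, 1, 15, 10, 0, tzinfo=timezone.utc), 30)
api_steps = api_agent_plan(meeting)
gui_steps = gui_agent_plan(meeting)
print(f"API steps: {len(api_steps)}, GUI steps: {len(gui_steps)}")
```

In a real GUI agent each of those steps would also require locating the element on screen, which is where the visual understanding mentioned above comes in.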

Ratings

| Category | Score | Rationale |
|---|---|---|
| Novelty | N/A | Survey paper |
| Technical Contribution | N/A | Survey paper |
| Readability | 4 | Well-structured, with clear explanations, comparative tables, and practical examples. Some technical depth may challenge non-experts. |



