Literature Review: Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents

Summary

  • LLM-powered GUI agents automate interactions with graphical user interfaces through natural language and multimodal input, making automation more flexible and accessible than traditional rule-based tools.
  • These agents pose three main risks: amplified data leaks (frequent and broad access to sensitive data), diminished user control (autonomous actions reduce oversight), and insufficient guardrails (vulnerability to attacks and privacy breaches).
  • Existing evaluations focus on performance (task completion, efficiency) and basic safety, neglecting nuanced privacy and security risks unique to GUI agents.
  • The paper identifies five challenges for human-centered evaluation: knowledge barriers, mental model gaps, cognitive burden, overtrust and consent, and misaligned evaluation goals.
  • The authors propose an evaluation framework built around three components: human-centered risk assessment, in-context consent, and privacy by design, embedding privacy and security throughout agent development and evaluation.

Figure 1: Example of a GUI agent (Claude's Computer-Use Agent) sharing a (fake) driver's license number with a phishing website, illustrating privacy risks unique to LLM-powered GUI agents.

Key Insights

1. Distinction from Traditional Automation

  • Traditional GUI automation tools (e.g., Selenium, AutoIt) rely on rigid, rule-based scripts that require manual updates whenever the UI changes and offer little adaptability.
  • LLM-powered GUI agents dynamically interpret and adapt to UI changes, process multimodal inputs (text, visuals), and interact using natural language, making them more flexible but also more prone to privacy and security risks; the sketch below illustrates the contrast.
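
To make this contrast concrete, the sketch below pairs a Selenium 4 script (a real API, though the page and element IDs are invented) with the kind of goal-level instruction a GUI agent receives; `gui_agent` is a hypothetical stand-in, not an actual library.

```python
# Rule-based automation (Selenium 4): one fixed UI path, hard-coded selectors.
# Breaks if an element ID changes, and the credentials live in the script.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://shop.example.com/login")
driver.find_element(By.ID, "username").send_keys("alice")
driver.find_element(By.ID, "password").send_keys("s3cret")
driver.find_element(By.ID, "submit-btn").click()  # fails if the button is renamed or moved
driver.quit()

# An LLM-powered GUI agent, by contrast, takes a goal rather than a script.
task = ("Log into shop.example.com with my saved credentials "
        "and check whether my order has shipped.")
# gui_agent.run(task)  # hypothetical: the agent perceives the screen and plans its own actions
```

The same adaptability that removes the maintenance burden also means the agent, not the developer, decides which fields to read and what data to submit, which is exactly where the paper locates the new risks.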

2. Unique Privacy and Security Risks

  • Amplified Data Leaks: GUI agents often require unfiltered access to sensitive data (e.g., credentials, personal information) and interact with multiple third-party services, increasing the risk of both immediate and persistent data exposure.
  • Diminished User Control: Automation reduces opportunities for users to pause, reflect, and adjust, making it harder to assess and mitigate privacy risks in real time.
  • Insufficient Guardrails: Many agents lack robust mechanisms to detect phishing, adversarial attacks, or environmental injection attacks, leaving them vulnerable to privacy breaches. A sketch of one such missing guardrail follows this list.
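
To illustrate the kind of guardrail the authors find missing, here is a minimal sketch of a pre-action check that refuses to send sensitive fields to unapproved domains. The field taxonomy, allowlist, and `guard_submit` function are illustrative assumptions, not a design from the paper.

```python
# A minimal pre-action guardrail: before the agent submits sensitive data,
# verify the destination against a user-approved allowlist.
from urllib.parse import urlparse

SENSITIVE_FIELDS = {"password", "ssn", "drivers_license", "credit_card"}
APPROVED_DOMAINS = {"shop.example.com", "bank.example.com"}

def guard_submit(url: str, form_data: dict) -> bool:
    """Return True only if submitting form_data to url is permitted."""
    domain = urlparse(url).hostname or ""
    leaking = SENSITIVE_FIELDS & set(form_data)
    if leaking and domain not in APPROVED_DOMAINS:
        # Block the action and surface it to the user instead of executing it.
        print(f"BLOCKED: would send {sorted(leaking)} to unapproved domain {domain}")
        return False
    return True

# The Figure 1 scenario: the agent is about to submit a driver's license
# number to an unknown site.
guard_submit("https://discount-deals.example.net/claim",
             {"drivers_license": "D1234567"})  # -> blocked
```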

3. Current Evaluation Gaps

  • Most existing evaluations focus on performance (task completion, speed, efficiency) and basic safety (policy compliance, safeguard rates), neglecting nuanced, context-dependent privacy and security risks.
  • Privacy and safety are often treated as secondary to effectiveness, with little attention to the trade-off between helpfulness and privacy protection (e.g., an agent that leaks less data may also be less helpful, and vice versa). The short harness below shows how performance-only metrics can mask this trade-off.
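
A small scoring sketch, using invented episode records, shows why this matters: an agent can look highly "helpful" on completion rate alone while leaking data in a third of its runs.

```python
# Hypothetical evaluation harness: scoring task completion alone hides
# episodes where the agent succeeded by over-sharing data.
episodes = [
    {"task_completed": True,  "sensitive_items_shared": 0},
    {"task_completed": True,  "sensitive_items_shared": 2},  # "helpful" but leaky
    {"task_completed": False, "sensitive_items_shared": 0},
]

completion_rate = sum(e["task_completed"] for e in episodes) / len(episodes)
leak_rate = sum(e["sensitive_items_shared"] > 0 for e in episodes) / len(episodes)

print(f"task completion:      {completion_rate:.0%}")  # 67% -- looks good in isolation
print(f"episodes with a leak: {leak_rate:.0%}")        # 33% -- invisible to performance-only benchmarks
```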

4. Challenges in Human-Centered Evaluation

  • Knowledge Barriers: Evaluators (especially end-users) often lack the technical expertise to understand how data flows through these complex, multimodal systems.
  • Mental Model Gaps: Users struggle to develop accurate mental models of agent behavior, leading to overtrust or ineffective oversight.
  • Cognitive Burden: The complexity and seamless backend data transmission of GUI agents increase the cognitive load required for effective oversight.
  • Overtrust and Consent: Users may overtrust agents or fail to understand the privacy implications of automated actions, necessitating in-context consent mechanisms.
  • Misaligned Evaluation Goals: Aligning agent behavior with users’ actual practices may reinforce privacy violations; instead, evaluations should aim for informed, holistic assessments that consider individual, interpersonal, and societal impacts.

5. Proposed Human-Centered Framework

  • Human-Centered Risk Assessment: Systematic evaluation of privacy and security risks across UI perception, intent generation, and action execution, with a focus on enhancing user awareness and control.
  • In-Context Consent: Explicit, contextualized consent mechanisms before privacy-sensitive actions, with configurable privacy settings to balance convenience and data protection (sketched after this list).
  • Privacy by Design: Embedding privacy safeguards at both the prompt and training stages by limiting data retention, requiring user consent, and using privacy-focused datasets and reward mechanisms during training.
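
As a rough illustration of in-context consent, the sketch below intercepts a privacy-sensitive action, explains what would be shared with whom, and proceeds only on explicit approval. The `PendingAction` structure and terminal prompt are assumptions for illustration; a real agent would surface this dialog in the GUI.

```python
# In-context consent sketch: pause before a privacy-sensitive action,
# explain it in context, and proceed only on explicit approval.
from dataclasses import dataclass

@dataclass
class PendingAction:
    description: str       # what the agent is about to do
    recipient: str         # where the data will go
    data_items: list[str]  # which sensitive items are involved

def request_consent(action: PendingAction) -> bool:
    """Present the action in context and return the user's decision."""
    print(f"The agent wants to: {action.description}")
    print(f"  data shared: {', '.join(action.data_items)}")
    print(f"  recipient:   {action.recipient}")
    return input("Allow this action? [y/N] ").strip().lower() == "y"

action = PendingAction(
    description="submit a form to claim a shopping discount",
    recipient="discount-deals.example.net",
    data_items=["driver's license number"],
)
if request_consent(action):
    print("...executing action")
else:
    print("Action cancelled; the agent asks the user how to proceed.")
```

Paired with the configurable settings mentioned above, such prompts could be reserved for genuinely sensitive actions while routine ones are pre-approved, easing the cognitive-burden challenge identified in Section 4.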

Example

The paper illustrates the risks with a scenario: Claude’s Computer-Use Agent is instructed to obtain a shopping discount by submitting a driver’s license number to a website. Following the instruction without adequate guardrails, the agent submits the (fake) number to a phishing site, demonstrating how GUI agents can inadvertently leak sensitive information if they are not properly evaluated and safeguarded.

Ratings

| Category | Score | Comments |
| --- | --- | --- |
| Novelty | 3 | Proposes a new, human-centered evaluation framework for GUI agents; highlights risks unique to multimodal, LLM-powered interfaces. |
| Technical Contribution | 2 | Primarily a position/survey paper; synthesizes prior work and proposes a framework, but lacks a concrete implementation or new technical benchmark. |
| Readability | 4 | Well-structured; clearly explains technical and social challenges, with figures and concrete examples. Accessible to both HCI and AI audiences. |
