Computer Use Agents: UI-TARS, Claude, and Desktop Automation
Understanding Computer Use Agents (CUA)—from ByteDance's UI-TARS to Claude's desktop automation. How vision-language models are learning to control computers through screenshots and actions.
Beyond Browser Automation
While web agents automate browser tasks, Computer Use Agents (CUAs) go further—they can control entire desktop environments, interact with native applications, and perform any task a human could do with a mouse and keyboard.
From research: "Prominent examples of computer-use agents include commercial products like OpenAI's Operator, Anthropic's Claude Computer Use, Google's Project Mariner and open-source projects like ByteDance's UI-TARS, Agent S2, InfantAgent, and Jedi."
How CUAs Work
The Perception-Action Loop
CUAs operate through a continuous cycle:
1. Capture Screenshot → VLM processes visual state
2. Reason → Plan next action based on task
3. Execute → Mouse/keyboard action
4. Observe → Capture new screenshot
5. Verify → Check if action succeeded
6. Repeat → Until task complete
From OpenAI: "Given a user's instruction, CUA operates through an iterative loop that integrates perception, reasoning, and action: Screenshots from the computer are added to the model's context, CUA reasons through next steps using chain-of-thought, and it performs actions—clicking, scrolling, or typing—until the task is completed."
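The six steps above can be sketched as a minimal driver loop. Here `capture_screenshot`, `query_vlm`, and `run_action` are hypothetical stand-ins for a real screen-capture layer, VLM call, and OS automation backend:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "click", "type", "scroll", or "done"
    payload: dict  # action-specific arguments

def run_agent(task: str, capture_screenshot, query_vlm, run_action, max_steps: int = 50):
    """Minimal perception-reason-act loop; the three callables are placeholders."""
    history = []
    for step in range(max_steps):
        screenshot = capture_screenshot()              # 1. capture screen state
        action = query_vlm(task, screenshot, history)  # 2. reason about next step
        if action.kind == "done":                      # 5-6. task complete
            return history
        run_action(action)                             # 3. execute the action
        history.append(action)                         # context for later steps
    raise TimeoutError(f"Task not finished in {max_steps} steps")
```

The real work hides inside `query_vlm`, which must ground the task in pixels; the loop itself is deliberately simple so it can wrap any model or automation backend.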
Vision-Language Models (VLMs)
CUAs are powered by multimodal models that understand both images and text.
Why vision is essential for computer use: Traditional automation (Selenium, PyAutoGUI scripts) relies on programmatic selectors—find the button with ID "submit" and click it. This breaks whenever the UI changes. CUAs use the same approach humans do: look at the screen and understand what you see. A button that says "Submit" is recognizable whether it has ID "submit", "btn-primary", or no ID at all. This visual understanding makes CUAs more robust to UI changes than traditional automation.
The see-reason-act loop in detail: Each iteration of a CUA involves: (1) Capturing a screenshot—the agent's only window into the computer state; (2) Sending the screenshot + task context to a VLM; (3) The VLM reasoning about what it sees and what action would progress toward the goal; (4) Executing the chosen action (click, type, scroll); (5) Waiting for the UI to respond; (6) Repeating until the task is complete or an error is detected. This loop typically runs 10-50 times for a multi-step task.
The latency challenge: Each loop iteration involves a VLM inference (200-2000ms depending on model), screenshot capture (~50ms), and action execution (varies). A 20-step task might take 10-40 seconds just in inference time. This is why CUAs feel slower than scripted automation—but they're more flexible.
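Those figures imply a simple latency budget. A back-of-the-envelope estimate, using the ranges above and an assumed average action-execution time:

```python
def task_latency_s(steps: int, inference_ms: float,
                   screenshot_ms: float = 50, action_ms: float = 100) -> float:
    """Rough wall-clock estimate for one CUA task.

    inference_ms and screenshot_ms come from the ranges in the text;
    action_ms is an assumed average for mouse/keyboard execution.
    """
    return steps * (inference_ms + screenshot_ms + action_ms) / 1000

# A 20-step task with a fast vs. slow VLM:
fast = task_latency_s(20, inference_ms=500)   # 13.0 s end to end
slow = task_latency_s(20, inference_ms=2000)  # 43.0 s end to end
```

Inference dominates the budget, which is why model choice matters more for perceived speed than screenshot or action overhead.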
| Model | Use Case |
|---|---|
| GPT-4o | OpenAI Operator/CUA |
| Claude Sonnet | Anthropic Computer Use |
| UI-TARS | ByteDance desktop agent |
| Qwen-VL | Open-source alternative |
ByteDance UI-TARS
Overview
The leading open-source computer use model.
Why UI-TARS matters for the ecosystem: Before UI-TARS, computer use was proprietary—you could use Claude Computer Use or OpenAI Operator, but you couldn't run models locally or fine-tune them for your specific use case. UI-TARS provides a capable open-source alternative that you can run on your own hardware, customize for specific applications, and deploy without API costs. This democratizes computer use capabilities.
The unified model approach: Many earlier approaches used separate models for different tasks: one model for understanding what's on screen, another for deciding what to do, another for locating where to click. UI-TARS combines all these capabilities into a single vision-language model. This unified approach reduces latency (one model call instead of several) and enables better reasoning (the same model that understands the screen also decides actions).
From ByteDance: "UI-TARS is a native GUI agent model that solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations)."
From VentureBeat: "UI-TARS, which stands for User Interface — Task Automation and Reasoning System, is engineered to enhance interactions with graphical user interfaces through sophisticated AI capabilities."
Architecture
From research: "Unlike conventional modular systems, UI-TARS consolidates essential elements—perception, reasoning, grounding, and memory—into a unified vision-language model (VLM)."
Model sizes:
- UI-TARS-7B: Lightweight version
- UI-TARS-72B: Full capability version
From research: "Trained on roughly 50B tokens and offered in 7B- and 72B-parameter versions."
Key Innovations
From the UI-TARS paper:
- Enhanced Perception: "Leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning."
- Unified Action Modeling: "Standardizes actions into a unified space across platforms."
- System-2 Reasoning: "Incorporates deliberate reasoning into multi-step decision making, involving task decomposition, reflection thinking, and milestone recognition."
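The "unified action space" idea can be illustrated with a small sketch. The action names and string format below are illustrative, not UI-TARS's actual schema: the point is that every platform-specific gesture (mouse click, mobile tap) is normalized into one textual vocabulary the model emits and a parser recovers.

```python
from dataclasses import dataclass, field

@dataclass
class GUIAction:
    """One entry in an illustrative cross-platform action space."""
    name: str                           # click, drag, type, hotkey, scroll, ...
    args: dict = field(default_factory=dict)

def parse_action(text: str) -> GUIAction:
    """Parse a model-emitted action string such as 'click(x=120, y=340)'."""
    name, _, rest = text.partition("(")
    args = {}
    for pair in rest.rstrip(")").split(","):
        if "=" in pair:
            k, v = pair.split("=", 1)
            args[k.strip()] = v.strip().strip("'\"")
    return GUIAction(name.strip(), args)

# The same "click" maps to mouse events on desktop and taps on mobile
action = parse_action("click(x=120, y=340)")
```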
Performance
From VentureBeat: "The PC/MacOS agent achieves state-of-the-art (SOTA) performance on 10+ GUI benchmarks across perception, grounding and overall agent capabilities, consistently beating out OpenAI's GPT-4o, Claude and Google's Gemini."
Benchmark comparison: From research: "In head-to-head tests against OpenAI CUA and Claude 3.7, UI-TARS-1.5 came out on top:"
| Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 |
|---|---|---|---|
| OSWorld (computer) | 42.5% | 36.4% | 28% |
| WebVoyager (browser) | 84.8% | - | - |
| AndroidWorld (phone) | 64.2% | - | - |
Latest Version: UI-TARS-2
From ByteDance: "On 2025.09.04, ByteDance announced the release of UI-TARS-2, which is a major upgrade featuring enhanced capabilities in GUI, Game, Code and Tool Use. It is an 'All In One' Agent model."
Agent TARS: From ByteDance: "Agent TARS is a multimodal AI agent that aims to explore a work form closer to human-like task completion through rich multimodal capabilities (such as GUI Agent, Vision) and seamless integration with various real-world tools."
Using UI-TARS
Desktop application:
```bash
# UI-TARS Desktop (Electron app)
git clone https://github.com/bytedance/UI-TARS-desktop
cd UI-TARS-desktop
npm install
npm start
```
Python SDK:
```python
from ui_tars import UITARSAgent

agent = UITARSAgent(
    model="ui-tars-72b",
    screenshot_interval=0.5,
)

# Execute task
result = await agent.execute(
    task="Open Chrome and search for 'AI news'",
    max_steps=20,
)
```
Anthropic Claude Computer Use
API-Based Approach
Claude's computer use is developer-focused.
The API-first philosophy: Unlike OpenAI's Operator (which is a consumer product), Claude Computer Use is designed for developers to build into their own applications. You get the computer use capability as a tool that Claude can use, but you're responsible for the surrounding infrastructure: capturing screenshots, executing actions, managing the environment. This gives maximum flexibility but requires more engineering work.
Security through isolation: Computer use is inherently risky—you're letting an AI control a computer. Anthropic's approach pushes security responsibility to the developer: run in a sandbox, use a VM, limit what the agent can access. The reference implementation uses Docker containers with display forwarding. For production, you'd want additional safeguards: network isolation, file system restrictions, action rate limiting, and human-in-the-loop approval for sensitive operations.
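One of the safeguards mentioned above, action rate limiting, is easy to sketch. `ActionRateLimiter` is a hypothetical helper, not part of Anthropic's SDK; it caps how many actions an agent may execute within a sliding time window:

```python
import time

class ActionRateLimiter:
    """Illustrative safeguard: cap agent actions per sliding window."""

    def __init__(self, max_actions: int, window_s: float, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_s = window_s
        self.clock = clock          # injectable for testing
        self.timestamps = []

    def allow(self) -> bool:
        """Record and permit an action, or refuse if the window is full."""
        now = self.clock()
        # Drop timestamps that have aged out of the window
        self.timestamps = [t for t in self.timestamps if now - t < self.window_s]
        if len(self.timestamps) >= self.max_actions:
            return False
        self.timestamps.append(now)
        return True
```

A refused action would typically pause the agent or escalate to a human rather than silently drop the step.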
The tool-based architecture: Claude Computer Use is implemented as tools that Claude can call. This integrates naturally with Claude's existing function calling: Claude reasons about the task, decides to use the computer tool, specifies coordinates and action type, and your code executes it. This design means computer use works alongside other tools—Claude can use both computer use AND web search in the same task.
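Concretely, when Claude decides to act it returns a `tool_use` content block, and your code routes that block to an automation backend. A hedged sketch of the dispatch follows; the input schema (an `action` name plus fields like `coordinate` and `text`) follows Anthropic's documented computer tool, while `dispatch` and `RecordingBackend` are illustrative helpers:

```python
class RecordingBackend:
    """Stand-in automation backend that records calls instead of moving the mouse."""
    def __init__(self):
        self.calls = []
    def click(self, x, y):
        self.calls.append(("click", x, y))
    def type(self, text):
        self.calls.append(("type", text))

def dispatch(block: dict, backend) -> str:
    """Route one computer-tool call to an automation backend (e.g. pyautogui)."""
    if block.get("type") != "tool_use" or block.get("name") != "computer":
        return "ignored"
    action = block["input"]
    if action["action"] == "left_click":
        backend.click(*action["coordinate"])
    elif action["action"] == "type":
        backend.type(action["text"])
    return "ok"

backend = RecordingBackend()
block = {"type": "tool_use", "name": "computer",
         "input": {"action": "left_click", "coordinate": [100, 200]}}
status = dispatch(block, backend)
```

In production the backend would be pyautogui or a VNC client, and the result (usually a fresh screenshot) is returned to Claude as a `tool_result` block.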
From Anthropic: "Developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text."
Tool Configuration
From Anthropic: "As of January 2025, Computer Use requires the API header anthropic-beta: computer-use-2025-01-24 with the claude-sonnet-4-5 model."
```python
import anthropic

client = anthropic.Anthropic()

# Beta tools require the beta namespace and the betas header
response = client.beta.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=4096,
    tools=[
        {
            "type": "computer_20250124",
            "name": "computer",
            "display_width_px": 1920,
            "display_height_px": 1080,
            "display_number": 1,
        },
        {
            "type": "text_editor_20250124",
            "name": "str_replace_editor",
        },
        {
            "type": "bash_20250124",
            "name": "bash",
        },
    ],
    messages=[
        {"role": "user", "content": "Open the calculator app and compute 15 * 37"}
    ],
    betas=["computer-use-2025-01-24"],
)
```
Available Actions
From Anthropic: "The computer use tool supports basic actions (all versions) like mouse_move to move cursor to coordinates. Enhanced actions (computer_20250124) are available in Claude 4 models."
| Action | Description |
|---|---|
| mouse_move | Move cursor to coordinates |
| left_click | Click left mouse button |
| right_click | Right click |
| double_click | Double click |
| type | Type text |
| key | Press keyboard keys |
| scroll | Scroll in direction |
| screenshot | Capture current screen |
Integration Example
```python
import anthropic
import base64
import io

import pyautogui
from PIL import ImageGrab

computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
}

def take_screenshot() -> str:
    screenshot = ImageGrab.grab()
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    return base64.standard_b64encode(buffer.getvalue()).decode()

def execute_action(action: dict):
    # The computer tool's input carries an "action" name plus
    # fields such as "coordinate" and "text"
    if action["action"] == "mouse_move":
        pyautogui.moveTo(*action["coordinate"])
    elif action["action"] == "left_click":
        pyautogui.click()
    elif action["action"] == "type":
        pyautogui.write(action["text"])
    # ... more actions

def run_computer_agent(task: str):
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]

    while True:
        # Get next action
        response = client.beta.messages.create(
            model="claude-sonnet-4-5-20250514",
            max_tokens=4096,
            tools=[computer_tool],
            messages=messages,
            betas=["computer-use-2025-01-24"],
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            break

        # Execute each requested action, then return a fresh
        # screenshot to the model as the tool result
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_action(block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),
                        },
                    }],
                })
        messages.append({"role": "user", "content": tool_results})
```
OpenAI Computer-Using Agent (CUA)
Architecture
From OpenAI: "CUA combines the vision capabilities of GPT-4o with reasoning abilities from OpenAI's more advanced models."
Integration with Operator
CUA powers both Operator and ChatGPT Atlas:
From OpenAI: "Operator is powered by a new model called Computer-Using Agent (CUA)."
Performance: From research: "In benchmark assessments, Operator achieved 38.1% on OSWorld benchmarks and 58.1% on WebArena benchmarks."
CUA Infrastructure
The Cua Project
Open-source infrastructure for computer-use agents:
From GitHub: "Cua provides open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows)."
Features: From research: "Cloud VLM Platform support for Claude Opus, Qwen3 VL 235B, and UI-TARS-2 on Cua VLM cloud infrastructure, along with QEMU Container Support for native Linux and Windows container execution via QEMU virtualization."
Running CUAs Safely
Docker-based isolation:
```dockerfile
FROM ubuntu:22.04

# Install desktop environment
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    firefox \
    python3-pip

# Install agent dependencies
RUN pip install ui-tars anthropic pyautogui

# Start virtual display
CMD ["Xvfb", ":99", "-screen", "0", "1920x1080x24"]
```
VM-based approach:
```python
import subprocess

# Launch QEMU VM
vm_process = subprocess.Popen([
    "qemu-system-x86_64",
    "-m", "8G",
    "-hda", "agent-vm.qcow2",
    "-vnc", ":1",
    "-enable-kvm",
])

# Connect the agent to the VM's VNC display
agent = UITARSAgent(
    vnc_host="localhost",
    vnc_port=5901,
)
```
Benchmarks and Evaluation
OSWorld
Operating system-level task benchmark:
| Agent | Success Rate |
|---|---|
| UI-TARS-1.5 | 42.5% |
| OpenAI CUA | 36.4% |
| Claude 3.7 | 28.0% |
| Human | ~90% |
WebArena
Web-based task benchmark:
| Agent | Success Rate |
|---|---|
| UI-TARS-1.5 | 84.8% |
| OpenAI Operator | 58.1% |
| Claude | ~45% |
AndroidWorld
Mobile GUI tasks:
| Agent | Success Rate |
|---|---|
| UI-TARS-1.5 | 64.2% |
Use Cases
Desktop Automation
- Data entry across applications
- Report generation
- File organization
- Software testing
Enterprise Workflows
- ERP system navigation
- Legacy application automation
- Cross-application data transfer
Personal Productivity
- Email management
- Calendar scheduling
- Document processing
Development
- IDE automation
- Build and deployment tasks
- Testing workflows
Implementation Best Practices
1. Clear Task Specification
```python
# Bad: vague task
task = "Do something with Excel"

# Good: specific task
task = """
Open Microsoft Excel.
Create a new workbook.
In cell A1, type "Name".
In cell B1, type "Revenue".
Add sample data in rows 2-5.
Save the file as "report.xlsx" on the Desktop.
"""
```
2. Checkpoint Verification
```python
async def execute_with_checkpoints(agent, task, checkpoints):
    for i, checkpoint in enumerate(checkpoints):
        result = await agent.execute(checkpoint["action"])

        # Verify checkpoint reached
        screenshot = await agent.screenshot()
        verified = await agent.verify(
            screenshot,
            checkpoint["expected_state"],
        )
        if not verified:
            # Retry or escalate
            await handle_checkpoint_failure(agent, checkpoint)
```
3. Error Recovery
```python
async def robust_execution(agent, task, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = await agent.execute(task)
            if result.success:
                return result
        except Exception as e:
            # Capture state for debugging
            screenshot = await agent.screenshot()

            # Ask agent to diagnose
            diagnosis = await agent.diagnose(
                error=str(e),
                screenshot=screenshot,
                task=task,
            )
            if diagnosis.recoverable:
                await agent.execute(diagnosis.recovery_steps)
            else:
                raise
    raise MaxRetriesExceeded(task)
```
4. Safety Boundaries
```python
import re

class SafeComputerAgent:
    BLOCKED_PATTERNS = [
        r"rm\s+-rf",
        r"format\s+c:",
        r"sudo\s+shutdown",
        r"password|credential|secret",
    ]
    SENSITIVE_APPS = ["banking", "wallet", "keychain"]

    def is_dangerous(self, action) -> bool:
        # Match the action's text form against the blocklist
        text = str(action)
        return any(re.search(p, text, re.IGNORECASE) for p in self.BLOCKED_PATTERNS)

    async def execute(self, action):
        # Check for dangerous commands
        if self.is_dangerous(action):
            raise SafetyViolation(action)

        # Check for sensitive applications
        current_app = await self.get_foreground_app()
        if any(s in current_app.lower() for s in self.SENSITIVE_APPS):
            return await self.request_human_approval(action)

        return await self._execute(action)
```
Comparison: CUA Approaches
| Aspect | UI-TARS | Claude Computer Use | OpenAI CUA |
|---|---|---|---|
| Type | Open-source model | API service | Integrated product |
| Self-hosting | Yes | No | No |
| Desktop support | Full | Full | Browser-focused |
| Best benchmark | OSWorld 42.5% | OSWorld 28% | OSWorld 36.4% |
| Cost | Self-hosted | API pricing | Subscription |
| Customization | Full | Limited | None |
Future Directions
Trends
- Improving accuracy: Current agents achieve 30-40% on complex tasks; aiming for 80%+
- Faster execution: Reducing screenshot-to-action latency
- Better reasoning: System-2 thinking for multi-step planning
- Cross-platform: Unified models for desktop, mobile, web
Challenges
- Security: Agents with system access need robust sandboxing
- Reliability: Current accuracy insufficient for critical tasks
- Speed: Human-level speed requires optimization
- Generalization: Handling unseen applications and edge cases
Conclusion
Computer Use Agents represent the next frontier in AI automation:
- UI-TARS leads in open-source with best-in-class benchmarks
- Claude Computer Use offers API-based access for developers
- OpenAI CUA powers consumer products like Atlas
The technology is rapidly maturing—expect significant improvements in reliability and capability in the coming months.
Related Articles
Agentic Browsing: AI Web Agents and Browser Automation
The rise of AI web agents—Browser-Use, Stagehand, OpenAI Operator, and the tools enabling LLMs to browse, interact, and automate the web autonomously.
Building Agentic AI Systems: A Complete Implementation Guide
Hands-on guide to building AI agents—tool use, ReAct pattern, planning, memory, context management, MCP integration, and multi-agent orchestration. With full prompt examples and production patterns.
Multimodal LLMs: Vision, Audio, and Beyond
Field guide to multimodal LLMs—vision-language models, audio understanding, video comprehension, and any-to-any models. Architecture deep dives, benchmarks, implementation patterns, and production deployment.