Computer Use Agents: UI-TARS, Claude, and Desktop Automation
Understanding Computer Use Agents (CUA)—from ByteDance's UI-TARS to Claude's desktop automation. How vision-language models are learning to control computers through screenshots and actions.
Beyond Browser Automation
While web agents automate browser tasks, Computer Use Agents (CUAs) go further—they can control entire desktop environments, interact with native applications, and perform any task a human could do with a mouse and keyboard.
From research: "Prominent examples of computer-use agents include commercial products like OpenAI's Operator, Anthropic's Claude Computer Use, Google's Project Mariner and open-source projects like ByteDance's UI-TARS, Agent S2, InfantAgent, and Jedi."
How CUAs Work
The Perception-Action Loop
CUAs operate through a continuous cycle:
1. Capture Screenshot → VLM processes visual state
2. Reason → Plan next action based on task
3. Execute → Mouse/keyboard action
4. Observe → Capture new screenshot
5. Verify → Check if action succeeded
6. Repeat → Until task complete
From OpenAI: "Given a user's instruction, CUA operates through an iterative loop that integrates perception, reasoning, and action: Screenshots from the computer are added to the model's context, CUA reasons through next steps using chain-of-thought, and it performs actions—clicking, scrolling, or typing—until the task is completed."
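The six steps above can be sketched as a minimal driver loop. Here `capture_screenshot`, `query_vlm`, and `run_action` are hypothetical stand-ins for a real screen-capture layer, VLM call, and OS automation backend:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "click", "type", "scroll", or "done"
    payload: dict  # action-specific arguments

def run_agent(task: str, capture_screenshot, query_vlm, run_action, max_steps: int = 50):
    """Minimal perception-reason-act loop; the three callables are placeholders."""
    history = []
    for step in range(max_steps):
        screenshot = capture_screenshot()              # 1. capture screen state
        action = query_vlm(task, screenshot, history)  # 2. reason about next step
        if action.kind == "done":                      # 5-6. task complete
            return history
        run_action(action)                             # 3. execute the action
        history.append(action)                         # context for later steps
    raise TimeoutError(f"Task not finished in {max_steps} steps")
```

The real work hides inside `query_vlm`, which must ground the task in pixels; the loop itself is deliberately simple so it can wrap any model or automation backend.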
Vision-Language Models (VLMs)
CUAs are powered by multimodal models that understand both images and text.
Why vision is essential for computer use: Traditional automation (Selenium, PyAutoGUI scripts) relies on programmatic selectors—find the button with ID "submit" and click it. This breaks whenever the UI changes. CUAs use the same approach humans do: look at the screen and understand what you see. A button that says "Submit" is recognizable whether it has ID "submit", "btn-primary", or no ID at all. This visual understanding makes CUAs more robust to UI changes than traditional automation.
The see-reason-act loop in detail: Each iteration of a CUA involves: (1) Capturing a screenshot—the agent's only window into the computer state; (2) Sending the screenshot + task context to a VLM; (3) The VLM reasoning about what it sees and what action would progress toward the goal; (4) Executing the chosen action (click, type, scroll); (5) Waiting for the UI to respond; (6) Repeating until the task is complete or an error is detected. This loop typically runs 10-50 times for a multi-step task.
The latency challenge: Each loop iteration involves a VLM inference (200-2000ms depending on model), screenshot capture (~50ms), and action execution (varies). A 20-step task might take 10-40 seconds just in inference time. This is why CUAs feel slower than scripted automation—but they're more flexible.
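Those figures imply a simple latency budget. A back-of-the-envelope estimate, using the ranges above and an assumed average action-execution time:

```python
def task_latency_s(steps: int, inference_ms: float,
                   screenshot_ms: float = 50, action_ms: float = 100) -> float:
    """Rough wall-clock estimate for one CUA task.

    inference_ms and screenshot_ms come from the ranges in the text;
    action_ms is an assumed average for mouse/keyboard execution.
    """
    return steps * (inference_ms + screenshot_ms + action_ms) / 1000

# A 20-step task with a fast vs. slow VLM:
fast = task_latency_s(20, inference_ms=500)   # 13.0 s end to end
slow = task_latency_s(20, inference_ms=2000)  # 43.0 s end to end
```

Inference dominates the budget, which is why model choice matters more for perceived speed than screenshot or action overhead.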
| Model | Use Case |
|---|---|
| GPT-4o | OpenAI Operator/CUA |
| Claude Sonnet | Anthropic Computer Use |
| UI-TARS | ByteDance desktop agent |
| Qwen-VL | Open-source alternative |
ByteDance UI-TARS
Overview
The leading open-source computer use model.
Why UI-TARS matters for the ecosystem: Before UI-TARS, computer use was proprietary—you could use Claude Computer Use or OpenAI Operator, but you couldn't run models locally or fine-tune them for your specific use case. UI-TARS provides a capable open-source alternative that you can run on your own hardware, customize for specific applications, and deploy without API costs. This democratizes computer use capabilities.
The unified model approach: Many earlier approaches used separate models for different tasks: one model for understanding what's on screen, another for deciding what to do, another for locating where to click. UI-TARS combines all these capabilities into a single vision-language model. This unified approach reduces latency (one model call instead of several) and enables better reasoning (the same model that understands the screen also decides actions).
From ByteDance: "UI-TARS is a native GUI agent model that solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations)."
From VentureBeat: "UI-TARS, which stands for User Interface — Task Automation and Reasoning System, is engineered to enhance interactions with graphical user interfaces through sophisticated AI capabilities."
Architecture
From research: "Unlike conventional modular systems, UI-TARS consolidates essential elements—perception, reasoning, grounding, and memory—into a unified vision-language model (VLM)."
Model sizes:
- UI-TARS-7B: Lightweight version
- UI-TARS-72B: Full capability version
From research: "Trained on roughly 50B tokens and offered in 7B- and 72B-parameter versions."
Key Innovations
From the UI-TARS paper:
- Enhanced Perception: "Leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning."
- Unified Action Modeling: "Standardizes actions into a unified space across platforms."
- System-2 Reasoning: "Incorporates deliberate reasoning into multi-step decision making, involving task decomposition, reflection thinking, and milestone recognition."
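The "unified action space" idea can be illustrated with a small sketch. The action names and string format below are illustrative, not UI-TARS's actual schema: the point is that every platform-specific gesture (mouse click, mobile tap) is normalized into one textual vocabulary the model emits and a parser recovers.

```python
from dataclasses import dataclass, field

@dataclass
class GUIAction:
    """One entry in an illustrative cross-platform action space."""
    name: str                           # click, drag, type, hotkey, scroll, ...
    args: dict = field(default_factory=dict)

def parse_action(text: str) -> GUIAction:
    """Parse a model-emitted action string such as 'click(x=120, y=340)'."""
    name, _, rest = text.partition("(")
    args = {}
    for pair in rest.rstrip(")").split(","):
        if "=" in pair:
            k, v = pair.split("=", 1)
            args[k.strip()] = v.strip().strip("'\"")
    return GUIAction(name.strip(), args)

# The same "click" maps to mouse events on desktop and taps on mobile
action = parse_action("click(x=120, y=340)")
```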
Performance
From VentureBeat: "The PC/MacOS agent achieves state-of-the-art (SOTA) performance on 10+ GUI benchmarks across perception, grounding and overall agent capabilities, consistently beating out OpenAI's GPT-4o, Claude and Google's Gemini."
Benchmark comparison: From research: "In head-to-head tests against OpenAI CUA and Claude 3.7, UI-TARS-1.5 came out on top:"
| Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 |
|---|---|---|---|
| OSWorld (computer) | 42.5% | 36.4% | 28% |
| WebVoyager (browser) | 84.8% | - | - |
| AndroidWorld (phone) | 64.2% | - | - |
Latest Version: UI-TARS-2
From ByteDance: "On 2025.09.04, ByteDance announced the release of UI-TARS-2, which is a major upgrade featuring enhanced capabilities in GUI, Game, Code and Tool Use. It is an 'All In One' Agent model."
Agent TARS: From ByteDance: "Agent TARS is a multimodal AI agent that aims to explore a work form closer to human-like task completion through rich multimodal capabilities (such as GUI Agent, Vision) and seamless integration with various real-world tools."
Using UI-TARS
Desktop application:
```bash
# UI-TARS Desktop (Electron app)
git clone https://github.com/bytedance/UI-TARS-desktop
cd UI-TARS-desktop
npm install
npm start
```
Python SDK:
```python
from ui_tars import UITARSAgent

agent = UITARSAgent(
    model="ui-tars-72b",
    screenshot_interval=0.5,
)

# Execute task
result = await agent.execute(
    task="Open Chrome and search for 'AI news'",
    max_steps=20,
)
```
Anthropic Claude Computer Use
API-Based Approach
Claude's computer use is developer-focused.
The API-first philosophy: Unlike OpenAI's Operator (which is a consumer product), Claude Computer Use is designed for developers to build into their own applications. You get the computer use capability as a tool that Claude can use, but you're responsible for the surrounding infrastructure: capturing screenshots, executing actions, managing the environment. This gives maximum flexibility but requires more engineering work.
Security through isolation: Computer use is inherently risky—you're letting an AI control a computer. Anthropic's approach pushes security responsibility to the developer: run in a sandbox, use a VM, limit what the agent can access. The reference implementation uses Docker containers with display forwarding. For production, you'd want additional safeguards: network isolation, file system restrictions, action rate limiting, and human-in-the-loop approval for sensitive operations.
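One of the safeguards mentioned above, action rate limiting, is easy to sketch. `ActionRateLimiter` is a hypothetical helper, not part of Anthropic's SDK; it caps how many actions an agent may execute within a sliding time window:

```python
import time

class ActionRateLimiter:
    """Illustrative safeguard: cap agent actions per sliding window."""

    def __init__(self, max_actions: int, window_s: float, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_s = window_s
        self.clock = clock          # injectable for testing
        self.timestamps = []

    def allow(self) -> bool:
        """Record and permit an action, or refuse if the window is full."""
        now = self.clock()
        # Drop timestamps that have aged out of the window
        self.timestamps = [t for t in self.timestamps if now - t < self.window_s]
        if len(self.timestamps) >= self.max_actions:
            return False
        self.timestamps.append(now)
        return True
```

A refused action would typically pause the agent or escalate to a human rather than silently drop the step.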
The tool-based architecture: Claude Computer Use is implemented as tools that Claude can call. This integrates naturally with Claude's existing function calling: Claude reasons about the task, decides to use the computer tool, specifies coordinates and action type, and your code executes it. This design means computer use works alongside other tools—Claude can use both computer use AND web search in the same task.
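Concretely, when Claude decides to act it returns a `tool_use` content block, and your code routes that block to an automation backend. A hedged sketch of the dispatch follows; the input schema (an `action` name plus fields like `coordinate` and `text`) follows Anthropic's documented computer tool, while `dispatch` and `RecordingBackend` are illustrative helpers:

```python
class RecordingBackend:
    """Stand-in automation backend that records calls instead of moving the mouse."""
    def __init__(self):
        self.calls = []
    def click(self, x, y):
        self.calls.append(("click", x, y))
    def type(self, text):
        self.calls.append(("type", text))

def dispatch(block: dict, backend) -> str:
    """Route one computer-tool call to an automation backend (e.g. pyautogui)."""
    if block.get("type") != "tool_use" or block.get("name") != "computer":
        return "ignored"
    action = block["input"]
    if action["action"] == "left_click":
        backend.click(*action["coordinate"])
    elif action["action"] == "type":
        backend.type(action["text"])
    return "ok"

backend = RecordingBackend()
block = {"type": "tool_use", "name": "computer",
         "input": {"action": "left_click", "coordinate": [100, 200]}}
status = dispatch(block, backend)
```

In production the backend would be pyautogui or a VNC client, and the result (usually a fresh screenshot) is returned to Claude as a `tool_result` block.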
From Anthropic: "Developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text."
Tool Configuration
From Anthropic: "As of January 2025, Computer Use requires the API header anthropic-beta: computer-use-2025-01-24 with the claude-sonnet-4-5 model."
```python
import anthropic

client = anthropic.Anthropic()

# Beta tools require the beta namespace and the betas header
response = client.beta.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=4096,
    tools=[
        {
            "type": "computer_20250124",
            "name": "computer",
            "display_width_px": 1920,
            "display_height_px": 1080,
            "display_number": 1,
        },
        {
            "type": "text_editor_20250124",
            "name": "str_replace_editor",
        },
        {
            "type": "bash_20250124",
            "name": "bash",
        },
    ],
    messages=[
        {"role": "user", "content": "Open the calculator app and compute 15 * 37"}
    ],
    betas=["computer-use-2025-01-24"],
)
```
Available Actions
From Anthropic: "The computer use tool supports basic actions (all versions) like mouse_move to move cursor to coordinates. Enhanced actions (computer_20250124) are available in Claude 4 models."
| Action | Description |
|---|---|
| mouse_move | Move cursor to coordinates |
| left_click | Click left mouse button |
| right_click | Right click |
| double_click | Double click |
| type | Type text |
| key | Press keyboard keys |
| scroll | Scroll in direction |
| screenshot | Capture current screen |
Integration Example
```python
import anthropic
import base64
import io

import pyautogui
from PIL import ImageGrab

computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
}

def take_screenshot() -> str:
    screenshot = ImageGrab.grab()
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    return base64.standard_b64encode(buffer.getvalue()).decode()

def execute_action(action: dict):
    # The computer tool's input carries an "action" name plus
    # fields such as "coordinate" and "text"
    if action["action"] == "mouse_move":
        pyautogui.moveTo(*action["coordinate"])
    elif action["action"] == "left_click":
        pyautogui.click()
    elif action["action"] == "type":
        pyautogui.write(action["text"])
    # ... more actions

def run_computer_agent(task: str):
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]

    while True:
        # Get next action
        response = client.beta.messages.create(
            model="claude-sonnet-4-5-20250514",
            max_tokens=4096,
            tools=[computer_tool],
            messages=messages,
            betas=["computer-use-2025-01-24"],
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            break

        # Execute each requested action, then return a fresh
        # screenshot to the model as the tool result
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_action(block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),
                        },
                    }],
                })
        messages.append({"role": "user", "content": tool_results})
```
OpenAI Computer-Using Agent (CUA)
Architecture
From OpenAI: "CUA combines the vision capabilities of GPT-4o with reasoning abilities from OpenAI's more advanced models."
Integration with Operator
CUA powers both Operator and ChatGPT Atlas:
From OpenAI: "Operator is powered by a new model called Computer-Using Agent (CUA)."
Performance: From research: "In benchmark assessments, Operator achieved 38.1% on OSWorld benchmarks and 58.1% on WebArena benchmarks."
CUA Infrastructure
The Cua Project
Open-source infrastructure for computer-use agents:
From GitHub: "Cua provides open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows)."
Features: From research: "Cloud VLM Platform support for Claude Opus, Qwen3 VL 235B, and UI-TARS-2 on Cua VLM cloud infrastructure, along with QEMU Container Support for native Linux and Windows container execution via QEMU virtualization."
Running CUAs Safely
Docker-based isolation:
```dockerfile
FROM ubuntu:22.04

# Install desktop environment
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    firefox \
    python3-pip

# Install agent dependencies
RUN pip install ui-tars anthropic pyautogui

# Start virtual display
CMD ["Xvfb", ":99", "-screen", "0", "1920x1080x24"]
```
VM-based approach:
```python
import subprocess

# Launch QEMU VM
vm_process = subprocess.Popen([
    "qemu-system-x86_64",
    "-m", "8G",
    "-hda", "agent-vm.qcow2",
    "-vnc", ":1",
    "-enable-kvm",
])

# Connect the agent to the VM's VNC display
agent = UITARSAgent(
    vnc_host="localhost",
    vnc_port=5901,
)
```
Benchmarks and Evaluation
OSWorld
Operating system-level task benchmark:
| Agent | Success Rate |
|---|---|
| UI-TARS-1.5 | 42.5% |
| OpenAI CUA | 36.4% |
| Claude 3.7 | 28.0% |
| Human | ~90% |
WebArena
Web-based task benchmark:
| Agent | Success Rate |
|---|---|
| UI-TARS-1.5 | 84.8% |
| OpenAI Operator | 58.1% |
| Claude | ~45% |
AndroidWorld
Mobile GUI tasks:
| Agent | Success Rate |
|---|---|
| UI-TARS-1.5 | 64.2% |
Use Cases
Desktop Automation
- Data entry across applications
- Report generation
- File organization
- Software testing
Enterprise Workflows
- ERP system navigation
- Legacy application automation
- Cross-application data transfer
Personal Productivity
- Email management
- Calendar scheduling
- Document processing
Development
- IDE automation
- Build and deployment tasks
- Testing workflows
Implementation Best Practices
1. Clear Task Specification
```python
# Bad: vague task
task = "Do something with Excel"

# Good: specific task
task = """
Open Microsoft Excel.
Create a new workbook.
In cell A1, type "Name".
In cell B1, type "Revenue".
Add sample data in rows 2-5.
Save the file as "report.xlsx" on the Desktop.
"""
```
2. Checkpoint Verification
```python
async def execute_with_checkpoints(agent, task, checkpoints):
    for i, checkpoint in enumerate(checkpoints):
        result = await agent.execute(checkpoint["action"])

        # Verify checkpoint reached
        screenshot = await agent.screenshot()
        verified = await agent.verify(
            screenshot,
            checkpoint["expected_state"],
        )
        if not verified:
            # Retry or escalate
            await handle_checkpoint_failure(agent, checkpoint)
```
3. Error Recovery
```python
async def robust_execution(agent, task, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = await agent.execute(task)
            if result.success:
                return result
        except Exception as e:
            # Capture state for debugging
            screenshot = await agent.screenshot()

            # Ask agent to diagnose
            diagnosis = await agent.diagnose(
                error=str(e),
                screenshot=screenshot,
                task=task,
            )
            if diagnosis.recoverable:
                await agent.execute(diagnosis.recovery_steps)
            else:
                raise
    raise MaxRetriesExceeded(task)
```
4. Safety Boundaries
```python
import re

class SafeComputerAgent:
    BLOCKED_PATTERNS = [
        r"rm\s+-rf",
        r"format\s+c:",
        r"sudo\s+shutdown",
        r"password|credential|secret",
    ]
    SENSITIVE_APPS = ["banking", "wallet", "keychain"]

    def is_dangerous(self, action) -> bool:
        # Match the action's text form against the blocklist
        text = str(action)
        return any(re.search(p, text, re.IGNORECASE) for p in self.BLOCKED_PATTERNS)

    async def execute(self, action):
        # Check for dangerous commands
        if self.is_dangerous(action):
            raise SafetyViolation(action)

        # Check for sensitive applications
        current_app = await self.get_foreground_app()
        if any(s in current_app.lower() for s in self.SENSITIVE_APPS):
            return await self.request_human_approval(action)

        return await self._execute(action)
```
Comparison: CUA Approaches
| Aspect | UI-TARS | Claude Computer Use | OpenAI CUA |
|---|---|---|---|
| Type | Open-source model | API service | Integrated product |
| Self-hosting | Yes | No | No |
| Desktop support | Full | Full | Browser-focused |
| Best benchmark | OSWorld 42.5% | OSWorld 28% | OSWorld 36.4% |
| Cost | Self-hosted | API pricing | Subscription |
| Customization | Full | Limited | None |
Future Directions
Trends
- Improving accuracy: Current agents achieve 30-40% on complex tasks; aiming for 80%+
- Faster execution: Reducing screenshot-to-action latency
- Better reasoning: System-2 thinking for multi-step planning
- Cross-platform: Unified models for desktop, mobile, web
Challenges
- Security: Agents with system access need robust sandboxing
- Reliability: Current accuracy insufficient for critical tasks
- Speed: Human-level speed requires optimization
- Generalization: Handling unseen applications and edge cases
Conclusion
Computer Use Agents represent the next frontier in AI automation:
- UI-TARS leads in open-source with best-in-class benchmarks
- Claude Computer Use offers API-based access for developers
- OpenAI CUA powers consumer products like Atlas
The technology is rapidly maturing—expect significant improvements in reliability and capability in the coming months.
Related Articles
Agentic Browsing: AI Web Agents and Browser Automation
The rise of AI web agents—Browser-Use, Stagehand, OpenAI Operator, and the tools enabling LLMs to browse, interact, and automate the web autonomously.
Building Agentic AI Systems: A Complete Implementation Guide
Hands-on guide to building AI agents—tool use, ReAct pattern, planning, memory, context management, MCP integration, and multi-agent orchestration. With full prompt examples and production patterns.
Multimodal LLMs: Vision, Audio, and Beyond
Field guide to multimodal LLMs—vision-language models, audio understanding, video comprehension, and any-to-any models. Architecture deep dives, benchmarks, implementation patterns, and production deployment.