
Computer Use Agents: UI-TARS, Claude, and Desktop Automation

Understanding Computer Use Agents (CUA)—from ByteDance's UI-TARS to Claude's desktop automation. How vision-language models are learning to control computers through screenshots and actions.


Beyond Browser Automation

While web agents automate browser tasks, Computer Use Agents (CUAs) go further—they can control entire desktop environments, interact with native applications, and perform any task a human could do with a mouse and keyboard.

From research: "Prominent examples of computer-use agents include commercial products like OpenAI's Operator, Anthropic's Claude Computer Use, Google's Project Mariner and open-source projects like ByteDance's UI-TARS, Agent S2, InfantAgent, and Jedi."

How CUAs Work

The Perception-Action Loop

CUAs operate through a continuous cycle:

Code
1. Capture Screenshot → VLM processes visual state
2. Reason → Plan next action based on task
3. Execute → Mouse/keyboard action
4. Observe → Capture new screenshot
5. Verify → Check if action succeeded
6. Repeat → Until task complete

From OpenAI: "Given a user's instruction, CUA operates through an iterative loop that integrates perception, reasoning, and action: Screenshots from the computer are added to the model's context, CUA reasons through next steps using chain-of-thought, and it performs actions—clicking, scrolling, or typing—until the task is completed."

Vision-Language Models (VLMs)

CUAs are powered by multimodal models that understand both images and text.

Why vision is essential for computer use: Traditional automation (Selenium, PyAutoGUI scripts) relies on programmatic selectors—find the button with ID "submit" and click it. This breaks whenever the UI changes. CUAs take the same approach a human does: look at the screen and act on what it shows. A button that says "Submit" is recognizable whether it has ID "submit", "btn-primary", or no ID at all. This visual understanding makes CUAs more robust to UI changes than traditional automation.

The see-reason-act loop in detail: Each iteration of a CUA involves: (1) Capturing a screenshot—the agent's only window into the computer state; (2) Sending the screenshot + task context to a VLM; (3) The VLM reasoning about what it sees and what action would progress toward the goal; (4) Executing the chosen action (click, type, scroll); (5) Waiting for the UI to respond; (6) Repeating until the task is complete or an error is detected. This loop typically runs 10-50 times for a multi-step task.
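
The loop above can be sketched as a generic driver with the VLM call abstracted behind a callback. All names here are illustrative, not a real SDK:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "click", "type", "done"
    payload: dict

def run_loop(capture_screenshot, choose_action, execute, max_steps=50):
    """See-reason-act loop: perceive, plan, act, observe, repeat."""
    history = []
    for _ in range(max_steps):
        shot = capture_screenshot()             # (1) perceive: only window into state
        action = choose_action(shot, history)   # (2)-(3) VLM reasons and picks an action
        if action.kind == "done":               # (6) stop when the task is complete
            return history
        execute(action)                         # (4) act; (5) UI responds before next shot
        history.append(action)
    raise RuntimeError("gave up after max_steps")

# Stub policy standing in for the VLM: type a query, then finish
script = iter([Action("type", {"text": "AI news"}), Action("done", {})])
executed = run_loop(lambda: b"", lambda shot, hist: next(script), lambda a: None)
```

In a real agent, `choose_action` is the VLM inference call and `execute` drives the mouse and keyboard; the stub just makes the control flow visible.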

The latency challenge: Each loop iteration involves a VLM inference (200-2000ms depending on model), screenshot capture (~50ms), and action execution (varies). A 20-step task might take 10-40 seconds just in inference time. This is why CUAs feel slower than scripted automation—but they're more flexible.
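
The arithmetic behind that estimate is easy to sanity-check with the per-step figures above (action execution time excluded):

```python
# Per-step costs quoted in the text above
inference_ms_low, inference_ms_high = 200, 2000   # VLM inference range
screenshot_ms = 50                                 # screenshot capture
steps = 20

low_s = steps * (inference_ms_low + screenshot_ms) / 1000
high_s = steps * (inference_ms_high + screenshot_ms) / 1000
print(f"{steps}-step task: {low_s:.0f}s to {high_s:.0f}s before action time")
```

With a fast model the loop overhead is a few seconds; with a slow one it dominates the whole task.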

| Model | Use Case |
| --- | --- |
| GPT-4o | OpenAI Operator/CUA |
| Claude Sonnet | Anthropic Computer Use |
| UI-TARS | ByteDance desktop agent |
| Qwen-VL | Open-source alternative |

ByteDance UI-TARS

Overview

The leading open-source computer use model.

Why UI-TARS matters for the ecosystem: Before UI-TARS, computer use was proprietary—you could use Claude Computer Use or OpenAI Operator, but you couldn't run models locally or fine-tune them for your specific use case. UI-TARS provides a capable open-source alternative that you can run on your own hardware, customize for specific applications, and deploy without API costs. This democratizes computer use capabilities.

The unified model approach: Many earlier approaches used separate models for different tasks: one model for understanding what's on screen, another for deciding what to do, another for locating where to click. UI-TARS combines all these capabilities into a single vision-language model. This unified approach reduces latency (one model call instead of several) and enables better reasoning (the same model that understands the screen also decides actions).

From ByteDance: "UI-TARS is a native GUI agent model that solely perceives screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations)."

From VentureBeat: "UI-TARS, which stands for User Interface — Task Automation and Reasoning System, is engineered to enhance interactions with graphical user interfaces through sophisticated AI capabilities."

Architecture

From research: "Unlike conventional modular systems, UI-TARS consolidates essential elements—perception, reasoning, grounding, and memory—into a unified vision-language model (VLM)."

Model sizes:

  • UI-TARS-7B: Lightweight version
  • UI-TARS-72B: Full capability version

From research: "Trained on roughly 50B tokens and offered in 7B- and 72B-parameter versions."

Key Innovations

From the UI-TARS paper:

  1. Enhanced Perception: "Leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning."

  2. Unified Action Modeling: "Standardizes actions into a unified space across platforms."

  3. System-2 Reasoning: "Incorporates deliberate reasoning into multi-step decision making, involving task decomposition, reflection thinking, and milestone recognition."
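
The paper does not publish these abstractions as a Python API; as a purely hypothetical illustration, a unified action space might normalize platform-specific interactions into one schema:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class UnifiedAction:
    """One schema for desktop, web, and mobile interactions (illustrative)."""
    name: str                                   # "click", "type", "scroll", ...
    coordinate: Optional[Tuple[int, int]] = None
    text: Optional[str] = None

    def to_desktop(self) -> str:
        # Render to a concrete desktop backend (pyautogui-style) call
        if self.name == "click" and self.coordinate:
            return f"pyautogui.click({self.coordinate[0]}, {self.coordinate[1]})"
        if self.name == "type" and self.text is not None:
            return f"pyautogui.write({self.text!r})"
        raise ValueError(f"unsupported action: {self.name}")

click = UnifiedAction("click", coordinate=(640, 360))
```

Because every platform shares the schema, a single model can emit actions without knowing which backend will execute them.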

Performance

From VentureBeat: "The PC/MacOS agent achieves state-of-the-art (SOTA) performance on 10+ GUI benchmarks across perception, grounding and overall agent capabilities, consistently beating out OpenAI's GPT-4o, Claude and Google's Gemini."

Benchmark comparison: From research: "In head-to-head tests against OpenAI CUA and Claude 3.7, UI-TARS-1.5 came out on top:"

| Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 |
| --- | --- | --- | --- |
| OSWorld (computer) | 42.5% | 36.4% | 28% |
| WebVoyager (browser) | 84.8% | - | - |
| AndroidWorld (phone) | 64.2% | - | - |

Latest Version: UI-TARS-2

From ByteDance: "On 2025.09.04, ByteDance announced the release of UI-TARS-2, which is a major upgrade featuring enhanced capabilities in GUI, Game, Code and Tool Use. It is an 'All In One' Agent model."

Agent TARS: From ByteDance: "Agent TARS is a multimodal AI agent that aims to explore a work form closer to human-like task completion through rich multimodal capabilities (such as GUI Agent, Vision) and seamless integration with various real-world tools."

Using UI-TARS

Desktop application:

Bash
# UI-TARS Desktop (Electron app)
git clone https://github.com/bytedance/UI-TARS-desktop
cd UI-TARS-desktop
npm install
npm start

Python SDK:

Python
from ui_tars import UITARSAgent

agent = UITARSAgent(
    model="ui-tars-72b",
    screenshot_interval=0.5
)

# Execute task
result = await agent.execute(
    task="Open Chrome and search for 'AI news'",
    max_steps=20
)

Anthropic Claude Computer Use

API-Based Approach

Claude's computer use is developer-focused.

The API-first philosophy: Unlike OpenAI's Operator (which is a consumer product), Claude Computer Use is designed for developers to build into their own applications. You get the computer use capability as a tool that Claude can use, but you're responsible for the surrounding infrastructure: capturing screenshots, executing actions, managing the environment. This gives maximum flexibility but requires more engineering work.

Security through isolation: Computer use is inherently risky—you're letting an AI control a computer. Anthropic's approach pushes security responsibility to the developer: run in a sandbox, use a VM, limit what the agent can access. The reference implementation uses Docker containers with display forwarding. For production, you'd want additional safeguards: network isolation, file system restrictions, action rate limiting, and human-in-the-loop approval for sensitive operations.
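
One of the safeguards named above, action rate limiting, is simple to sketch. The class and thresholds here are illustrative, not part of Anthropic's reference implementation:

```python
import time
from typing import List, Optional

class ActionRateLimiter:
    """Caps how many agent actions may execute per rolling time window."""

    def __init__(self, max_actions: int, window_s: float):
        self.max_actions = max_actions
        self.window_s = window_s
        self._timestamps: List[float] = []

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Discard timestamps that have aged out of the window
        self._timestamps = [t for t in self._timestamps if now - t < self.window_s]
        if len(self._timestamps) >= self.max_actions:
            return False  # caller should pause or escalate to a human
        self._timestamps.append(now)
        return True

limiter = ActionRateLimiter(max_actions=2, window_s=1.0)
```

A runaway agent clicking in a tight loop gets throttled instead of hammering the system.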

The tool-based architecture: Claude Computer Use is implemented as tools that Claude can call. This integrates naturally with Claude's existing function calling: Claude reasons about the task, decides to use the computer tool, specifies coordinates and action type, and your code executes it. This design means computer use works alongside other tools—Claude can use both computer use AND web search in the same task.

From Anthropic: "Developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text."

Tool Configuration

From Anthropic: "As of January 2025, Computer Use requires the API header anthropic-beta: computer-use-2025-01-24 with the claude-sonnet-4-5 model."

Python
import anthropic

client = anthropic.Anthropic()

# Beta tools are called through the beta messages namespace in the Python SDK
response = client.beta.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=4096,
    tools=[
        {
            "type": "computer_20250124",
            "name": "computer",
            "display_width_px": 1920,
            "display_height_px": 1080,
            "display_number": 1,
        },
        {
            "type": "text_editor_20250124",
            "name": "str_replace_editor",
        },
        {
            "type": "bash_20250124",
            "name": "bash",
        },
    ],
    messages=[
        {"role": "user", "content": "Open the calculator app and compute 15 * 37"}
    ],
    betas=["computer-use-2025-01-24"],
)

Available Actions

From Anthropic: "The computer use tool supports basic actions (all versions) like mouse_move to move cursor to coordinates. Enhanced actions (computer_20250124) are available in Claude 4 models."

| Action | Description |
| --- | --- |
| mouse_move | Move cursor to coordinates |
| left_click | Click left mouse button |
| right_click | Right click |
| double_click | Double click |
| type | Type text |
| key | Press keyboard keys |
| scroll | Scroll in direction |
| screenshot | Capture current screen |

Integration Example

Python
import io
import base64

import anthropic
import pyautogui
from PIL import ImageGrab

# Tool definition matching the configuration shown earlier
computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
}

def take_screenshot() -> str:
    screenshot = ImageGrab.grab()
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    return base64.standard_b64encode(buffer.getvalue()).decode()

def execute_action(action):
    # The computer tool's input uses an "action" key and [x, y] coordinates
    if action["action"] == "mouse_move":
        pyautogui.moveTo(*action["coordinate"])
    elif action["action"] == "left_click":
        pyautogui.click()
    elif action["action"] == "type":
        pyautogui.write(action["text"])
    # ... more actions

def run_computer_agent(task: str):
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]

    while True:
        # Ask Claude for the next action
        response = client.beta.messages.create(
            model="claude-sonnet-4-5-20250514",
            max_tokens=4096,
            tools=[computer_tool],
            messages=messages,
            betas=["computer-use-2025-01-24"],
        )
        # Record the assistant turn so the conversation stays valid
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            break  # no more actions requested: task complete

        # Execute each requested action, then return a fresh screenshot
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_action(block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),
                        },
                    }],
                })
        messages.append({"role": "user", "content": tool_results})

OpenAI Computer-Using Agent (CUA)

Architecture

From OpenAI: "CUA combines the vision capabilities of GPT-4o with reasoning abilities from OpenAI's more advanced models."

Integration with Operator

CUA powers both Operator and ChatGPT Atlas:

From OpenAI: "Operator is powered by a new model called Computer-Using Agent (CUA)."

Performance: From research: "In benchmark assessments, Operator achieved 38.1% on OSWorld benchmarks and 58.1% on WebArena benchmarks."

CUA Infrastructure

The Cua Project

Open-source infrastructure for computer-use agents:

From GitHub: "Cua provides open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows)."

Features: From research: "Cloud VLM Platform support for Claude Opus, Qwen3 VL 235B, and UI-TARS-2 on Cua VLM cloud infrastructure, along with QEMU Container Support for native Linux and Windows container execution via QEMU virtualization."

Running CUAs Safely

Docker-based isolation:

Docker
FROM ubuntu:22.04

# Install desktop environment
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    firefox \
    python3-pip

# Install agent dependencies
RUN pip install ui-tars anthropic pyautogui

# Start virtual display
CMD ["Xvfb", ":99", "-screen", "0", "1920x1080x24"]

VM-based approach:

Python
import subprocess

# Launch QEMU VM
vm_process = subprocess.Popen([
    "qemu-system-x86_64",
    "-m", "8G",
    "-hda", "agent-vm.qcow2",
    "-vnc", ":1",
    "-enable-kvm"
])

# Connect agent to VM's VNC
agent = UITARSAgent(
    vnc_host="localhost",
    vnc_port=5901
)

Benchmarks and Evaluation

OSWorld

Operating system-level task benchmark:

| Agent | Success Rate |
| --- | --- |
| UI-TARS-1.5 | 42.5% |
| OpenAI CUA | 36.4% |
| Claude 3.7 | 28.0% |
| Human | ~90% |

WebArena

Web-based task benchmark:

| Agent | Success Rate |
| --- | --- |
| UI-TARS-1.5 | 84.8% |
| OpenAI Operator | 58.1% |
| Claude | ~45% |

AndroidWorld

Mobile GUI tasks:

| Agent | Success Rate |
| --- | --- |
| UI-TARS-1.5 | 64.2% |

Use Cases

Desktop Automation

  • Data entry across applications
  • Report generation
  • File organization
  • Software testing

Enterprise Workflows

  • ERP system navigation
  • Legacy application automation
  • Cross-application data transfer

Personal Productivity

  • Email management
  • Calendar scheduling
  • Document processing

Development

  • IDE automation
  • Build and deployment tasks
  • Testing workflows

Implementation Best Practices

1. Clear Task Specification

Python
# Bad: vague task
task = "Do something with Excel"

# Good: specific task
task = """
Open Microsoft Excel.
Create a new workbook.
In cell A1, type "Name".
In cell B1, type "Revenue".
Add sample data in rows 2-5.
Save the file as "report.xlsx" on the Desktop.
"""

2. Checkpoint Verification

Python
async def execute_with_checkpoints(agent, task, checkpoints):
    for i, checkpoint in enumerate(checkpoints):
        result = await agent.execute(checkpoint["action"])

        # Verify checkpoint reached
        screenshot = await agent.screenshot()
        verified = await agent.verify(
            screenshot,
            checkpoint["expected_state"]
        )

        if not verified:
            # Retry or escalate
            await handle_checkpoint_failure(agent, checkpoint)

3. Error Recovery

Python
class MaxRetriesExceeded(Exception):
    """Raised when every attempt at a task fails."""

async def robust_execution(agent, task, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = await agent.execute(task)
            if result.success:
                return result
        except Exception as e:
            # Capture state for debugging
            screenshot = await agent.screenshot()

            # Ask agent to diagnose
            diagnosis = await agent.diagnose(
                error=str(e),
                screenshot=screenshot,
                task=task
            )

            if diagnosis.recoverable:
                await agent.execute(diagnosis.recovery_steps)
            else:
                raise

    raise MaxRetriesExceeded(task)

4. Safety Boundaries

Python
import re

class SafetyViolation(Exception):
    """Raised when an action matches a blocked pattern."""

class SafeComputerAgent:
    BLOCKED_PATTERNS = [
        r"rm\s+-rf",
        r"format\s+c:",
        r"sudo\s+shutdown",
        r"password|credential|secret",
    ]

    SENSITIVE_APPS = ["banking", "wallet", "keychain"]

    def is_dangerous(self, action) -> bool:
        # Flag any action whose payload matches a blocked pattern
        text = str(action).lower()
        return any(re.search(p, text) for p in self.BLOCKED_PATTERNS)

    async def execute(self, action):
        # Check for dangerous commands
        if self.is_dangerous(action):
            raise SafetyViolation(action)

        # Check for sensitive applications
        current_app = await self.get_foreground_app()
        if any(s in current_app.lower() for s in self.SENSITIVE_APPS):
            return await self.request_human_approval(action)

        return await self._execute(action)

Comparison: CUA Approaches

| Aspect | UI-TARS | Claude Computer Use | OpenAI CUA |
| --- | --- | --- | --- |
| Type | Open-source model | API service | Integrated product |
| Self-hosting | Yes | No | No |
| Desktop support | Full | Full | Browser-focused |
| Best benchmark | OSWorld 42.5% | OSWorld 28% | OSWorld 36.4% |
| Cost | Self-hosted | API pricing | Subscription |
| Customization | Full | Limited | None |

Future Directions

  1. Improving accuracy: Current agents achieve 30-40% on complex tasks; the goal is 80%+
  2. Faster execution: Reducing screenshot-to-action latency
  3. Better reasoning: System-2 thinking for multi-step planning
  4. Cross-platform: Unified models for desktop, mobile, web

Challenges

  • Security: Agents with system access need robust sandboxing
  • Reliability: Current accuracy insufficient for critical tasks
  • Speed: Human-level speed requires optimization
  • Generalization: Handling unseen applications and edge cases

Conclusion

Computer Use Agents represent the next frontier in AI automation:

  1. UI-TARS leads in open-source with best-in-class benchmarks
  2. Claude Computer Use offers API-based access for developers
  3. OpenAI CUA powers consumer products like Atlas

The technology is rapidly maturing—expect significant improvements in reliability and capability in the coming months.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
