Which browser automation framework should I use?

For quick prototyping: Browser-Use (Python, simple API). For production: Stagehand + Browserbase (more control, better infrastructure). For consumer tasks: ChatGPT agent / Claude Computer Use.

How reliable are AI web agents?

Improving rapidly but not perfect. OpenAI Operator achieves ~58% on web benchmarks. Best for repetitive tasks on familiar sites. Always have fallback plans for critical workflows.

Can websites detect AI agents?

Yes, through various signals. Use stealth modes, residential proxies, and human-like delays. Some sites actively block automation—respect terms of service.

How do I handle authentication?

Store credentials securely (never in code), use session persistence, and require human confirmation for login flows. Consider OAuth where possible.

Most agents pause for human input on CAPTCHAs. Some services offer CAPTCHA solving, but this may violate terms of service. Design workflows to minimize CAPTCHA encounters.

Agentic Browsing: AI Web Agents and Browser Automation | Enrico Piovano

The Agentic Browser Revolution

AI agents are learning to browse the web like humans do—looking at screens, clicking buttons, filling forms, and navigating complex workflows.

From research: "These are not just browsers with AI add-ons. They are full-fledged platforms where AI agents act on your behalf, executing tasks like browsing, searching, extracting, and automating with minimal user input."

This post covers the tools, frameworks, and approaches enabling agentic web interaction.

Why Agentic Browsing?

Traditional web automation (Playwright, Puppeteer, Selenium) is brittle:

Breaks when sites change
Requires CSS selectors that become outdated
Can't handle dynamic content well
Needs constant maintenance

AI web agents solve this by understanding pages visually and semantically, adapting to changes automatically.

From research: "Models are becoming more capable, soon operating a browser faster than a human can. Instead of rebuilding the internet for AI, soon AI will be browsing alongside us."

Commercial Solutions

The commercial landscape for web agents is evolving rapidly. Each major AI lab has a different philosophy: OpenAI is building consumer products (Operator, Atlas), Anthropic provides developer APIs (Claude Computer Use), and Perplexity is creating an entirely new browser category. Understanding these differences helps you choose the right approach for your use case.

The API vs. product distinction matters: Consumer products like Operator and Atlas are easy to use but offer limited customization—you get what they give you. API-based approaches like Claude Computer Use require more engineering but give you full control over the agent's environment, safety rails, and behavior. For enterprise automation, API-based approaches are typically preferred.

OpenAI Operator / ChatGPT Agent

OpenAI's consumer-facing web agent.

From OpenAI: "Operator is a research preview of an agent that can use its own browser to perform tasks for users."

Capabilities: From OpenAI: "Operator can be asked to handle a wide variety of repetitive browser tasks such as filling out forms, ordering groceries, and even creating memes."

Technology (CUA - Computer-Using Agent): From OpenAI: "CUA combines the vision capabilities of GPT-4o with reasoning abilities from OpenAI's more advanced models."

How it works: From OpenAI: "Given a user's instruction, CUA operates through an iterative loop: Screenshots from the computer are added to the model's context, CUA reasons through next steps using chain-of-thought, and it performs actions—clicking, scrolling, or typing—until the task is completed."

Performance: From research: "In benchmark assessments, Operator achieved 38.1% on OSWorld benchmarks and 58.1% on WebArena benchmarks."

Current status: From OpenAI: "As of July 17, 2025, Operator is now fully integrated into ChatGPT as ChatGPT agent. Pro, Plus, and Team users can activate agentic capabilities through the tools dropdown by selecting 'agent mode'."

Claude Computer Use

Anthropic's approach to web and desktop automation:

From Anthropic: "With computer use, Anthropic is trying something fundamentally new. Instead of making specific tools to help Claude complete individual tasks, they're teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people."

API access: From Anthropic: "As of January 2025, Computer Use requires the API header anthropic-beta: computer-use-2025-01-24 with the claude-sonnet-4-5 model."

Core tools: From Anthropic: "Three core tools work together for comprehensive automation: Computer tool for mouse/keyboard input, Text Editor for file operations, and Bash tool for system commands."

Safety: From Anthropic: "Anthropic recommends limiting computer use to trusted environments such as virtual machines or containers with minimal privileges, and avoiding giving computer use access to sensitive accounts or data."

AI-Native Browsers

Perplexity Comet

The first major AI-native browser:

From Perplexity: "Comet browser by Perplexity is the AI browser that acts as a personal assistant. It automates tasks, researches the web, organizes your email, and more."

From Wikipedia: "Comet is an AI-powered web browser based on Chromium. It was released by Perplexity AI for Microsoft Windows and macOS on July 9, 2025."

Key capabilities: From research: "Comet Browser is the first truly agentic browser, meaning you can not only talk to the built-in Assistant (powered by Perplexity), but you can have it interact with your tabs, emails, calendar, and even navigate the web for you as if an assistant took over your screen."

Automation features: From research: "Perplexity Comet is considered the most capable AI agent browser in 2025, successfully automating multi-step web tasks. While ChatGPT excels at conversation, Comet specializes in web-based automation: clicking buttons, filling forms, comparing information across tabs, and executing multi-step browsing workflows."

Specialized agents:

Live Marketing Intelligence Agent
Research Agent for competitor analysis
Data collection automation

Availability: From research: "The browser was released for free download in October 2025. Comet is based on Chromium, supporting Chrome extensions and bookmark imports."

Security note: From research: "Researchers at LayerX Security identified a malicious attack vector called CometJacking that could exfiltrate user data. The exploit was responsibly disclosed in August 2025."

OpenAI ChatGPT Atlas

OpenAI's answer to Comet:

From OpenAI: "ChatGPT Atlas officially launched on October 21, 2025, as a next-generation browser that fuses the power of ChatGPT directly into web navigation. It's the first web browser built from the ground up around artificial intelligence."

Agent Mode: From OpenAI: "Agent Mode is the part of Atlas that can autonomously perform sequences of web actions under your supervision. Think of it as an attentive coworker using a managed browser on your behalf: it plans steps, navigates websites, and pauses to confirm sensitive actions."

Demo capabilities: From OpenAI: "During the macOS demo, OpenAI showed Atlas automatically: opening a recipe page, identifying ingredients, and adding them to an Instacart cart; navigating e-commerce sites, comparing listings, and checking delivery options; executing multistep workflows—booking, purchasing, or form submissions—with user consent."

Browser Memories: From OpenAI: "Optional AI memory that recalls context from past sessions, like summarizing job listings you viewed last week."

Safety guardrails: From OpenAI: "It cannot run code in the browser, download files, or install extensions; it cannot access other apps on your computer or file system; it will pause to ensure you're watching it take actions on specific sensitive sites such as financial institutions."

Availability: From research: "Platforms: macOS (Windows/iOS/Android coming soon). Agent Mode requires paid ChatGPT subscription."

The AI Browser Wars

From research: "Three players are racing to redefine how we interact with the internet: Perplexity just launched Comet, The Browser Company has pivoted from Arc to Dia (an 'AI-native' browser experience), and OpenAI launched ChatGPT Atlas on October 21, 2025."

Browser	Company	Focus	Launch
Comet	Perplexity	Research + automation	July 2025
Atlas	OpenAI	General agent tasks	October 2025
Dia	The Browser Company	AI-native UX	Coming soon

Open-Source Frameworks

Open-source frameworks democratize web agent capabilities. You can run these locally, customize them for your specific use case, and avoid per-request API costs for the automation layer (though you still pay for LLM inference).

Why open-source matters for web automation: Commercial web agents run in the vendor's environment—you send tasks, they execute them. For sensitive workflows (logging into corporate systems, handling customer data), this may not be acceptable. Open-source frameworks let you run agents in your own infrastructure with your own security controls.

The three main approaches:

Browser-Use: General-purpose browser automation with LLM control
Stagehand: Developer-friendly SDK with Playwright integration
Agent-E/LaVague: More experimental, research-oriented frameworks

Browser-Use

The most popular open-source browser automation framework.

From research: "Browser-Use aims to make websites accessible for AI agents and automate tasks online with ease. Backed by Y Combinator with millions in funding."

Installation:

Bash

pip install browser-use
playwright install

Basic usage:

Python

from browser_use import Agent
from langchain_openai import ChatOpenAI

agent = Agent(
    task="Find flights from NYC to London for next week",
    llm=ChatOpenAI(model="gpt-4o"),
)

result = await agent.run()
print(result)

With custom browser:

Python

from browser_use import Agent, Browser, BrowserConfig

browser = Browser(
    config=BrowserConfig(
        headless=False,  # See the browser
        disable_security=True,
    )
)

agent = Agent(
    task="Log into my email and summarize unread messages",
    llm=ChatOpenAI(model="gpt-4o"),
    browser=browser,
)

Use cases: From research: "Whether you're running a sales pipeline or automating competitive research, Browser Use gives you the power of agents without the complexity."

Stagehand (by Browserbase)

A more controlled approach to browser automation:

From Browserbase: "Stagehand is an AI web agent framework that bridges the gap between brittle traditional automation and unpredictable full-agent solutions."

Key concept: From Browserbase: "Think of it as Playwright with an AI copilot. If a script breaks because a button moves or changes, the LLM can adapt in real-time. You can tell it to 'click the first product listed,' and it'll understand and act accordingly."

Atomic primitives:

act() - Perform actions
extract() - Get data from pages
observe() - Understand page state

Usage:

TypeScript

import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({
  env: "LOCAL",
  modelName: "gpt-4o",
});

await stagehand.init();
await stagehand.page.goto("https://example.com");

// Natural language actions
await stagehand.act({ action: "click the login button" });

// Extract structured data
const products = await stagehand.extract({
  instruction: "extract all product names and prices",
  schema: z.array(z.object({
    name: z.string(),
    price: z.number(),
  })),
});

Browserbase

Infrastructure for running browser agents at scale:

From Browserbase: "Browserbase offers serverless browsers that are reliable, fast, and scalable. They manage the infrastructure so you can focus on building."

Key features:

Serverless browser instances
Compatible with Playwright, Puppeteer, Selenium
Session management
Proxy support
Stealth mode for anti-bot bypass

Integration:

Python

from playwright.async_api import async_playwright
import os

async def run():
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(
            f"wss://connect.browserbase.com?apiKey={os.environ['BROWSERBASE_API_KEY']}"
        )
        page = await browser.new_page()
        await page.goto("https://example.com")
        # ... automation logic

Other Frameworks

LaVague: Natural language web automation framework:

Python

from lavague.core import WorldModel, ActionEngine
from lavague.drivers.selenium import SeleniumDriver

driver = SeleniumDriver()
action_engine = ActionEngine(driver)
world_model = WorldModel()

agent = WebAgent(world_model, action_engine)
agent.get("https://example.com")
agent.run("Click on the 'Products' menu")

Fellou: From research: "The world's first self-driving browser where AI doesn't just chat, it acts. It can automate multi-step tasks across websites, analyze data, and get things done."

Architecture Patterns

Vision-Based vs DOM-Based

Vision-based (Screenshot):

Python

# Agent sees screenshot, identifies elements visually
screenshot = page.screenshot()
action = model.identify_action(screenshot, task)
# More robust to page changes

DOM-based (HTML):

Python

# Agent parses HTML, finds elements by selectors
html = page.content()
action = model.plan_action(html, task)
# More precise but brittle

Hybrid approach (recommended):

Python

# Combine both for best results
screenshot = page.screenshot()
html = page.content()
accessibility_tree = page.accessibility_tree()

action = model.plan_action(
    screenshot=screenshot,
    html_context=html,
    a11y_tree=accessibility_tree,
    task=task
)

Multi-Step Planning

Complex tasks require planning:

Python

class WebAgentWithPlanning:
    def __init__(self, model, browser):
        self.model = model
        self.browser = browser

    async def execute_task(self, task: str):
        # Step 1: Plan
        plan = await self.model.create_plan(task)

        # Step 2: Execute steps
        for step in plan.steps:
            observation = await self.observe_page()
            action = await self.model.decide_action(step, observation)
            result = await self.execute_action(action)

            # Step 3: Verify and adapt
            if not await self.verify_step(step, result):
                # Replan if needed
                plan = await self.model.replan(task, step, result)

        return await self.get_final_result()

Error Recovery

Python

async def robust_action(agent, action, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = await agent.execute(action)
            if result.success:
                return result
        except Exception as e:
            # Take screenshot for debugging
            screenshot = await agent.screenshot()

            # Ask model to diagnose and recover
            recovery = await agent.model.diagnose_error(
                action=action,
                error=str(e),
                screenshot=screenshot
            )

            if recovery.should_retry:
                action = recovery.modified_action
            else:
                raise

    raise MaxRetriesExceeded(action)

Production Considerations

Anti-Bot Detection

Many sites block automated browsers:

Python

# Use stealth plugins
from playwright_stealth import stealth_async

browser = await playwright.chromium.launch()
page = await browser.new_page()
await stealth_async(page)

# Or use Browserbase's stealth mode
browser = Browser(
    config=BrowserConfig(
        stealth_mode=True,
        proxy="residential"  # Rotating residential proxies
    )
)

Rate Limiting

Python

import asyncio

class RateLimitedAgent:
    def __init__(self, agent, requests_per_minute=10):
        self.agent = agent
        self.delay = 60 / requests_per_minute

    async def act(self, action):
        result = await self.agent.act(action)
        await asyncio.sleep(self.delay)
        return result

Session Management

Python

# Persist cookies/state between runs
async def save_session(page, session_file):
    cookies = await page.context.cookies()
    storage = await page.context.storage_state()
    with open(session_file, 'w') as f:
        json.dump({"cookies": cookies, "storage": storage}, f)

async def load_session(browser, session_file):
    with open(session_file) as f:
        state = json.load(f)
    context = await browser.new_context(storage_state=state)
    return context

Monitoring and Logging

Python

class MonitoredAgent:
    def __init__(self, agent, logger):
        self.agent = agent
        self.logger = logger

    async def act(self, action):
        # Log action
        self.logger.info(f"Action: {action}")

        # Take before screenshot
        before = await self.agent.screenshot()

        # Execute
        start = time.time()
        result = await self.agent.act(action)
        duration = time.time() - start

        # Take after screenshot
        after = await self.agent.screenshot()

        # Log result
        self.logger.info(f"Result: {result.success}, Duration: {duration:.2f}s")

        # Store trace for debugging
        self.store_trace(action, before, after, result)

        return result

Use Cases

E-Commerce Automation

Price monitoring across competitors
Automated purchasing
Inventory tracking

Data Collection

Lead generation
Research aggregation
Content monitoring

Testing

End-to-end testing with natural language
Visual regression testing
Accessibility testing

Personal Automation

Form filling
Appointment booking
Social media management

Security Considerations

From Anthropic: "Vulnerabilities like jailbreaking or prompt injection may persist across frontier AI systems. Claude instructions on webpages or contained in images may override instructions or cause Claude to make mistakes."

Best practices:

Run in sandboxed environments (VMs, containers)
Don't give access to sensitive accounts
Require human confirmation for critical actions
Monitor all agent activity
Use separate credentials for automation

Conclusion

Agentic browsing is rapidly maturing:

Commercial solutions (Operator, Claude Computer Use) for consumer tasks
Open-source frameworks (Browser-Use, Stagehand) for developers
Infrastructure (Browserbase) for production scale

The future: natural language becomes the interface for web interaction.

Agentic Browsing: AI Web Agents and Browser Automation

Table of Contents

The Agentic Browser Revolution

Why Agentic Browsing?

Commercial Solutions

OpenAI Operator / ChatGPT Agent

Claude Computer Use

AI-Native Browsers

Perplexity Comet

OpenAI ChatGPT Atlas

The AI Browser Wars

Open-Source Frameworks

Browser-Use

Stagehand (by Browserbase)

Browserbase

Other Frameworks

Architecture Patterns

Vision-Based vs DOM-Based

Multi-Step Planning

Error Recovery

Production Considerations

Anti-Bot Detection

Rate Limiting

Session Management

Monitoring and Logging

Use Cases

E-Commerce Automation

Data Collection

Testing

Personal Automation

Security Considerations

Conclusion

Frequently Asked Questions

Enrico Piovano, PhD

Related Articles

Computer Use Agents: UI-TARS, Claude, and Desktop Automation

Building Agentic AI Systems: A Complete Implementation Guide

Agentic Engine Optimization (AEO): Preparing for AI Agents in 2025