
Browser-Use: How AI Agents Control Web Browsers

A comprehensive technical analysis of Browser-Use—the library powering browser automation in AI agents like OpenManus and Cline. Understanding DOM extraction, element indexing, action execution, and how LLMs interact with web pages.


The Browser Automation Challenge

When AI agents need to interact with the web, they face a fundamental challenge: web pages are designed for humans, not machines. Unlike APIs with structured endpoints, web interfaces present visual layouts, interactive elements scattered across complex DOM trees, and dynamic content that changes based on user actions.

Browser-Use solves this challenge by providing a sophisticated abstraction layer between AI agents and web browsers. Rather than forcing agents to work with raw HTML or pixel coordinates, it extracts meaningful structure from web pages, indexes interactive elements for easy reference, and translates agent intentions into precise browser actions.

This library powers browser automation in frameworks like OpenManus, Cline, and numerous other AI agent implementations. Understanding how it works reveals the engineering required to make AI agents effective web users.

Architecture Overview

Browser-Use implements an event-driven architecture built on the Chrome DevTools Protocol. Rather than using high-level browser automation APIs directly, it connects to Chrome's debugging interface, enabling fine-grained control over browser behavior while maintaining the flexibility needed for AI agent integration.

The architecture comprises several interconnected systems. The browser session manages the connection to Chrome and coordinates all browser operations. The DOM service extracts and processes page structure into agent-friendly representations. The tools system defines the action vocabulary available to agents and executes their commands. The watchdog system monitors browser events and handles background tasks like download tracking and crash detection.

These systems communicate through an event bus that enables loose coupling while maintaining coordination. When an agent requests an action, the tools system dispatches an event that watchdogs and handlers process asynchronously. This event-driven design enables responsive, non-blocking operation even during complex multi-step interactions.
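The dispatch-and-observe pattern described above can be sketched in a few lines. This is an illustrative stand-in, not Browser-Use's actual API; the `EventBus` class and the `"click_element"` event name are assumptions chosen for the example.

```python
# Minimal event-bus sketch: multiple handlers react to one dispatched action.
from collections import defaultdict
from typing import Any, Callable

class EventBus:
    """Routes named events to every handler registered for them."""
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], Any]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], Any]) -> None:
        self._handlers[event_type].append(handler)

    def dispatch(self, event_type: str, payload: dict) -> list[Any]:
        # Every subscriber sees the event; results are aggregated for the caller.
        return [handler(payload) for handler in self._handlers[event_type]]

bus = EventBus()
bus.subscribe("click_element", lambda e: f"clicked index {e['index']}")
bus.subscribe("click_element", lambda e: "screenshot captured")
results = bus.dispatch("click_element", {"index": 5})
```

The indirection is what lets one dispatch fan out to both the click handler and a screenshot handler without either knowing about the other.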

Connecting to Chrome via CDP

Why Chrome DevTools Protocol

Browser-Use communicates with Chrome through the Chrome DevTools Protocol rather than higher-level automation APIs. CDP provides direct access to Chrome's internal capabilities—DOM inspection, JavaScript execution, network interception, and input simulation—with granular control impossible through standard automation interfaces.

This low-level access enables capabilities that high-level APIs can't provide. Accessibility tree extraction reveals how screen readers perceive page structure. Layout metrics expose precise element positioning. Input event simulation mimics real user interactions at the hardware level. These capabilities prove essential for robust browser automation.

Session Management

The browser session establishes and maintains the CDP connection. When initialized, it launches or connects to a Chrome instance, negotiates the debugging protocol version, and begins monitoring for browser events.

Session management handles the complexity of modern web applications. A single page might contain multiple iframes, each requiring its own CDP session. Pop-up windows create new targets that need tracking. Service workers and web workers add additional contexts. The session manager maintains a registry of all targets and their associated CDP sessions, routing commands appropriately.

Target discovery runs continuously in the background. When new tabs open, iframes load, or workers spawn, the session manager detects these changes and establishes appropriate connections. When targets close, it cleans up associated resources. This dynamic management ensures the agent always has access to all relevant browser contexts.

DOM Extraction and Processing

The Extraction Pipeline

Converting a web page into a representation that AI agents can understand requires multi-stage processing. Raw DOM trees contain thousands of nodes, most irrelevant for interaction. The extraction pipeline filters, enhances, and transforms this data into a focused representation of interactive elements.

The first stage collects data from Chrome through multiple CDP commands. A DOM snapshot captures the document structure with layout and visibility information. The full DOM tree provides detailed node properties and relationships. The accessibility tree reveals how assistive technologies interpret the page. Layout metrics establish viewport dimensions and scroll position.

The second stage builds enhanced node representations that combine information from all sources. Each node gains properties describing its visibility, interactivity, position, and accessibility role. Parent-child relationships enable understanding of document structure. This enhanced representation contains everything needed for downstream processing.

The third stage serializes the enhanced tree into agent-friendly format. Interactive elements receive sequential indices that agents use for targeting. Non-interactive elements are summarized or omitted to reduce noise. The result is a compact representation that captures what agents need without overwhelming context windows.

Element Indexing

Element indexing transforms complex DOM trees into simple numbered references that agents can use reliably. When an agent sees a button labeled "Submit" at index 5, it can click that button by specifying index 5 without needing to understand CSS selectors, XPath expressions, or DOM structure.

The indexing algorithm processes the enhanced DOM tree in document order. When it encounters an interactive element—something clickable, typeable, or otherwise actionable—it assigns the next available index and records the mapping. Non-interactive elements like plain text or structural containers pass through without indices.

Interactivity detection examines multiple signals. Semantic HTML elements like buttons, links, and inputs are inherently interactive. ARIA roles indicate interactive behavior regardless of underlying HTML. Event listeners reveal programmatically interactive elements. Class names and IDs sometimes signal interactivity through conventional naming. The detection algorithm combines these signals to identify elements agents might want to interact with.

The selector map maintains bidirectional mapping between indices and DOM nodes. Given an index, the system retrieves the full enhanced node with all its properties. Given a node, the system finds its assigned index if one exists. This mapping persists across the agent interaction, ensuring consistent references within a session.
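The indexing walk and selector map can be sketched as follows. The node shape and the `is_interactive` heuristic here are simplified assumptions; the real library combines many more signals, as described above.

```python
# Hypothetical sketch: assign sequential indices to interactive nodes
# in document order and record the index -> node mapping.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

def is_interactive(node: dict) -> bool:
    return node.get("tag") in INTERACTIVE_TAGS or node.get("role") == "button"

def build_selector_map(root: dict) -> dict[int, dict]:
    """Walk the tree depth-first, numbering interactive nodes as they appear."""
    selector_map: dict[int, dict] = {}
    counter = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if is_interactive(node):
            node["index"] = counter
            selector_map[counter] = node
            counter += 1
        # reversed() preserves document order with a LIFO stack
        stack.extend(reversed(node.get("children", [])))
    return selector_map

page = {"tag": "body", "children": [
    {"tag": "h1", "children": []},
    {"tag": "button", "children": []},
    {"tag": "div", "role": "button", "children": []},
]}
selector_map = build_selector_map(page)
```

Note that the `div` with an ARIA role gets an index just like the semantic `button`, while the heading passes through unindexed.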

Visibility and Filtering

Not all DOM elements matter for agent interaction. Elements hidden by CSS, positioned outside the viewport, or obscured by overlapping content shouldn't clutter the agent's view. The extraction pipeline applies multiple filtering stages to focus on what's relevant.

Visibility filtering examines computed styles to identify hidden elements. A computed style of display: none, visibility: hidden, or zero opacity all indicate non-visible content. These elements and their descendants are excluded from the serialized representation.

Viewport filtering removes elements positioned entirely outside the current view. An element at coordinates far below the current scroll position can't be seen or clicked without scrolling. Including such elements would waste context and confuse agents about what's currently accessible.

Paint order filtering handles the subtlety of overlapping elements. When one element completely covers another, the hidden element isn't truly interactive even if it has the technical properties of interactivity. The filtering algorithm uses paint order information to identify and remove elements hidden behind others.
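The three filters combine into a single visibility predicate, sketched below under simplified assumptions: the real pipeline reads computed styles and paint order through CDP, while here a node is a plain dict and occlusion is a precomputed flag.

```python
# Simplified combination of style, viewport, and paint-order filtering.
def is_visible(node: dict, viewport_height: int, scroll_y: int) -> bool:
    style = node.get("style", {})
    if style.get("display") == "none":
        return False
    if style.get("visibility") == "hidden":
        return False
    if float(style.get("opacity", 1)) == 0:
        return False
    # Viewport filter: the element must overlap the visible scroll window.
    top, height = node["top"], node["height"]
    if top + height < scroll_y or top > scroll_y + viewport_height:
        return False
    # Paint-order filter: skip nodes flagged as fully covered by others.
    if node.get("covered_by_overlay", False):
        return False
    return True

visible = is_visible({"top": 100, "height": 40, "style": {}}, 800, 0)
offscreen = is_visible({"top": 2000, "height": 40, "style": {}}, 800, 0)
```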

Serialization Format

The final serialization produces an HTML-like string representation that LLMs can parse and understand. Interactive elements include their assigned indices as data attributes. Element content, attributes, and structure remain visible for semantic understanding.

A serialized fragment might look like a simplified HTML document where buttons show their index and label, inputs show their type and placeholder, and links show their text and destination. This format leverages LLMs' training on web content while adding the indexing layer needed for action targeting.

The serialization balances completeness with conciseness. Including every DOM attribute would overwhelm context windows. Omitting semantic information would prevent agents from understanding what elements do. The serializer selects attributes that convey meaning—labels, types, placeholders, values—while omitting purely technical attributes like internal IDs or framework-specific annotations.
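A toy serializer illustrates the format described above: indices in brackets, a curated attribute whitelist, and plain text kept for context. The exact attribute selection and bracket syntax here are assumptions for illustration.

```python
# Illustrative serializer producing an indexed, HTML-like representation.
KEEP_ATTRS = ("type", "placeholder", "value", "href", "aria-label")

def serialize(node: dict) -> str:
    parts = []
    if "index" in node:
        attrs = "".join(f' {k}="{node[k]}"' for k in KEEP_ATTRS if k in node)
        text = node.get("text", "")
        parts.append(f'[{node["index"]}]<{node["tag"]}{attrs}>{text}</{node["tag"]}>')
    elif node.get("text"):
        parts.append(node["text"])  # non-interactive text kept for context
    for child in node.get("children", []):
        parts.append(serialize(child))
    return "\n".join(p for p in parts if p)

dom = {"tag": "form", "children": [
    {"tag": "input", "index": 0, "type": "email", "placeholder": "Email"},
    {"tag": "button", "index": 1, "text": "Submit"},
]}
summary = serialize(dom)
```

An agent reading `summary` sees labels and indices but none of the framework noise a raw DOM dump would carry.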

The Action System

Action Vocabulary

Browser-Use defines a comprehensive vocabulary of actions that agents can perform. Each action type specifies the parameters it requires and the behavior it produces. This structured vocabulary enables precise communication between agents and the browser.

Navigation actions control what page the browser displays. The navigate action loads a specified URL, optionally in a new tab. Tab switching changes which of multiple open tabs is active. Tab closing removes tabs from the session. These actions give agents control over the browser's overall state.

Element interaction actions target specific indexed elements. Click actions simulate mouse clicks on elements identified by their index. Input actions type text into form fields. Scroll actions move the viewport up or down. These actions translate agent intentions into precise element manipulations.

Keyboard actions provide lower-level input control. Sending specific key sequences enables shortcuts, form submissions, and special characters. This capability handles scenarios where element-level interaction isn't sufficient.

Extraction actions retrieve information from pages. Text extraction pulls visible content for agent processing. These actions support information gathering tasks alongside interactive automation.

Completion actions signal task outcomes. The done action indicates that the agent has finished its work, either successfully or unsuccessfully. This explicit completion prevents agents from continuing indefinitely.
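A structured vocabulary like the one above can be expressed as schemas that validate agent output before execution. The action names and parameter fields below are assumptions modeled on the categories just described, not the library's literal schema.

```python
# Hypothetical action schemas: each action declares its required parameters,
# and validation rejects malformed specifications before execution.
ACTION_SCHEMAS = {
    "navigate":   {"url": str, "new_tab": bool},
    "click":      {"index": int},
    "input_text": {"index": int, "text": str},
    "scroll":     {"down": bool, "pages": float},
    "send_keys":  {"keys": str},
    "done":       {"success": bool, "message": str},
}

def validate_action(name: str, params: dict) -> bool:
    """Check the action exists and every parameter has the declared type."""
    schema = ACTION_SCHEMAS.get(name)
    if schema is None:
        return False
    return set(params) == set(schema) and all(
        isinstance(params[k], t) for k, t in schema.items()
    )

ok = validate_action("click", {"index": 5})
bad = validate_action("click", {"index": "five"})
```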

Action Execution Flow

When an agent specifies an action, execution follows a defined pipeline. The tools system validates the action parameters, resolves any element references, dispatches the appropriate event, and collects results.

For element-targeted actions, resolution translates indices to actual DOM nodes. The selector map lookup finds the enhanced node corresponding to the specified index. From this node, the system extracts whatever information is needed for execution—coordinates for clicking, input type for typing, scroll container for scrolling.

Event dispatch publishes the action request to the event bus. Registered handlers receive the event and perform the actual browser manipulation. This indirection enables multiple handlers to react to actions—one handler performs the click while another captures a post-action screenshot.

Result collection aggregates outcomes from all handlers. The primary handler reports success or failure. Screenshot handlers provide visual confirmation. Error handlers capture any exceptions. These results flow back to the agent as structured feedback.

Coordinate Handling

Browser automation requires careful coordinate management. Elements have positions in the page coordinate system. The viewport shows a portion of that space. Screenshots might be resized for LLM consumption. Clicks must land in precisely the right place despite these transformations.

Element coordinates come from CDP layout queries. The bounding box query returns an element's position and size in page coordinates. For clicking, the system typically targets the element's center—calculating the midpoint of the bounding box ensures clicks land within element boundaries.

Viewport transformation accounts for scroll position. Page coordinates are absolute—an element at page position 2000 might be visible in a viewport scrolled to position 1800. Click execution must account for this offset to generate correct screen coordinates.

Screenshot scaling adds another transformation layer. When screenshots are resized for LLM token efficiency—perhaps from 1920x1080 to 1400x850—coordinates provided by vision models reference the scaled dimensions. The system must inverse-scale these coordinates back to actual viewport coordinates before executing clicks.

Iframe handling requires additional transformation. Elements within iframes have coordinates relative to the iframe's content area, not the main page. Clicking such elements requires translating through the iframe's position in its parent document, potentially through multiple nesting levels.
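The transformations above compose into a small coordinate pipeline, sketched here with illustrative numbers: screenshot space is inverse-scaled to the viewport, page coordinates are offset by scroll position, and iframe offsets are summed through each nesting level.

```python
# Worked sketch of the coordinate pipeline described in this section.
def scale_from_screenshot(x: float, y: float,
                          shot_size: tuple[int, int],
                          viewport_size: tuple[int, int]) -> tuple[float, float]:
    """Inverse-scale coordinates a vision model gave against a resized shot."""
    return (x * viewport_size[0] / shot_size[0],
            y * viewport_size[1] / shot_size[1])

def page_to_viewport(x: float, y: float, scroll_x: float, scroll_y: float):
    """Absolute page coordinates minus the current scroll offset."""
    return x - scroll_x, y - scroll_y

def iframe_to_page(x: float, y: float, offsets: list[tuple[float, float]]):
    """Add each ancestor iframe's position, innermost first."""
    for ox, oy in offsets:
        x, y = x + ox, y + oy
    return x, y

# A click at (700, 425) on a 1400x850 screenshot of a 1920x1080 viewport:
vx, vy = scale_from_screenshot(700, 425, (1400, 850), (1920, 1080))
# An element at page y=2000 in a viewport scrolled to y=1800:
ex, ey = page_to_viewport(100, 2000, 0, 1800)
```

The center of the resized screenshot maps back to the center of the real viewport, and the element at page position 2000 lands 200 pixels below the top of the scrolled view, matching the examples in the text.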

State Representation for LLMs

Browser State Summary

Each time an agent needs to decide what to do, it receives a comprehensive summary of current browser state. This summary combines structural information from DOM extraction, visual information from screenshots, and contextual information about the browser session.

The DOM state provides the serialized, indexed representation of interactive elements. Agents parse this to understand what actions are available and how to target them. The format balances human readability—enabling inspection and debugging—with machine parseability—enabling reliable action specification.

Screenshot data offers visual understanding when vision-capable models are available. The base64-encoded image shows exactly what a human user would see. This visual grounding helps agents understand spatial relationships, visual feedback, and content that isn't well-represented in DOM structure alone.

Session context includes the current URL, page title, list of open tabs, viewport dimensions, and scroll position. This metadata helps agents understand where they are in a workflow and what navigation options exist.

Diagnostic information reports browser errors, pending network requests, and recent pop-up messages. This information helps agents understand when pages are still loading, when errors have occurred, and when they need to handle interruptions.

Message Construction

The state summary becomes part of messages sent to LLMs during agent execution. System prompts establish the agent's capabilities and action format. User messages present the current task and browser state. The LLM responds with reasoning and action specifications.

System prompts explain the available actions and how to use element indices. They establish conventions for action formatting, typically JSON structures that parsers can reliably extract. They set expectations for when tasks are complete and how to signal success or failure.

User messages combine task description with current state. The task provides the goal; the state provides the context for achieving it. Including recent action results helps agents understand the effects of their previous decisions and adjust accordingly.

Response parsing extracts action specifications from LLM output. Structured output mechanisms or regex extraction identify action JSON within generated text. Validation ensures specified actions are well-formed before execution.
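The regex-extraction path mentioned above might look like the sketch below: find a JSON object in free-form model output, parse it, and apply a minimal well-formedness check. Real implementations may rely on structured-output APIs instead; this is one plausible fallback.

```python
# Extract and validate an action specification embedded in LLM text.
import json
import re

def parse_action(llm_output: str):
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)
    if not match:
        return None
    try:
        action = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Minimal well-formedness check before execution.
    return action if "action" in action else None

reply = ('The Submit button is index 5, so I will click it.\n'
         '{"action": "click", "index": 5}')
action = parse_action(reply)
```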

Multi-Tab and Multi-Context Support

Tab Management

Modern web tasks often span multiple tabs. A research task might open several sources simultaneously. A comparison task might view alternatives side by side. A workflow might spawn tabs for sub-tasks while maintaining a main working tab.

Browser-Use tracks all open tabs through the session manager. Each tab corresponds to a CDP target with its own session. The agent focus target identifies which tab is currently active for actions. Tab information flows to agents through the browser state summary, enabling informed decisions about tab usage.

Opening new tabs happens either through navigation actions with the new-tab flag or through links that specify target="_blank". The session manager detects new targets as they appear, establishes CDP sessions, and makes them available for agent interaction.

Switching tabs changes the agent focus without affecting browser display. The agent can gather information from multiple tabs by switching focus, extracting state, and accumulating findings. This capability enables complex multi-source workflows.

Closing tabs removes them from the session. Agents might close tabs they've finished with, keeping the browser session manageable. Alternatively, tabs might remain open for potential return.

Iframe Handling

Iframes embed documents within documents, creating nested browsing contexts with separate DOM trees. A page might use iframes for embedded content, third-party widgets, or application architecture.

Browser-Use treats iframes as additional targets requiring their own CDP sessions. When DOM extraction encounters an iframe, it recursively extracts the iframe's content and integrates it into the overall page representation. Element indices span across iframe boundaries, allowing agents to interact with embedded content naturally.

Coordinate transformation handles the nesting. An element in an iframe has coordinates relative to the iframe's internal coordinate system. Clicking that element requires translating through each nesting level to produce final screen coordinates.

Cross-origin iframes present additional challenges. Security restrictions limit what information can be extracted from cross-origin content. Browser-Use works within these constraints, extracting what's available while gracefully handling restricted access.

Event-Driven Watchdog System

Watchdog Architecture

Watchdogs are background monitors that react to browser events and maintain system state. They implement the observer pattern, registering interest in specific events and executing handlers when those events occur.

Each watchdog focuses on a specific concern. The downloads watchdog tracks file downloads, capturing paths and making them available to agents. The crash watchdog detects browser crashes and enables recovery. The security watchdog blocks navigation to dangerous domains. The screenshot watchdog captures images after significant actions.

Watchdogs operate asynchronously alongside the main agent loop. When events they monitor occur, they execute their handlers without blocking other operations. This parallelism keeps the system responsive while maintaining comprehensive monitoring.

Registration happens during session initialization. Each watchdog registers its event handlers with the event bus. Throughout the session, it receives relevant events and processes them according to its logic.
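The registration pattern can be sketched with a downloads watchdog as the example. The tiny bus, the class name, and the `"download_complete"` event are illustrative stand-ins, not Browser-Use's actual interfaces.

```python
# Observer-pattern sketch: a watchdog subscribes at init and reacts to events.
from collections import defaultdict

class Bus:
    def __init__(self):
        self.handlers = defaultdict(list)
    def subscribe(self, event, fn):
        self.handlers[event].append(fn)
    def dispatch(self, event, payload):
        for fn in self.handlers[event]:
            fn(payload)

class DownloadsWatchdog:
    """Tracks completed downloads so agents can retrieve file paths."""
    def __init__(self, bus: Bus):
        self.completed: list[str] = []
        bus.subscribe("download_complete", self.on_complete)

    def on_complete(self, event: dict) -> None:
        self.completed.append(event["path"])

bus = Bus()
watchdog = DownloadsWatchdog(bus)
bus.dispatch("download_complete", {"path": "/tmp/report.pdf"})
```

Because the watchdog only touches the bus, adding or removing monitors never requires changes to the action-execution code.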

Key Watchdog Implementations

The downloads watchdog monitors file download events. When Chrome starts downloading a file, the watchdog tracks progress. When downloads complete, it records file paths and makes them available through action results. Agents performing download tasks receive the paths they need without explicit file system interaction.

The screenshot watchdog captures visual state at strategic moments. After actions that change the page—clicks, navigation, scrolling—it captures fresh screenshots for the next state summary. This ensures agents always see current visual state rather than stale images.

The DOM watchdog maintains cached DOM state. Full DOM extraction is expensive, so caching enables efficient repeated access within a step. The watchdog invalidates caches when actions might have changed page structure, ensuring freshness while minimizing redundant extraction.

The popup watchdog handles JavaScript alerts, confirms, and prompts. These dialogs block normal interaction until dismissed. The watchdog can auto-dismiss them or capture their messages for agent awareness. This prevents dialogs from stalling automation.

The permissions watchdog handles permission requests for geolocation, camera, notifications, and other sensitive capabilities. It can auto-grant, auto-deny, or defer to configuration. This enables automation of sites requiring permissions without manual intervention.

Configuration and Extensibility

Browser Profile Configuration

Browser sessions accept extensive configuration controlling browser behavior. Headless mode runs without visible UI for server deployment. User data directories enable persistent profiles with saved logins and preferences. Viewport dimensions control the visible area and screenshot size.

Proxy configuration routes traffic through specified servers. This enables geographic targeting, anonymization, or traffic inspection. The configuration accepts proxy URLs with optional authentication.

Permission defaults establish how to handle capability requests. Auto-granting all permissions enables automation of permission-dependent sites. Selective grants enable specific capabilities while denying others. These defaults apply unless overridden during execution.

Download configuration controls file download behavior. The download directory specifies where files land. Auto-download settings determine whether downloads proceed automatically or require explicit acceptance.
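Gathering the options above into one configuration object might look like the sketch below. The field names are assumptions for illustration, not the library's actual parameter names.

```python
# Hypothetical profile shape covering the configuration axes discussed.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BrowserProfile:
    headless: bool = True                        # no visible UI for servers
    user_data_dir: Optional[str] = None          # persistent profile, saved logins
    viewport: tuple = (1280, 720)                # visible area and screenshot size
    proxy_url: Optional[str] = None              # e.g. "http://user:pass@host:8080"
    grant_permissions: list = field(default_factory=list)
    downloads_dir: str = "./downloads"

profile = BrowserProfile(headless=False, grant_permissions=["geolocation"])
```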

Custom Action Registration

The action system supports custom actions beyond built-in capabilities. Applications can register domain-specific actions that integrate seamlessly with the standard action vocabulary.

Custom action registration specifies an action name, parameter model, and handler function. The name appears in action schemas sent to LLMs. The parameter model validates action specifications. The handler function executes when agents invoke the action.

Handler functions receive action parameters and the browser session. They can perform arbitrary operations—CDP commands, multi-step sequences, or external integrations—and return structured results. This extensibility enables domain-specific automation without modifying library code.
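The name/model/handler triple described above suggests a registry like the sketch below. The decorator style, the `save_note` action, and the `session` parameter are hypothetical, chosen to illustrate the shape of the mechanism.

```python
# Sketch of a custom-action registry: name, parameter model, handler.
from dataclasses import dataclass
from typing import Callable

REGISTRY: dict = {}

def register_action(name: str, params_model: type):
    def decorator(handler: Callable):
        REGISTRY[name] = (params_model, handler)
        return handler
    return decorator

@dataclass
class SaveNoteParams:
    text: str

@register_action("save_note", SaveNoteParams)
def save_note(params: SaveNoteParams, session=None) -> str:
    # A real handler could issue CDP commands through the browser session.
    return f"saved: {params.text}"

model, handler = REGISTRY["save_note"]
result = handler(model(text="hello"))
```

The parameter model doubles as documentation: its fields become the schema the LLM sees when the action appears in its vocabulary.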

Prompt Customization

Agent behavior depends heavily on system prompts that establish capabilities and conventions. Browser-Use supports prompt customization for application-specific needs.

Override prompts replace the default system message entirely. This enables complete control over agent instructions when default behavior isn't appropriate.

Extension prompts add to the default message. Application-specific guidelines, additional context, or modified conventions append to standard instructions. This approach preserves base functionality while adding customization.

Task-specific prompts vary the user message content. Beyond the standard state representation, applications might add domain knowledge, workflow context, or specialized instructions relevant to specific tasks.

Performance Optimization

Screenshot Management

Screenshots represent a significant token cost when using vision models. A full-size screenshot encoded as base64 might consume thousands of tokens per agent step. Browser-Use implements several optimizations to manage this cost.

Screenshot resizing reduces visual token consumption. Rather than sending full-viewport images, screenshots can be scaled to smaller dimensions—perhaps 1400x850 instead of 1920x1080. This reduction often preserves sufficient visual information while dramatically cutting token usage.

Coordinate scaling maintains interaction precision despite resizing. When agents specify click coordinates based on resized screenshots, the system scales these coordinates back to actual viewport positions. This transformation ensures clicks land correctly regardless of screenshot size.

Selective screenshot capture skips captures when unchanged. If an action didn't affect visual state—perhaps a failed click or a pure data extraction—the previous screenshot remains valid. Avoiding redundant capture saves both capture time and transmission bandwidth.

DOM Caching

Full DOM extraction involves multiple CDP round-trips and substantial processing. Caching enables efficient repeated access while maintaining freshness guarantees.

The DOM cache stores extraction results keyed by page state. Within a single agent step, multiple accesses to DOM state can share a single extraction. This sharing significantly reduces extraction overhead during complex decision-making.

Cache invalidation ensures agents see current state. Actions that might change DOM—clicks, navigation, form submission—invalidate the cache. Subsequent access triggers fresh extraction reflecting any changes.
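The cache-then-invalidate behavior can be captured in a few lines. The `DomCache` class is a sketch; the extractor lambda stands in for the expensive multi-round-trip CDP pipeline.

```python
# Minimal cache with explicit invalidation after state-changing actions.
class DomCache:
    def __init__(self, extractor):
        self._extract = extractor
        self._cached = None
        self.extractions = 0  # counts how often the expensive path runs

    def get(self):
        if self._cached is None:
            self._cached = self._extract()
            self.extractions += 1
        return self._cached

    def invalidate(self):
        """Call after any action that may have changed page structure."""
        self._cached = None

cache = DomCache(lambda: {"elements": 42})
cache.get()
cache.get()          # second call is served from the cache
cache.invalidate()   # e.g. after a click or navigation
cache.get()          # triggers a fresh extraction
```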

Incremental update strategies could potentially update caches rather than fully invalidating them. If an action affects only a small DOM region, updating just that region would be more efficient than re-extracting everything. This optimization remains an area for future enhancement.

Connection Pooling

CDP sessions represent network connections with establishment overhead. Pooling reuses connections across targets, reducing connection churn during multi-tab operation.

The session manager maintains the pool, tracking which sessions serve which targets. When new targets appear, it preferentially reuses existing sessions where protocol compatibility allows. When targets close, it returns their sessions to the pool rather than immediately disconnecting.

Pool sizing balances resource usage against connection establishment latency. Too few pooled connections force frequent establishment. Too many waste resources on idle connections. Adaptive sizing based on recent usage patterns would optimize this tradeoff.

Error Handling and Recovery

Error Categories

Browser automation encounters various failure modes requiring different handling strategies. Browser crashes terminate the entire session, requiring full restart. Page errors might affect a single tab while leaving others operational. Action failures indicate that specific interactions didn't succeed but the browser remains functional.

Browser crashes produce characteristic CDP disconnection events. The crash watchdog detects these events and can trigger recovery procedures. Recovery might restart the browser, reload previous state, and continue execution.

Page errors include JavaScript exceptions, network failures, and rendering problems. These errors affect page functionality without crashing the browser. Error information flows to agents through diagnostic state, enabling informed decisions about retry or alternative approaches.

Action failures indicate that interactions didn't produce expected results. A click might miss its target. A navigation might be blocked. An input might be rejected. These failures return through action results, enabling agent retry with adjusted parameters.

Recovery Strategies

Automatic retry handles transient failures. Network hiccups, timing races, and temporary unavailability often resolve on retry. The system can automatically retry failed actions with configurable limits before reporting failure to agents.

State recovery restores session context after disruptions. If a browser restart is necessary, saved state—cookies, local storage, navigation history—can be reloaded to resume where execution left off. This recovery minimizes the impact of crashes on long-running tasks.

Graceful degradation continues operation with reduced capability when full functionality isn't available. If certain features fail—perhaps screenshot capture—the system can continue with text-only state. This flexibility maximizes successful task completion despite partial failures.

Agent-level recovery enables intelligent response to failures. Rather than blindly retrying, agents can assess failures, adjust strategies, and attempt alternatives. The detailed failure information in action results enables this intelligent adaptation.

Conclusion

Browser-Use demonstrates the engineering complexity required for effective AI browser automation. Converting visual, interactive web pages into representations that language models can understand and act upon requires sophisticated DOM processing, careful coordinate management, and robust error handling.

The library's architecture balances multiple concerns: LLM compatibility through serialized, indexed DOM representations; interaction precision through CDP-level browser control; flexibility through configurable profiles and extensible actions; robustness through event-driven monitoring and error recovery.

For developers building AI agents that interact with the web, Browser-Use provides a production-ready foundation. For those seeking to understand how browser automation works, its architecture illustrates the patterns and techniques that enable AI systems to navigate the web alongside human users.

The capability to control browsers programmatically through natural language represents a significant expansion of what AI agents can accomplish. Browser-Use makes this capability accessible, enabling applications from automated research to workflow automation to testing and beyond.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
