
OpenManus: Deep Dive into the Open-Source AI Agent Framework

A comprehensive technical analysis of OpenManus—the open-source alternative to Manus AI. Understanding its multi-agent architecture, ReAct implementation, tool system, planning flows, and how it orchestrates complex autonomous tasks.


The Rise of Open-Source AI Agents

When Manus AI captured the imagination of the AI community in early 2025 with its viral demos of autonomous task completion, one question echoed through developer forums and social media: how does it actually work? While Manus AI remained proprietary, the open-source community responded with OpenManus—a fully transparent implementation that lets developers understand, customize, and deploy similar capabilities.

OpenManus represents more than just a clone. It's a well-architected framework that embodies the best practices emerging from the agentic AI landscape: the ReAct pattern for reasoning and acting, hierarchical agent design, comprehensive tool systems, and flow-based orchestration for complex multi-step tasks.

This deep dive explores every layer of OpenManus—from its foundational abstractions to the intricate details of how an autonomous agent perceives, reasons, and acts in the world.

Project Structure and Entry Points

Understanding OpenManus begins with its directory layout, which reflects a clean separation of concerns.

The app directory contains the framework's core implementation. Within it, the agent subdirectory houses all agent implementations from the base abstractions to specialized variants. The tool subdirectory defines the complete toolkit available to agents. The flow subdirectory manages orchestration patterns for multi-step execution. The prompt subdirectory stores the carefully crafted prompt templates that guide agent behavior.

Supporting files in the app directory handle cross-cutting concerns. The config module loads TOML configuration files. The llm module provides the LLM client abstraction. The schema module defines core data models including messages, tool calls, and memory structures. The logger module configures the Loguru-based logging system. The exceptions module defines framework-specific error types.

Three entry points provide different ways to interact with OpenManus. The main module offers a simple command-line interface for interactive agent sessions—users type requests and watch the Manus agent respond in real time. The app module provides a FastAPI-based web server with REST endpoints and server-sent events for building web applications around agent capabilities. The run_flow module enables flow-based execution for complex multi-step tasks requiring planning and coordination.

This structure supports both exploration and production use. Developers learning the framework can trace execution from entry points through the agent hierarchy. Teams deploying OpenManus can choose the interface that best fits their integration needs.

Architectural Philosophy

The Agent as a State Machine

At its heart, OpenManus treats agents as state machines that transition through well-defined phases. An agent begins in an idle state, ready to accept tasks. When execution starts, it transitions to running and remains there throughout active processing. Successful completion moves the agent to finished, while exceptions trigger an error state that serves as a transient indicator of problems.

The four states form a complete lifecycle. Idle represents readiness—the agent has been initialized and awaits instructions. Running indicates active work—the agent is cycling through think-act iterations. Finished signals successful completion—the task has been accomplished or the agent has determined it cannot proceed further. Error captures exceptional conditions—something unexpected occurred during execution.

State transitions happen through a dedicated context manager that ensures consistency even when exceptions occur. When entering the running state, the context manager records the previous state. If execution completes normally, the state remains as set by the agent logic. If an exception interrupts execution, the context manager catches it, briefly sets the error state for observability, then restores the previous state so the agent remains usable. This pattern prevents agents from becoming stuck in inconsistent states after failures.
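
A minimal sketch of this pattern is shown below, assuming an AgentState enum and a state_context helper; the names and exact control flow are illustrative rather than a quotation of the framework's code.

```python
from contextlib import asynccontextmanager
from enum import Enum


class AgentState(str, Enum):
    """Illustrative lifecycle states."""
    IDLE = "idle"
    RUNNING = "running"
    FINISHED = "finished"
    ERROR = "error"


class StatefulAgent:
    state: AgentState = AgentState.IDLE

    @asynccontextmanager
    async def state_context(self, new_state: AgentState):
        """Enter new_state; never leave the agent stuck after a failure."""
        previous = self.state
        self.state = new_state
        try:
            yield
        except Exception:
            self.state = AgentState.ERROR  # transient marker for observability
            raise
        finally:
            if self.state == AgentState.ERROR:
                self.state = previous  # restore so the agent remains usable
        # On normal completion the state stays wherever agent logic set it,
        # for example FINISHED after a terminate call.
```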

The state machine approach serves multiple purposes. It provides clear checkpoints for monitoring and debugging—you always know exactly what phase your agent is in. It enables safe error recovery by ensuring the system can gracefully handle failures without corrupting ongoing operations. And it creates natural points for human intervention when autonomous execution needs oversight. External systems can query agent state to make routing decisions, trigger alerts, or update user interfaces.

Separation of Thinking and Acting

OpenManus implements the ReAct pattern—Reasoning plus Acting—as its core execution paradigm. Rather than having agents generate monolithic responses, the framework explicitly separates the cognitive process into two distinct phases that alternate in a continuous loop.

During the thinking phase, the agent examines its current context, considers what tools are available, and decides what action would best advance toward its goal. This phase produces a structured decision rather than free-form text. The agent doesn't just say what it plans to do; it commits to specific tool invocations with concrete parameters.

The acting phase then executes those decisions. Tools run, results are captured, and the outcomes flow back into the agent's memory to inform the next thinking cycle. This separation prevents the common failure mode where LLMs generate plausible-sounding but ultimately fictional action descriptions without actually doing anything.

Hierarchical Agent Design

OpenManus organizes its agents in a carefully designed class hierarchy that promotes code reuse while allowing specialization.

At the foundation sits the base agent abstraction, which handles universal concerns. Every agent has a name and description for identification. Every agent maintains a reference to an LLM instance for reasoning. Every agent has a memory that accumulates messages throughout execution. Every agent tracks its current state and step count. The base agent defines the core run method that orchestrates the execution loop, incrementing steps and checking termination conditions, but delegates the actual work of each step to abstract methods that subclasses must implement.

The base agent also manages execution boundaries. A configurable maximum step count prevents infinite loops—agents that haven't finished after the limit simply stop. A duplicate threshold controls stuck detection sensitivity. The base agent provides methods to update memory with new messages and to check whether the agent appears stuck in a repetitive pattern.

The ReAct agent layer adds the alternating think-act pattern. It defines abstract methods for thinking and acting that concrete agents must implement, but it handles the orchestration between these phases. Each step calls think first, which returns a boolean indicating whether action is needed. If thinking produces actions to take, the step proceeds to call act, which executes those actions and returns results. This layer also implements stuck detection, recognizing when an agent produces identical outputs repeatedly and intervening by prepending strategy-change guidance to subsequent prompts.
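
In code, the orchestration of this layer can be sketched roughly as follows; the think/act/step vocabulary mirrors the framework's, while the bodies are simplified approximations.

```python
from abc import ABC, abstractmethod


class ReActAgent(ABC):
    """Simplified ReAct layer: each step alternates thinking and acting."""

    @abstractmethod
    async def think(self) -> bool:
        """Examine context and decide; return True if actions should be taken."""

    @abstractmethod
    async def act(self) -> str:
        """Execute the decided actions and return a summary of results."""

    async def step(self) -> str:
        """One think-act iteration of the execution loop."""
        should_act = await self.think()
        if not should_act:
            return "Thinking complete - no action needed"
        return await self.act()
```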

The tool-calling agent layer makes the framework concrete by implementing thinking via LLM function calling and acting via tool execution. During thinking, this layer constructs prompts from the system prompt, accumulated memory, and next-step guidance, then calls the LLM with available tool schemas. It parses the LLM's response to extract any tool calls, logging both the model's reasoning and its tool selections. During acting, this layer iterates through requested tool calls, parses their JSON arguments, executes the corresponding tools, and collects results into tool messages that flow back into memory. Special handling exists for the terminate tool, which signals task completion and triggers the transition to finished state.

Finally, specialized agents customize prompts and available tools for their specific domains. The Manus agent serves as the general-purpose workhorse, equipped with Python execution, Google search, browser automation, and file operations. Its system prompt positions it as an all-capable AI assistant ready to handle diverse tasks. The software engineering agent focuses on programming tasks, equipped with Bash execution and the string-replace editor, with prompts that emphasize code quality and systematic debugging. The planning agent adds sophisticated multi-step planning capabilities, using the planning tool to create, track, and update execution plans while mapping tool calls to plan steps for progress tracking.

The Tool System

Tools as First-Class Citizens

Tools transform agents from sophisticated chatbots into systems capable of affecting the real world. OpenManus treats tools as first-class architectural components with well-defined interfaces, making it straightforward to add new capabilities without touching agent code.

Every tool in OpenManus shares a common structure. It has a name that the LLM uses to invoke it, a description that helps the model understand when to use it, and a parameter schema that defines what inputs it accepts using JSON Schema format. This standardization means the framework can automatically convert any tool into the format expected by OpenAI's function calling API through a dedicated conversion method, presenting a uniform interface regardless of what the tool actually does.

The tool execution model is deliberately simple: a tool receives validated parameters and returns a result. OpenManus defines distinct result types to capture different outcomes. A standard tool result carries output content, optional error information, and optional system messages. A tool failure explicitly represents failed execution with error details. A CLI result provides specialized formatting for command-line operations. These typed results enable agents to reason precisely about what happened—distinguishing between successful execution with empty output versus actual failures.
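
A sketch of this shared structure, assuming Pydantic models named BaseTool and ToolResult that are close to, but not necessarily identical to, the framework's actual definitions:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional

from pydantic import BaseModel


class ToolResult(BaseModel):
    """Typed outcome of a tool run: output, optional error, optional system note."""
    output: Any = None
    error: Optional[str] = None
    system: Optional[str] = None


class BaseTool(ABC, BaseModel):
    name: str
    description: str
    parameters: Optional[Dict[str, Any]] = None  # JSON Schema describing inputs

    @abstractmethod
    async def execute(self, **kwargs) -> Any:
        """Run the tool with validated keyword arguments."""

    def to_param(self) -> Dict[str, Any]:
        """Convert the tool into OpenAI's function-calling format."""
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": self.parameters,
            },
        }
```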

Tool Collections and Orchestration

Individual tools gain power through aggregation. The tool collection abstraction groups related tools and provides operations over the group. Converting the entire collection to API parameters happens with a single method call, producing the array of function definitions that LLMs expect. Executing a specific tool by name routes through the collection, which locates the appropriate tool and invokes it. Batch execution runs all tools in the collection sequentially when needed.

Tool collections also support dynamic modification. Adding new tools at runtime enables agents that expand their capabilities based on context. Retrieving tools by name supports introspection and conditional logic. This flexibility means agent configurations can be assembled programmatically rather than hardcoded.
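
Building on the BaseTool sketch above, a tool collection might look roughly like this; the real implementation's error handling and logging are more elaborate.

```python
from typing import Any, Dict, List, Optional


class ToolCollection:
    """Groups tools and routes execution requests to them by name."""

    def __init__(self, *tools: BaseTool):
        self.tools = list(tools)
        self.tool_map = {tool.name: tool for tool in tools}

    def to_params(self) -> List[Dict[str, Any]]:
        """Produce the array of function definitions expected by the LLM API."""
        return [tool.to_param() for tool in self.tools]

    def add_tool(self, tool: BaseTool) -> None:
        """Dynamic modification: expand capabilities at runtime."""
        self.tools.append(tool)
        self.tool_map[tool.name] = tool

    def get_tool(self, name: str) -> Optional[BaseTool]:
        return self.tool_map.get(name)

    async def execute(self, name: str, tool_input: Dict[str, Any]) -> ToolResult:
        """Locate a tool by name and invoke it with the given arguments."""
        tool = self.tool_map.get(name)
        if tool is None:
            return ToolResult(error=f"Tool '{name}' is not part of this collection")
        try:
            result = await tool.execute(**tool_input)
        except Exception as exc:
            return ToolResult(error=str(exc))
        return result if isinstance(result, ToolResult) else ToolResult(output=result)
```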

The Terminate Tool

Among OpenManus's tools, terminate holds special status. Rather than affecting external state, it controls agent execution flow. When an agent invokes terminate, it signals that the task is complete—either successfully or unsuccessfully, as indicated by a status parameter.

The terminate tool accepts a status value: success indicates the task was accomplished, while failure indicates the agent determined it cannot proceed. Upon invocation, the tool triggers the agent's transition to finished state, breaking the execution loop. This explicit termination mechanism ensures agents don't spin indefinitely; they must eventually decide they're done.

The terminate tool's design reflects a key principle: agents should have agency over their own lifecycle. Rather than relying solely on external step limits, agents can recognize completion and stop themselves. This produces more natural behavior where agents finish when appropriate rather than always exhausting their step budget.
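
Concretely, a terminate tool under the BaseTool sketch above can be very small; the description text and parameter schema below are paraphrased, not quoted from the repository.

```python
class Terminate(BaseTool):
    """Special tool: signals completion instead of affecting external state."""
    name: str = "terminate"
    description: str = (
        "Finish the interaction when the task is complete or cannot proceed."
    )
    parameters: dict = {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "enum": ["success", "failure"],
                "description": "Final status of the task.",
            }
        },
        "required": ["status"],
    }

    async def execute(self, status: str) -> str:
        # The tool-calling agent watches for this tool name and switches
        # the agent into the finished state when it runs.
        return f"The interaction has been completed with status: {status}"
```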

Nested LLM Calls

The create chat completion tool enables a sophisticated pattern: tools that themselves use LLM reasoning. When a tool needs to make decisions or generate content beyond simple computation, it can invoke this tool to get LLM assistance.

This capability supports meta-cognitive patterns where agents reason about their reasoning. A tool might use nested LLM calls to validate its own output, generate alternative approaches, or produce human-readable summaries of complex results. The nested calls use the same LLM configuration as the parent agent, maintaining consistency in reasoning style.

Browser Automation

The browser use tool demonstrates OpenManus's approach to complex tool implementation. Built on the browser-use library, which itself wraps Playwright, it provides a high-level action vocabulary that lets agents interact with web pages the way humans conceptualize the task.

The browser tool supports over fifteen distinct actions organized into functional categories. Navigation actions include navigating to a URL, opening a new tab with a specified address, and refreshing the current page. These let the agent move through the web as a user would, thinking in terms of "go to this website" rather than wrestling with HTTP requests and response parsing.

Interaction actions handle the mechanics of engaging with page elements. The agent can click on elements identified by their index in a DOM element list, enter text into form fields at specified indices, and trigger arbitrary JavaScript execution for advanced scenarios. The framework also detects file downloads triggered by clicks and handles them appropriately.

Observation actions let the agent understand what's on screen. Taking a screenshot captures the full page as a base64-encoded image for vision-capable models to interpret. Getting HTML retrieves page source, though truncated to two thousand characters to avoid overwhelming context windows. Getting text extracts the body's inner text for simpler analysis. Reading links enumerates all hyperlinks on the page for navigation planning.

State management actions provide situational awareness. Getting current state returns comprehensive information including the current URL, page title, list of open tabs, and clickable elements indexed for interaction. Scrolling moves the viewport up or down by a specified pixel amount. Tab management allows switching between open tabs by ID or closing the current tab.

Internally, the browser tool maintains significant state across invocations. A single browser instance persists throughout an agent's execution, with a browser context maintaining cookies, storage, and session information. A DOM service indexes interactive elements, assigning the numeric indices that interaction actions reference. Async locks prevent concurrent access that could corrupt browser state. Lazy initialization means the browser only starts on first use, avoiding resource consumption for agents that never need web access.

Cleanup receives careful attention. The browser tool implements explicit cleanup methods to close the browser context and release resources. A destructor ensures cleanup happens even if explicit calls are missed. The cleanup logic handles both async and sync contexts, ensuring proper resource release regardless of how the agent terminates.

Code Execution

OpenManus provides two flavors of code execution: Python for computational tasks and Bash for system operations. Each addresses different needs while sharing common safety considerations.

The Python execution tool runs code in an isolated environment with significant restrictions. Execution happens in a separate thread with a strict five-second timeout, preventing runaway computations from freezing the agent. The execution environment intentionally limits available built-in functions, reducing the attack surface for malicious or accidentally dangerous operations. Output captures only what the code prints to standard output, enforcing a clear boundary between computation and communication.
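
The mechanics can be approximated with standard-library pieces. The sketch below shows the thread-plus-timeout pattern and stdout capture the text describes; the function name and the exact set of permitted built-ins are assumptions for illustration, not the framework's real allow-list.

```python
import threading
from contextlib import redirect_stdout
from io import StringIO


def run_python(code: str, timeout: float = 5.0) -> dict:
    """Run code in a worker thread, capture stdout, and enforce a timeout."""
    result = {"observation": "", "success": False}

    def target():
        buffer = StringIO()
        try:
            # Restricting __builtins__ limits what the executed code can reach.
            safe_globals = {
                "__builtins__": {"print": print, "range": range, "len": len, "sum": sum}
            }
            with redirect_stdout(buffer):
                exec(code, safe_globals, {})
            result["observation"] = buffer.getvalue()
            result["success"] = True
        except Exception as exc:
            result["observation"] = repr(exc)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return {"observation": f"Execution timeout after {timeout} seconds", "success": False}
    return result
```

Calling run_python("print(sum(range(10)))") would return the printed output "45" with success set to True, while code that never finishes would hit the five-second timeout instead.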

The Bash execution tool provides system-level capabilities with its own safeguards. Unlike Python, Bash execution maintains a persistent session across invocations—environment variables set in one command remain available in subsequent commands, and directory changes persist. This session persistence enables multi-step system operations that would be impossible if each command started fresh. A two-minute timeout prevents stuck commands from blocking execution indefinitely, and the tool supports running commands in the background for long-running operations.

File Operations

File manipulation in OpenManus goes beyond simple read and write operations. The string replace editor tool implements a sophisticated editing model inspired by how developers actually modify code.

The editor supports five distinct operations. View displays file contents with line numbers, supporting optional line range parameters for examining specific sections. The output truncates at sixteen thousand characters to prevent overwhelming agent context. Create makes new files, optionally populating them with initial content. String replace performs surgical text substitutions—you specify the exact text to find and what to replace it with. Insert adds new content at a specific line number, pushing existing content down. Undo reverses the most recent edit operation on a file, enabling recovery from mistakes.

The string replace operation deserves special attention because it embodies a key design principle. Rather than overwriting entire files, the editor operates on specific text patterns. This approach has several advantages: it's surgical rather than destructive, it fails safely when the expected text doesn't exist (indicating the file changed unexpectedly), and it produces minimal diffs that are easy to review. When the specified text appears multiple times in the file, the operation fails with an ambiguity error, requiring the agent to provide more context to uniquely identify the target.
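
The core of that uniqueness check can be sketched in a few lines; this is a simplified stand-in for the real editor, which also records history for undo.

```python
from pathlib import Path


def str_replace(path: str, old_str: str, new_str: str) -> str:
    """Surgical edit: replace old_str only if it occurs exactly once."""
    file_path = Path(path)
    content = file_path.read_text()

    occurrences = content.count(old_str)
    if occurrences == 0:
        raise ValueError(f"old_str not found in {path}; the file may have changed")
    if occurrences > 1:
        # Ambiguous target: the agent must supply more surrounding context.
        raise ValueError(
            f"old_str appears {occurrences} times in {path}; provide a unique snippet"
        )

    file_path.write_text(content.replace(old_str, new_str))
    return f"Replaced 1 occurrence in {path}"
```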

Undo support maintains history for each edited file. When an agent makes a mistake—replacing the wrong text or inserting in the wrong location—it can undo and try again. This history persists across tool invocations within a session, supporting iterative refinement of changes.

The file saver tool provides simpler write capabilities for cases where full editor functionality isn't needed. It creates files with specified content, automatically creating parent directories as needed. It supports both write mode for creating or overwriting files and append mode for adding to existing content. Async file operations via the aiofiles library ensure file writes don't block the agent's event loop.

Web Search and Scraping

Information gathering tools extend agent awareness beyond local files and immediate context. The Google search tool provides web search capabilities, returning URLs that the agent can then investigate further with other tools. The Firecrawl integration offers more sophisticated web scraping, extracting structured content from web pages including text, links, and even screenshots for visual analysis.

These tools exemplify OpenManus's composable design philosophy. A search returns URLs, browser tools can visit those URLs, and file tools can save the results—complex research workflows emerge from combining simple capabilities.

The Planning Tool

The planning tool occupies a unique position in OpenManus's architecture. Unlike tools that affect external state, it manages internal state—specifically, the multi-step plans that guide complex task execution.

Plans in OpenManus have a rich structure stored in memory. Each plan has a unique identifier for reference, a title summarizing the objective, an ordered list of steps describing what must be done, parallel arrays tracking each step's status and any associated notes, and metadata about creation and modification times. This structure captures not just what needs to happen but the evolving state of execution.

The planning tool provides seven distinct operations. Create generates a new plan from a title and list of steps, initializing all steps to not-started status. Update modifies an existing plan's title or steps. List shows all plans the agent has created. Get retrieves detailed information about a specific plan including current status. Set-active designates which plan is currently being executed. Mark-step updates a specific step's status and optionally adds notes. Delete removes a plan that's no longer needed.

Step statuses form a simple state machine. Steps begin as not-started, indicating work hasn't begun. When an agent begins working on a step, it transitions to in-progress. Successful completion moves the step to completed. If a step cannot proceed due to dependencies or blockers, it can be marked as blocked with explanatory notes.

The planning tool stores everything in memory using dictionary structures. This means plans don't persist across sessions—when the agent process terminates, plans disappear. For the typical use case of single-session task execution, this is sufficient. Production deployments needing plan persistence would need to add external storage.
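
A rough picture of that in-memory structure and the mark-step operation, with hypothetical field names and example content:

```python
# Illustrative shape of an in-memory plan record; field names are approximate.
plans: dict[str, dict] = {}

plans["plan_001"] = {
    "plan_id": "plan_001",
    "title": "Research and summarize recent agent frameworks",
    "steps": [
        "Search the web for open-source agent frameworks",
        "Visit the top three results and extract key features",
        "Write a comparison summary to summary.md",
    ],
    # Parallel arrays: one status and one note slot per step.
    "step_statuses": ["completed", "in_progress", "not_started"],
    "step_notes": ["Found 5 candidate frameworks", "", ""],
}


def mark_step(plan_id: str, index: int, status: str, note: str = "") -> None:
    """Mimics the mark-step operation: update one step's status and note."""
    plan = plans[plan_id]
    plan["step_statuses"][index] = status
    if note:
        plan["step_notes"][index] = note
```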

The planning tool's design reflects a key insight: LLMs excel at decomposing problems but struggle to maintain coherent execution across many steps without external scaffolding. By externalizing the plan into a data structure that persists across LLM invocations, OpenManus ensures that multi-step reasoning doesn't degrade as tasks grow complex. The agent can always query its current plan to remember where it is in a complex task.

Memory and Context Management

The Message-Based Memory Model

OpenManus implements memory as a sequence of messages that grows throughout task execution. Each message has a role—user, system, assistant, or tool—and content appropriate to that role. This design directly mirrors the conversation format expected by LLM APIs, eliminating translation overhead when constructing prompts.

The message abstraction provides rich structure beyond simple role and content. Messages can carry tool calls—structured requests for tool execution that the assistant generates. Messages can carry a name field identifying which tool produced a result. Messages can carry a tool call identifier that links tool results back to the specific invocations that requested them. This linking is essential when agents invoke multiple tools in a single turn; subsequent reasoning must understand which result came from which request.

Message creation happens through factory methods that enforce correct structure. Creating a user message requires only content—the user's input. Creating a system message similarly needs only content—the standing instructions. Creating an assistant message captures the agent's response, which might include both textual content and tool call requests. Creating a tool message requires the result content, the tool's name, and the tool call identifier for linking. A specialized factory creates assistant messages from tool call objects, handling the structural transformation automatically.

Converting messages for API consumption happens through a dedicated method that produces dictionary representations matching LLM API expectations. This conversion handles the various optional fields appropriately, omitting tool-related fields for simple text messages while including them for tool interactions.
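
A condensed sketch of the message model and two of its factory methods; field and method names approximate the framework's schema module.

```python
from typing import Any, List, Optional

from pydantic import BaseModel


class Message(BaseModel):
    """Message model mirroring the chat-completion conversation format."""
    role: str                               # "user" | "system" | "assistant" | "tool"
    content: Optional[str] = None
    tool_calls: Optional[List[Any]] = None  # structured tool requests from the assistant
    name: Optional[str] = None              # which tool produced a result
    tool_call_id: Optional[str] = None      # links a result to the request that caused it

    @classmethod
    def user_message(cls, content: str) -> "Message":
        return cls(role="user", content=content)

    @classmethod
    def tool_message(cls, content: str, name: str, tool_call_id: str) -> "Message":
        return cls(role="tool", content=content, name=name, tool_call_id=tool_call_id)

    def to_dict(self) -> dict:
        # Omit unset optional fields so plain text messages stay minimal.
        return {k: v for k, v in self.model_dump().items() if v is not None}
```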

User messages capture input from whoever initiated the task. System messages provide standing instructions that guide agent behavior. Assistant messages record what the agent said or decided, potentially including tool invocation requests. Tool messages contain the results of tool executions, linked back to the specific tool calls that generated them.

This message history serves as the agent's working memory. When deciding what to do next, the agent sees the full conversation leading to the current moment. Early messages provide context about the overall goal, recent messages show what was just attempted and what happened, and the accumulation captures the complete reasoning trajectory.

Sliding Window Management

Unbounded memory growth would eventually exceed context window limits, causing failures or degraded performance. OpenManus addresses this with a sliding window approach that maintains the most recent one hundred messages while older messages drop off.

The hundred-message limit represents a practical balance. It's large enough to capture multi-step reasoning chains without truncation in typical scenarios, yet bounded enough to prevent runaway context accumulation during long-running tasks. For most tasks, agents complete well before hitting this limit.
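
The trimming itself is simple; something along these lines, reusing the Message sketch above (the constant and class name are illustrative):

```python
MAX_MESSAGES = 100  # sliding-window size described in the text


class Memory:
    def __init__(self, max_messages: int = MAX_MESSAGES):
        self.messages: list[Message] = []
        self.max_messages = max_messages

    def add_message(self, message: Message) -> None:
        self.messages.append(message)
        # Keep only the most recent entries; older context drops off the window.
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]
```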

The sliding window creates an implicit forgetting mechanism. If an agent takes many steps, early context eventually disappears from its direct awareness. This can cause coherence problems in very long tasks, though in practice most tasks either complete or fail before forgetting becomes significant.

Tool Result Integration

When tools execute, their results must flow back into the conversation in a way the LLM can understand and use. OpenManus handles this by creating tool messages that include the tool's name, a reference to the specific invocation that triggered them, and the result content.

This linking between tool calls and tool results enables the LLM to understand which of its requests produced which outcomes. If an agent calls multiple tools in a single thinking cycle, the subsequent tool messages clearly indicate which result belongs to which call. The LLM can then reason about successes and failures on a per-tool basis rather than trying to disentangle a jumbled result stream.

The LLM Abstraction

Client Architecture

OpenManus abstracts LLM interactions through a dedicated class that handles the complexity of API communication. This abstraction supports multiple providers—OpenAI's API directly, Azure OpenAI deployments, and any compatible endpoint—through a unified interface.

The LLM class implements a singleton pattern keyed by configuration name. When code requests an LLM instance for a specific configuration, the class first checks whether an instance already exists for that configuration. If so, it returns the existing instance. If not, it creates a new instance, caches it, and returns it. This pattern ensures that configuration parsing and client initialization happen only once per configuration, and that all code using the same configuration shares connection resources.

Client selection happens based on API type specified in configuration. Standard OpenAI configuration creates an async OpenAI client pointing at the specified base URL. Azure configuration creates an async Azure OpenAI client with Azure-specific parameters including API version. Both client types present the same interface for making requests, so the rest of the framework doesn't need to know which provider is in use.
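
A sketch of the singleton-per-configuration behavior and client selection; the settings dictionary here stands in for the framework's parsed configuration objects, and the key names are assumptions.

```python
from typing import Optional

from openai import AsyncAzureOpenAI, AsyncOpenAI


class LLM:
    """Singleton-per-configuration wrapper around the async OpenAI clients."""

    _instances: dict = {}

    def __new__(cls, config_name: str = "default", **kwargs):
        # Reuse an existing instance for this configuration name if one exists.
        if config_name not in cls._instances:
            instance = super().__new__(cls)
            instance._initialized = False
            cls._instances[config_name] = instance
        return cls._instances[config_name]

    def __init__(self, config_name: str = "default", settings: Optional[dict] = None):
        if self._initialized:
            return
        settings = settings or {}
        if settings.get("api_type") == "azure":
            self.client = AsyncAzureOpenAI(
                api_key=settings.get("api_key"),
                azure_endpoint=settings.get("base_url"),
                api_version=settings.get("api_version"),
            )
        else:
            self.client = AsyncOpenAI(
                api_key=settings.get("api_key"),
                base_url=settings.get("base_url"),
            )
        self.model = settings.get("model")
        self._initialized = True
```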

Request Methods

The LLM class provides two primary methods for making requests, corresponding to different interaction patterns.

The ask method handles simple text generation. It takes a list of messages, constructs the appropriate API request, and returns the generated text. This method supports streaming output, allowing calling code to process generated text as it arrives rather than waiting for completion. Streaming is valuable for user-facing applications where showing progress improves perceived responsiveness.

The ask_tool method handles function-calling interactions. Beyond the message list, it accepts tool definitions in OpenAI's function-calling format and a tool_choice parameter controlling whether tool use is required, optional, or disabled. The method returns the full completion message object, preserving tool call information for the calling agent to process. A sixty-second timeout prevents hung requests from blocking execution indefinitely.

Both methods wrap API calls with retry logic using the Tenacity library. Up to six attempts occur with exponential backoff between failures. This retry pattern handles transient failures gracefully—rate limits, network hiccups, and temporary service issues resolve automatically without requiring manual intervention. The exponential backoff prevents retry storms from worsening overload conditions.
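
As a sketch, the function-calling request might look like the method below (shown standalone for brevity; it belongs on the LLM wrapper above). The decorator values mirror the behavior described here: up to six attempts with exponential backoff.

```python
from tenacity import retry, stop_after_attempt, wait_random_exponential


@retry(stop=stop_after_attempt(6), wait=wait_random_exponential(min=1, max=60))
async def ask_tool(self, messages, tools=None, tool_choice="auto", timeout=60):
    """Send a function-calling request and return the full completion message."""
    params = {
        "model": self.model,
        "messages": messages,
        "timeout": timeout,  # per-request timeout to avoid hung calls
    }
    if tools:
        params["tools"] = tools
        params["tool_choice"] = tool_choice
    response = await self.client.chat.completions.create(**params)
    return response.choices[0].message
```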

Message Formatting

A static utility method handles message format conversion. It accepts messages as either message objects or dictionaries and produces the list-of-dictionaries format that OpenAI's API expects. The method validates message structure, ensuring required fields are present and roles are valid. This validation catches configuration errors early rather than letting malformed requests reach the API.

Configuration Integration

LLM instances draw their settings from the configuration system. Model name, base URL, API key, maximum tokens, and temperature all come from configuration. Default values provide sensible starting points—4096 max tokens and 1.0 temperature—that work across common use cases.

Multiple LLM configurations can coexist. A default configuration handles most agent operations. A vision configuration might specify a model with image understanding capabilities. Custom configurations support specialized use cases. Agents request LLM instances by configuration name, enabling different agents or different operations to use different models as appropriate.

Flow-Based Orchestration

Beyond Single-Agent Execution

While individual agents handle focused tasks effectively, complex real-world objectives often require coordination across multiple agents with different specializations. OpenManus's flow system provides this coordination layer, orchestrating multi-step execution plans that may involve different agents at different stages.

The flow abstraction manages collections of agents, selecting which agent handles which step and ensuring that context flows appropriately between steps. This design separates the "what" of multi-step execution—the plan—from the "how" of individual step completion—the agents.

Planning Flow Architecture

The planning flow implementation demonstrates sophisticated multi-agent coordination. It begins by using an LLM to decompose a user's request into a structured plan with discrete, achievable steps. This plan becomes the roadmap for subsequent execution.

With a plan established, the flow iterates through steps. For each step, it identifies an appropriate executor agent, prepares context that includes both the specific step requirements and awareness of the overall plan, and delegates execution to the selected agent. After each step completes, the flow updates plan status and determines what comes next.

This architecture enables different agents to handle steps matching their specializations. A research step might route to an agent configured with web search tools, while a coding step routes to the software engineering agent. The flow handles routing and context transfer; agents focus on their domains.

Progress Tracking and Transparency

Throughout flow execution, the planning tool maintains explicit status for every step. Steps transition from not started to in progress when an agent begins working on them, then to completed when they finish successfully or to blocked if they encounter obstacles.

This status tracking serves multiple purposes. It enables progress reporting to users, showing exactly where in a complex task the system currently operates. It supports recovery from failures by identifying which steps succeeded and which need retry. And it provides audit trails for understanding how the system approached a problem.

Execution Dynamics

Prompt Construction

Before diving into the think-act cycle, understanding prompt construction illuminates how agents receive context. Each thinking phase constructs a prompt from three components that serve distinct purposes.

The system prompt establishes the agent's identity and standing instructions. For the Manus agent, this positions it as an all-capable AI assistant. For the software engineering agent, this emphasizes code quality and systematic approaches. System prompts remain constant throughout execution, providing stable grounding for the agent's behavior.

The message history provides accumulated context—everything that has happened since task initiation. User messages show what was requested. Assistant messages show what the agent decided. Tool messages show what happened when tools executed. This history grows throughout execution, giving later decisions access to earlier context.

The next-step prompt provides dynamic guidance for the current iteration. While the system prompt is static, the next-step prompt can change based on circumstances. When stuck detection triggers, the next-step prompt receives additional guidance encouraging strategy changes. This dynamic element allows the framework to influence agent behavior without modifying the core system prompt.

These three components concatenate to form the complete prompt sent to the LLM. The system prompt frames the interaction. The message history provides context. The next-step prompt focuses attention on immediate needs. Tool schemas accompany the prompt, informing the LLM what capabilities are available.

The Think-Act Cycle in Detail

A single iteration through OpenManus's core loop involves substantial orchestration. During the thinking phase, the system constructs the prompt as described above—system instructions, message history, and next-step guidance—and sends it to the LLM along with schemas for all available tools.

The LLM responds with its reasoning and, critically, its tool selection. OpenManus uses the function calling capability of modern LLMs, which provides structured tool invocations rather than free-form text describing intended actions. This structure eliminates parsing ambiguity—either the LLM selected a tool with specific parameters or it didn't. The tool choice parameter can be set to auto (the LLM decides whether to use tools), required (the LLM must use at least one tool), or none (tools are disabled for this call).

The thinking phase returns a boolean indicating whether tools were selected. If the LLM chose tools, thinking returns true and the agent proceeds to acting. If the LLM generated only text without tool calls—perhaps a final answer or a clarifying question—thinking returns false and the step completes without an acting phase.

If the LLM selected tools, execution enters the acting phase. Each tool call extracts parameters from the LLM's response—the tool name identifies which tool to invoke, and the arguments JSON provides the parameters. The agent locates the corresponding tool in its collection, parses the JSON arguments, and invokes the tool's execute method. Results flow back as tool messages added to memory, setting up context for the next thinking cycle.

Sequential tool execution means tools run one after another within a single acting phase. If the LLM requested three tools, they execute in order, with each result captured before the next tool runs. This sequential model simplifies reasoning about tool interactions but limits parallelism.
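
Sketched as a per-call helper on the tool-calling agent (names approximate; tool_call is the structured object returned by the OpenAI client, and available_tools is the ToolCollection sketched earlier):

```python
import json


async def execute_tool(self, tool_call) -> str:
    """Run one requested tool call: parse arguments, execute, wrap the result."""
    name = tool_call.function.name
    try:
        args = json.loads(tool_call.function.arguments or "{}")
    except json.JSONDecodeError:
        return f"Error parsing arguments for {name}: invalid JSON"

    # Route through the tool collection, which locates and invokes the tool.
    result = await self.available_tools.execute(name=name, tool_input=args)

    # Special handling: terminate flips the agent into the finished state.
    if name == "terminate":
        self.state = AgentState.FINISHED

    return f"Observed output of `{name}`:\n{result}"
```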

If the LLM didn't select tools—instead generating a final response—the agent recognizes this as potential task completion. The step completes without acting, and the agent's loop continues to the next iteration where it can decide whether more work is needed or whether to invoke the terminate tool explicitly.

Stuck Detection and Recovery

LLMs can enter degenerate states where they repeat the same action or reasoning indefinitely. OpenManus implements explicit stuck detection by monitoring recent assistant messages for repetition.

When the agent detects that consecutive responses are essentially identical—indicating it's stuck in a loop—it intervenes by modifying the next prompt to encourage strategy changes. This intervention breaks the repetition pattern, pushing the agent to try alternative approaches rather than spinning forever.

The detection threshold is configurable but defaults to catching loops after two consecutive duplicates. More aggressive detection might interrupt legitimate repeated operations (like processing multiple similar items), while more lenient detection allows longer stuck periods before intervention.
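
The two base-agent helpers involved can be sketched as follows; the method names and the exact guidance text are illustrative.

```python
def is_stuck(self) -> bool:
    """Return True when the latest assistant message repeats earlier ones."""
    if len(self.memory.messages) < 2:
        return False
    last = self.memory.messages[-1]
    if last.role != "assistant" or not last.content:
        return False
    duplicates = sum(
        1
        for msg in self.memory.messages[:-1]
        if msg.role == "assistant" and msg.content == last.content
    )
    return duplicates >= self.duplicate_threshold  # e.g. 2 by default


def handle_stuck_state(self) -> None:
    """Prepend strategy-change guidance to the next-step prompt."""
    stuck_prompt = (
        "Observed duplicate responses. Consider new strategies and avoid "
        "repeating ineffective paths already attempted."
    )
    self.next_step_prompt = f"{stuck_prompt}\n{self.next_step_prompt}"
```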

Error Handling Philosophy

OpenManus treats errors as information rather than failures. When a tool execution fails, the error message flows back into the conversation as a tool result. The agent sees what went wrong and can reason about alternatives, retry with different parameters, or acknowledge the limitation and work around it.

This approach reflects how capable humans handle tool failures. A broken web search doesn't end a research task; it prompts trying different search terms or alternative sources. By presenting errors as context rather than exceptions, OpenManus enables agents to exhibit similar resilience.

State machine transitions ensure that errors don't corrupt the agent's operational state. When exceptions occur during execution, the agent transitions to an error state temporarily, then recovers to idle—ready for the next attempt—rather than remaining stuck in an inconsistent state.

Configuration and Deployment

TOML-Based Configuration

OpenManus centralizes configuration in human-readable TOML files. The configuration specifies LLM connection details, model selection, and operational parameters like token limits and temperature settings.

Multiple LLM configurations can coexist in a single configuration file. A default configuration handles most operations, while specialized configurations can override settings for specific purposes. A vision-capable model might serve agents that need to interpret screenshots, while a faster model handles simpler reasoning tasks.

This configuration approach separates code from deployment concerns. The same agent codebase can connect to different models, endpoints, and providers by changing configuration rather than modifying code. This separation simplifies deployment across development, staging, and production environments.
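
An illustrative config.toml along these lines; the key names follow the pattern described here, but the project's shipped example configuration is the authoritative reference.

```toml
# Default model used for most agent operations
[llm]
model = "gpt-4o"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."
max_tokens = 4096
temperature = 1.0

# Optional override for vision-capable operations such as screenshot analysis
[llm.vision]
model = "gpt-4o"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."
```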

Singleton Pattern for Resources

Expensive resources like LLM client connections use a singleton pattern, ensuring that multiple agents or repeated operations share connections rather than creating new ones. This design prevents resource exhaustion during intensive operations and ensures consistent behavior across a deployment.

The singleton pattern also enables configuration caching. Once a particular LLM configuration loads, subsequent requests for the same configuration receive the existing instance. This caching eliminates redundant configuration parsing and connection establishment.

Web Interface

For interactive use cases, OpenManus provides a FastAPI-based web interface that exposes agent capabilities through REST endpoints and real-time event streaming.

The interface provides several endpoints. Creating a task accepts a prompt and returns a task identifier that clients use for subsequent interactions. Listing tasks returns all tasks in reverse chronological order. Getting a specific task returns its current details including status and results. The events endpoint provides a server-sent events stream for real-time monitoring of task execution.

Server-sent events stream distinct event types corresponding to different phases of agent operation. Think events carry the agent's reasoning—what it's considering and why. Tool events indicate which tools the agent has selected for execution. Act events report on tool execution progress. Run events provide step-level summaries as iterations complete. Status events communicate task status changes. Error events report problems that occur during execution. Complete events signal task termination with final results.

The event stream includes a heartbeat mechanism that keeps connections alive during long operations. Clients can reconnect and resume the stream if a connection drops, though they may miss events that occurred while disconnected.

The web interface spawns agent execution asynchronously, allowing the HTTP request to return immediately with a task identifier while the agent works in the background. This design supports both fire-and-forget patterns where clients don't need results immediately and monitoring patterns where clients follow progress through the event stream.

This streaming approach provides transparency into agent operation. Users see thinking as it happens, tool selections as they're made, and results as they arrive. This visibility builds trust and enables intervention when agents head in unproductive directions.

Design Patterns and Principles

Abstract Base Classes for Extension

OpenManus uses abstract base classes throughout its architecture to define interfaces while leaving implementation flexible. The base tool class specifies what every tool must provide—name, description, parameter schema, and an execute method—without constraining how tools achieve their functionality.

This abstraction enables extension without modification. Adding a new tool means implementing the tool interface; the rest of the framework integrates it automatically. Similarly, new agent types implement the required abstract methods and inherit all surrounding infrastructure.

Factory Pattern for Flow Creation

The flow factory pattern centralizes flow instantiation, mapping flow types to their implementations. When the system needs a planning flow, it asks the factory rather than directly instantiating. This indirection simplifies adding new flow types and enables runtime flow selection based on task characteristics.
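
In sketch form, with PlanningFlow as a stub standing in for the real implementation:

```python
from enum import Enum


class PlanningFlow:
    """Stub standing in for the real planning flow implementation."""
    def __init__(self, agents, **kwargs):
        self.agents = agents


class FlowType(str, Enum):
    PLANNING = "planning"


class FlowFactory:
    """Map a flow type to its implementation class and build instances."""

    @staticmethod
    def create_flow(flow_type: FlowType, agents, **kwargs):
        flows = {FlowType.PLANNING: PlanningFlow}
        flow_class = flows.get(flow_type)
        if flow_class is None:
            raise ValueError(f"Unknown flow type: {flow_type}")
        return flow_class(agents, **kwargs)
```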

Retry Logic with Exponential Backoff

LLM API calls can fail transiently due to rate limits, network issues, or service instability. OpenManus wraps LLM operations with retry logic that attempts operations multiple times with exponentially increasing delays between attempts.

This pattern handles the reality of distributed systems: failures happen, but many are temporary. By retrying with backoff, the system recovers from transient issues without user intervention while avoiding aggressive retry storms that would worsen overload conditions.

Async-First Implementation

OpenManus implements its core operations asynchronously, enabling efficient handling of IO-bound operations like LLM API calls, web requests, and file operations. The async design prevents blocking during these operations, allowing the system to remain responsive even during long-running tasks.

The async approach particularly benefits web interface deployments where multiple users might submit tasks simultaneously. Each task can progress independently without blocking others, and the event loop efficiently schedules work across all active operations.

Practical Considerations

Safety Boundaries

Giving AI agents the ability to execute code and manipulate files creates obvious safety concerns. OpenManus implements multiple safety boundaries to limit blast radius.

Code execution timeouts prevent runaway computations from consuming resources indefinitely. Python execution restricts available built-in functions, eliminating easy access to dangerous operations. The browser tool operates in its own context, isolated from the broader system. These boundaries don't make the system perfectly safe—determined attacks could still cause harm—but they prevent accidents and casual misuse.

For production deployments, additional isolation through containers or virtual machines provides defense in depth. Running OpenManus inside a restricted environment limits what even a compromised agent could accomplish.

Logging and Observability

OpenManus provides detailed logging throughout execution, capturing agent thinking, tool selection, execution results, and state transitions. These logs serve debugging during development and monitoring in production.

The logging uses semantic markers to distinguish different event types. Thinking appears with one marker, tool selection with another, results with a third. This consistency enables log parsing and aggregation, supporting dashboards and alerts for production monitoring.

Performance Characteristics

Agent performance depends heavily on LLM latency, which typically dominates execution time. Each think-act cycle requires at least one LLM call, and complex reasoning may require many. Browser operations add network latency for page loads and interactions.

The sequential nature of agent loops means parallelization opportunities are limited. While internal operations use async IO efficiently, the fundamental constraint is the serial dependency between steps: you must think before acting, and you must see results before thinking again.

Comparison with Alternatives

Versus Browser-Use Alone

Browser-use provides browser automation primitives; OpenManus builds comprehensive agent infrastructure around them. Using browser-use directly gives fine-grained control but requires implementing agent loops, memory management, multi-tool coordination, and everything else from scratch.

OpenManus makes sense when you want a complete agent framework. Browser-use alone makes sense when you want just browser automation without the agent overhead, or when you're building a very different agent architecture.

Versus Commercial Solutions

Commercial offerings like Claude Computer Use and OpenAI Operator provide similar capabilities with different tradeoffs. Commercial solutions handle infrastructure, offer polished user experiences, and include vendor support. They're faster to start with and require less technical depth.

OpenManus offers transparency, customization, and cost control. You can inspect every line of code, modify behavior arbitrarily, and avoid per-request API costs for the automation layer. For production systems with specialized requirements, regulatory constraints, or cost sensitivity, these factors often favor open-source approaches.

Versus Other Open-Source Frameworks

The open-source agent landscape includes many alternatives: LangChain for chain-based workflows, AutoGen for multi-agent conversations, CrewAI for role-based agent teams. Each embodies different design philosophies and optimizes for different use cases.

OpenManus occupies a particular niche: practical general-purpose agents with browser automation capabilities and transparent implementation. It's not the most feature-rich option or the most minimalist, but it provides a solid foundation for building and understanding agentic systems.

Dependencies and Technology Stack

OpenManus builds on a carefully chosen set of libraries that handle specific concerns well.

The OpenAI library provides the async client for LLM communication, supporting both OpenAI's API directly and compatible endpoints. This library handles authentication, request formatting, response parsing, and the streaming protocol for real-time output.

Pydantic provides data validation and settings management throughout the framework. Agent configurations, tool parameters, message structures, and API responses all use Pydantic models for type safety and validation. This catches configuration errors and malformed data early rather than allowing them to cause mysterious failures downstream.

Tenacity handles retry logic with exponential backoff. Wrapping LLM calls with Tenacity's retry decorator provides automatic recovery from transient failures without cluttering business logic with retry loops.

Browser-use and Playwright together provide browser automation capabilities. Browser-use offers the high-level agent-friendly interface while Playwright provides the robust browser automation engine underneath. This combination gives agents human-like web interaction abilities.

The googlesearch-python library provides web search capabilities, turning natural language queries into relevant URLs for further investigation.

Firecrawl-py enables sophisticated web scraping when simple HTTP requests aren't enough. Dynamic content, JavaScript-rendered pages, and complex extraction tasks benefit from Firecrawl's processing.

FastAPI serves the web interface, providing async request handling, automatic API documentation, and server-sent events support. Uvicorn runs as the ASGI server, enabling high-performance async web serving.

Aiofiles provides async file operations, ensuring file reads and writes don't block the event loop during IO-bound operations.

Loguru handles logging with a clean API and sensible defaults. Semantic log markers distinguish different event types, supporting both human reading during development and machine parsing in production.

Conclusion

OpenManus demonstrates that sophisticated AI agent capabilities don't require proprietary systems or opaque implementations. Through careful architecture—hierarchical agents, composable tools, explicit state management, and flow-based orchestration—it achieves autonomous task execution that rivals commercial offerings.

For developers seeking to understand how modern AI agents work, OpenManus provides a readable, well-structured codebase to study. For teams building agentic applications, it offers a foundation that can be customized and extended without vendor dependencies. For the broader AI community, it represents the continuing vitality of open-source approaches to advancing the field.

The agent paradigm is still young, and much remains to be discovered about effective agent architectures. OpenManus contributes to this exploration by making one successful approach transparent and accessible to all.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
