Online RL & Curriculum Learning for Agents: Training AI Systems That Improve Through Experience
Hands-on guide to training AI agents with online reinforcement learning, multi-agent curriculum learning, and reward shaping for agentic tasks. Covers 2025 frameworks like AgentGym-RL, Agent Lightning, WebRL, and process reward models for step-level feedback.
Traditional LLM training ends at deployment. The model you ship is the model users experience—it doesn't learn from interactions, adapt to new challenges, or improve from mistakes. But a new paradigm is emerging: agents that learn continuously through experience, using reinforcement learning to improve during and after deployment.
This guide covers three interconnected frontiers: online RL where agents learn from real interactions, curriculum learning where agents train on progressively harder tasks, and reward shaping that provides meaningful feedback for complex agentic behaviors. Together, these techniques enable AI systems that genuinely improve through experience.
Why Online RL for Agents?
The fundamental limitation of supervised fine-tuning is that models can only learn from pre-collected data. They see examples of good behavior but never experience the consequences of their own actions. This creates a gap between training and deployment that becomes especially problematic for agents operating in dynamic environments.
The Static Model Problem
Consider a web agent trained on WebArena. During training, it sees expert demonstrations of navigating e-commerce sites, filling forms, and completing purchases. But websites change constantly—new layouts, updated interfaces, different error messages. The agent trained on January's websites encounters February's redesigns and struggles because it never learned to adapt.
The same pattern repeats across domains. Customer service agents encounter novel complaints. Code agents face new APIs and libraries. Research agents navigate evolving knowledge bases. Static training cannot anticipate every variation, and the distribution shift between training and deployment compounds over time.
The Promise of Online Learning
Online reinforcement learning addresses this directly. Instead of learning only from demonstrations, agents learn from their own interactions with the environment. They try actions, observe outcomes, and update their behavior based on what works. This creates a feedback loop where the agent continuously adapts to the actual distribution it encounters.
The shift is analogous to how humans learn skills. Reading about cooking versus actually cooking creates different kinds of understanding. The cook who experiments, fails, and adjusts develops intuitions that no recipe book can provide. Online RL gives agents that same opportunity to learn from experience.
2025: The Year Online Agent RL Became Practical
Karpathy's 2025 Year in Review highlights a critical shift: "Unlike the SFT and RLHF stages which are relatively thin/short, RLVR involves training against objective (non-gameable) reward functions allowing for longer optimization. Running RLVR turned out to offer high capability/$, and most of the capability progress of 2025 was defined by LLM labs chewing through this new stage."
This observation captures why 2025 marked a turning point. The combination of verifiable rewards, efficient RL algorithms, and scalable infrastructure finally made online learning for agents economically viable. What was once a research curiosity became a production technique.
Online RL Frameworks for Agents
Several frameworks have emerged to make online RL practical for LLM-based agents. Each addresses different challenges in the pipeline from agent execution to model training.
AgentGym-RL: The Unified Framework
AgentGym-RL, published in September 2025, provides a comprehensive framework for training LLM agents through multi-turn reinforcement learning. It addresses a critical gap: the community previously lacked a unified, interactive RL framework that could effectively train agents from scratch—without relying on supervised fine-tuning—across diverse environments.
The framework's architecture reflects lessons learned from earlier attempts. It separates three concerns that previous systems conflated:
The Environment Module provides standardized access to diverse scenarios through a server-client architecture with unified HTTP protocols. This abstraction means the same agent can train across web navigation, scientific research, embodied tasks, and digital games without code changes. The environments supported include WebArena (online shopping, forums, collaborative development), TextCraft (crafting games), BabyAI (grid world embodied tasks), and SciWorld (scientific exploration).
The Agent Module encapsulates reasoning and decision-making for multi-turn interactions. It supports mechanisms like long-horizon planning, self-reflection, and tool use. Crucially, it handles the complexity of multi-step reasoning where early decisions constrain later options—something that single-turn response generation never encounters.
The Training Module implements RL pipelines with support for PPO, GRPO, RLOO, and REINFORCE++. Beyond online RL, it also supports SFT, DPO, and AgentEvol for different training scenarios. This flexibility matters because the optimal training approach depends on available supervision, environment characteristics, and computational budget.
AgentGym-RL introduces ScalingInter-RL, a training approach that addresses the exploration-exploitation tradeoff for long-horizon tasks. In early stages, it emphasizes exploitation by restricting interaction horizons, helping the agent learn basic capabilities quickly. As training progresses, it gradually shifts toward exploration with larger horizons, encouraging diverse problem-solving strategies. This curriculum prevents the agent from getting stuck in local optima while avoiding the instability that comes from exploring too aggressively too early.
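The horizon-scaling idea can be sketched as a schedule that widens the allowed number of interaction turns as training progresses. This is an illustrative linear schedule under assumed bounds, not AgentGym-RL's actual implementation; the function name and defaults are invented for the example.

```python
def interaction_horizon(step: int, total_steps: int,
                        h_min: int = 5, h_max: int = 30) -> int:
    """Linearly expand the allowed interaction horizon over training.

    Early training caps rollouts at h_min turns (exploit basic skills);
    late training allows up to h_max turns (explore longer strategies).
    The linear ramp is an illustrative choice, not the paper's schedule.
    """
    frac = min(step / total_steps, 1.0)
    return h_min + round(frac * (h_max - h_min))

# Horizons grow monotonically from h_min to h_max over training.
schedule = [interaction_horizon(s, 100) for s in (0, 50, 100)]
```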
The results demonstrate the approach's effectiveness: AgentGym-RL trained 7B models that outperform other open-source models by large margins, rivaling or surpassing top-tier proprietary models on several benchmarks.
Microsoft's Agent Lightning
Agent Lightning, released in late 2025, takes a different approach focused on practical adoption. Its core insight is that most AI agents already exist—built with LangChain, AutoGen, CrewAI, or custom frameworks—and rewriting them for RL training is prohibitively expensive. Agent Lightning makes existing agents trainable without code modification.
The framework achieves this through Training-Agent Disaggregation, separating agent execution from model training. An Agent Runner manages execution lifecycle and collects interaction data, while an Algorithm Module handles LLM inference and training on GPU resources. They communicate asynchronously through LightningStore, which acts as a central repository for data exchange.
This architecture enables concurrent scaling: CPU-based agent execution and GPU-intensive model training proceed independently, optimizing resource allocation. An organization can run thousands of agent interactions across many CPUs while a smaller number of GPUs handle the more intensive training updates.
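A toy version of this disaggregation can be built with a thread-safe queue standing in for the data store: CPU-side runners produce trajectories while a trainer consumes them independently. The dictionary trajectory format and batch size here are invented for illustration, not Agent Lightning's actual API; the real LightningStore is a full data-exchange service, not a Python queue.

```python
import queue
import threading

# Shared store between runners and trainer (stand-in for LightningStore).
store = queue.Queue()

def agent_runner(runner_id: int, episodes: int) -> None:
    """Collect interaction data; in a real system each episode would be
    a multi-turn LLM rollout against an environment."""
    for ep in range(episodes):
        trajectory = {"runner": runner_id, "episode": ep, "reward": 1.0}
        store.put(trajectory)

def trainer(expected: int) -> list:
    """Consume trajectories for a training step; a real trainer would
    compute gradients here instead of just batching."""
    return [store.get() for _ in range(expected)]

threads = [threading.Thread(target=agent_runner, args=(i, 3)) for i in range(4)]
for t in threads:
    t.start()
batch = trainer(expected=12)
for t in threads:
    t.join()
```

Because runners only touch the queue, they can scale out on CPUs while the trainer owns the GPUs, which is the resource split the architecture is after.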
Agent Lightning's LightningRL algorithm addresses the credit assignment challenge that plagues multi-step agent training. When an agent completes a task after many LLM calls, which calls contributed positively? Which hurt? LightningRL includes a credit assignment module that decomposes trajectories into training transitions, enabling RL to handle complex interaction logic including multi-agent scenarios and dynamic workflows.
The framework also introduces Automatic Intermediate Rewarding (AIR), which converts system signals—tool return status, API success codes, validation results—into intermediate rewards. This reduces the sparse reward problem that makes long agent workflows so difficult to train. Instead of learning only from final task success, the agent receives feedback throughout execution.
Experiments show consistent improvement on text-to-SQL, RAG, and math tool use tasks using Llama 3.2 3B as the base model. More importantly, the framework works with any agent framework, lowering the barrier to RL adoption across the industry.
WebRL: Self-Evolving Curriculum for Web Agents
WebRL focuses specifically on web agents, where three challenges have historically limited RL success: scarcity of training tasks, sparse feedback signals, and policy distribution drift during online learning.
The framework's solution centers on a self-evolving curriculum. Instead of requiring humans to create progressively harder training tasks, WebRL automatically generates appropriate challenges based on the agent's current capability. As the agent masters easy tasks, the curriculum introduces harder ones. As the agent struggles, the curriculum provides more scaffolding.
This self-evolution addresses a fundamental limitation of static curricula: they assume a single learning trajectory. In practice, different agents starting from different base models have different strengths and weaknesses. A curriculum designed for one model may be too easy or too hard for another. Self-evolving curricula adapt to each learner.
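One minimal way to approximate an adaptive curriculum is to track per-level success rates and train at the level whose success rate sits closest to a target difficulty. This heuristic sampler is only a sketch of the idea; WebRL itself synthesizes new tasks from the agent's failures rather than selecting from fixed levels.

```python
def pick_difficulty(success_by_level: dict, target: float = 0.6) -> int:
    """Choose the difficulty level whose recent success rate is closest
    to a target (here 60%), so tasks stay challenging but solvable.
    Illustrative heuristic only -- not WebRL's task generator.
    """
    return min(success_by_level,
               key=lambda lvl: abs(success_by_level[lvl] - target))

# Agent aces level 1 and fails level 3 -> train at level 2.
rates = {1: 0.95, 2: 0.55, 3: 0.10}
chosen = pick_difficulty(rates)
```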
WebRL also includes a robust outcome-supervised reward model that provides meaningful feedback beyond binary task success. Web tasks often partially succeed—the agent might navigate to the right page but fill a form incorrectly. Outcome supervision distinguishes these cases, providing richer signal than all-or-nothing rewards.
The results are striking: WebRL improves Llama-3.1-8B's success rate on WebArena-Lite from 4.8% to 42.4%, significantly surpassing GPT-4-Turbo (17.6%) and GPT-4o (13.9%). A relatively small open model, trained with the right RL approach, outperforms models an order of magnitude larger.
NeMo Gym and OpenRLHF Integration
For scientific and specialized agents, NVIDIA's NeMo Gym provides infrastructure specifically designed for complex agentic behaviors. Traditional RLHF doesn't scale for multi-step research processes like literature review, hypothesis generation, experimental design, and analysis. NeMo Gym enables RL with verifiable rewards—computational verification of task completion rather than subjective human ratings.
The integration with OpenRLHF, added in November 2025, combines NeMo Gym's environment infrastructure with OpenRLHF's training efficiency. OpenRLHF provides comprehensive support for PPO, GRPO, REINFORCE++, and asynchronous training modes. The --async_train flag enables asynchronous RL where generation and training overlap, improving throughput. The --agent_func_path parameter connects to custom agent environments.
Research on asynchronous RLHF shows that overlapping generation and learning can make training up to 2x faster while maintaining performance. Training a chatbot from LLaMA 3.1 8B was ~40% faster than synchronous runs with matching final performance. For math and reasoning, async RL fine-tuned Rho 1B on GSM8k ~70% faster while matching synchronous accuracy.
Multi-Agent Curriculum Learning
Curriculum learning—training on progressively harder tasks—has long been recognized as beneficial for RL. But for agents, the challenge is creating appropriate curricula. Who decides what's "harder"? How do you generate enough tasks at each difficulty level? Multi-agent approaches provide elegant answers by letting agent interactions create natural curricula.
The Autocurriculum Phenomenon
The foundational insight comes from OpenAI's emergent tool use research: "Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy."
In their experiments, hiders and seekers co-evolved through six distinct phases. Hiders learned to build shelters from boxes. Seekers learned to use ramps to climb over shelters. Hiders learned to lock ramps in place. And so on. Each phase emerged naturally from competitive pressure—no human designer specified "now learn to use ramps."
This autocurriculum property is powerful because it scales with agent capability. For any skill level, an environment full of agents at that level provides appropriately challenging training. As agents improve, their opponents improve, maintaining productive training signal indefinitely. This sidesteps the curriculum design problem entirely.
SPIRAL: Self-Play for Reasoning
SPIRAL demonstrates that self-play benefits extend beyond physical tasks to reasoning. The key finding is remarkable: self-play on simple games improves mathematical reasoning without seeing any mathematical content.
Training on Kuhn Poker alone achieved 8.7% average improvement on math benchmarks. The explanation lies in what self-play teaches: chain-of-thought patterns, strategic thinking, consideration of alternatives. These general reasoning skills transfer to mathematics even though the training domain was completely different.
SPIRAL's automatic curriculum outperforms fixed training because self-play avoids two failure modes. Format learning collapse occurs when models overfit to specific response formats rather than learning underlying reasoning. Strategy exploitation occurs when models find shortcuts that work against static opponents but fail against adaptive ones. Continuous adaptation through self-play prevents both.
Self-Play SWE-RL for Software Engineering
Self-play SWE-RL (SSR) applies these ideas to software engineering, where the goal is training agents to find and fix bugs. The challenge is data: human-curated software issues and test suites are expensive and limited.
SSR breaks this dependency through a self-play loop. One "role" injects bugs into code. Another "role" tries to detect and fix them. Both improve simultaneously: the bug injector learns to create harder-to-find bugs, while the fixer learns to handle increasingly subtle issues.
This creates a curriculum that evolves with agent capability. Early training involves obvious bugs—syntax errors, missing imports. As both agents improve, bugs become more subtle—logic errors, edge cases, race conditions. The curriculum complexity is bounded only by what the agents can learn, not by human curation effort.
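The loop can be illustrated with a toy injector/fixer pair sharing one verifiable test suite. The `inject_bug` mutation and the `is_nonneg` snippet are invented for this example; in SSR both roles are LLMs proposing far richer edits, but the reward structure is the same: the test suite adjudicates both sides.

```python
def inject_bug(code: str) -> str:
    """Toy 'injector' role: flip a comparison operator to plant a bug."""
    return code.replace("<=", "<", 1)

def run_tests(code: str) -> bool:
    """Verifiable reward signal: does the code pass its test suite?"""
    ns = {}
    exec(code, ns)
    f = ns["is_nonneg"]
    return f(0) is True and f(-1) is False and f(5) is True

original = "def is_nonneg(x):\n    return 0 <= x\n"
buggy = inject_bug(original)

# The injector is rewarded when its bug breaks a test; the fixer would
# then be rewarded for restoring a passing state. Both rewards come from
# the same verifiable tests, so neither role needs human labels.
injector_reward = 0.0 if run_tests(buggy) else 1.0
```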
CRUISE: Curriculum-Based Self-Play
CRUISE (Curriculum-Based Iterative Self-Play), developed for multi-drone racing, shows how structured curricula can be combined with self-play for physical tasks with continuous control.
The approach addresses the severe exploration problem in head-to-head competition. When two untrained agents compete, neither can do anything interesting, so neither learns anything useful. CRUISE solves this by introducing difficulty progressively while maintaining self-play dynamics.
The insight generalizes: self-play works best when agents are capable enough to provide meaningful challenges to each other. Curriculum learning bootstraps that initial capability. Once agents reach a threshold, self-play takes over and continues driving improvement without further curriculum design.
Population-Based Training for Diversity
Beyond pairwise self-play, population-based methods maintain diverse agent populations that provide varied training partners. Research on Pommerman demonstrates a two-stage approach: curriculum learning establishes basic skills across three incremental difficulty phases, then population-based self-play enables continued improvement through diverse opponents.
The diversity aspect is crucial. Training against a single opponent, even one that improves, leads to overfitting. Populations prevent this by ensuring agents must handle varied strategies. Each agent in the population may have learned different approaches to the same task, and training against all of them produces more robust policies.
Reward Shaping for Agentic Tasks
The sparse reward problem is especially severe for agents. A web agent might execute dozens of actions across multiple pages before completing a purchase. A research agent might retrieve, read, and synthesize dozens of documents before producing a report. If the only reward comes at the end, the agent receives almost no signal about which intermediate actions helped.
Reward shaping provides intermediate feedback that guides learning without changing the optimal policy. For agents, this means rewarding good intermediate actions—not just successful task completion.
Process Reward Models: Step-Level Feedback
Process Reward Models (PRMs) output scores at every step in a reasoning process, not just at the end. This provides fine-grained supervision that helps with credit assignment: when something goes wrong, PRMs can localize the failure to specific steps rather than blaming the entire trajectory.
Traditional outcome reward models (ORMs) create a sparse credit assignment problem. A 20-step agent trajectory receives a single reward at the end. Which steps were good? Which were mistakes? ORMs provide no information. PRMs evaluate each step, distinguishing "this step was helpful" from "this step was a mistake even though the task eventually succeeded."
The training is more expensive—you need step-level supervision rather than just outcome labels. But the improved credit assignment can dramatically accelerate learning, especially for long-horizon tasks where sparse rewards make learning nearly impossible.
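The difference in credit assignment is easy to see on a toy trajectory: an ORM emits one terminal score, while per-step PRM scores localize the weak step even when the episode succeeds. The step scores below are hand-picked for illustration; a real PRM is a trained model, not a lookup.

```python
trajectory = ["open search page", "query wrong product",
              "correct the query", "add item to cart", "checkout"]

orm_reward = 1.0                          # task succeeded: one terminal signal
prm_scores = [0.9, 0.2, 0.8, 0.9, 0.95]  # step 2 flagged despite success

# The PRM localizes the mistake to a specific step; the ORM's single
# scalar cannot distinguish any of the five steps from each other.
weakest_step = min(range(len(prm_scores)), key=prm_scores.__getitem__)
```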
AgentPRM: Process Rewards for Agentic Tasks
While PRMs have been explored for multi-step reasoning tasks like mathematics, AgentPRM extends them to agentic settings where actions impact external environments. The challenges are different: reasoning steps are typically reversible (you can always think again), but agent actions often have irreversible effects (you can't un-send an email).
AgentPRM follows a lightweight actor-critic paradigm, using Monte Carlo rollouts to compute reward targets and optimize policies. The key insight is that this requires minimal modifications to existing RLHF pipelines, making it practical to adopt at scale.
The framework includes InversePRM, which learns from failed trajectories as well as successful ones. If an agent fails a task, InversePRM identifies which steps led to failure, providing negative signal that's otherwise lost. This doubles the training data efficiency by learning from both successes and failures.
Results show that small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines on the ALFWorld benchmark. The combination of process supervision and inverse learning extracts more signal from each trajectory than traditional approaches.
ThinkPRM: Long Chain-of-Thought Verification
ThinkPRM, from April 2025, addresses the cost of step-level supervision by leveraging long chain-of-thought reasoning. Instead of training discriminative PRMs that require thousands of step-level labels, ThinkPRM fine-tunes a verifier on orders of magnitude fewer process labels.
The approach capitalizes on inherent reasoning abilities of long CoT models. Rather than learning from scratch what makes a reasoning step good, ThinkPRM uses the model's existing understanding of valid reasoning. This transfer reduces labeling requirements dramatically—ThinkPRM outperforms traditional approaches using only 1% of the process labels in standard datasets like PRM800K.
For agents, this suggests a path to affordable process supervision. Instead of labeling every step of every trajectory, you can fine-tune efficient verifiers that generalize from limited examples.
Credit Assignment for Multi-Step Actions
The credit assignment problem—determining which actions in a sequence contributed to an outcome—is fundamental to agent RL. Recent research from CMU shows that large language models can help solve this more efficiently than traditional methods.
RICOL (Retrospective In-Context Optimization for Learning) uses LLMs to retrospectively evaluate actions, converting sparse rewards into dense learning signals. Instead of training a critic network from scratch to estimate value functions, RICOL leverages an LLM's existing world knowledge to assess which actions were good in context.
This is more sample-efficient than Monte Carlo estimation, which requires many trajectories to estimate action values accurately. The LLM can often determine whether an action was helpful from a single trajectory by reasoning about what the action accomplished and what alternatives existed.
Meta's SWEET-RL: Turn-Level Advantages
SWEET-RL, introduced by Meta in March 2025, provides a framework specifically for multi-turn agent training with step-wise rewards.
Existing approaches often treat agent tasks as bandit problems, merging outcome and turn-level rewards to estimate trajectory-level advantages. This lacks fine-grained credit assignment—when rewards are spread across an entire trajectory, it's difficult to identify which specific decisions contributed positively or negatively.
SWEET-RL addresses this with an asymmetric actor-critic structure. The critic has access to additional information during training, such as the correct solution, which is not visible to the actor. This privileged information allows the critic to evaluate each decision with much finer resolution. Instead of training a value function that estimates overall reward, SWEET-RL directly models an advantage function at each turn.
The advantage function answers: "How much better was this action than what we expected?" This turn-level granularity is exactly what multi-turn agents need. Early good decisions that set up later success get appropriate credit, and early mistakes that doomed the trajectory get appropriate blame.
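One common way to realize turn-level credit is to score each intermediate state with the privileged critic and credit each turn by the change in value, A_t = V(s_{t+1}) − V(s_t). This differencing sketch is for intuition only; SWEET-RL trains the advantage function directly rather than subtracting value estimates.

```python
def turn_advantages(turn_values: list) -> list:
    """Turn-level advantage as the change in the critic's value estimate:
    A_t = V(s_{t+1}) - V(s_t). Each turn is credited by how much it
    moved the expected outcome (illustrative; not SWEET-RL's objective).
    """
    return [turn_values[t + 1] - turn_values[t]
            for t in range(len(turn_values) - 1)]

# A privileged critic (it saw the reference solution) scores each state.
values = [0.2, 0.5, 0.4, 0.9]   # the second turn lowered expected success
adv = turn_advantages(values)   # turn 2 gets negative credit
```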
Automatic Intermediate Rewarding
Microsoft's Agent Lightning introduces Automatic Intermediate Rewarding (AIR), which converts system signals into intermediate rewards without human labeling. Tool return status, API success codes, validation results, and error messages all become reward signals.
This is particularly valuable because agents already generate these signals—they're just discarded. A web agent that clicks a button receives a response indicating success or failure. A code agent that runs tests receives pass/fail results. AIR captures these existing signals and feeds them back as intermediate rewards.
The approach requires no additional annotation cost. The rewards are "free" in the sense that the agent environment already produces them. AIR just systematically collects and uses them for training, turning sparse outcome supervision into dense intermediate feedback.
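A minimal version of this signal-to-reward mapping might look like the following. The signal schema and reward magnitudes are invented for illustration and are not Agent Lightning's actual AIR interface; the point is only that signals the environment already emits become dense rewards for free.

```python
def intermediate_reward(signal: dict) -> float:
    """Map environment signals the agent already produces onto small
    intermediate rewards. Magnitudes are illustrative, not tuned."""
    if signal.get("kind") == "http":
        return 0.1 if 200 <= signal["status"] < 300 else -0.1
    if signal.get("kind") == "tests":
        return 0.2 * signal["passed"] / max(signal["total"], 1)
    if signal.get("kind") == "validation":
        return 0.1 if signal["ok"] else -0.2
    return 0.0  # unknown signals contribute nothing

rollout_signals = [
    {"kind": "http", "status": 200},          # page loaded
    {"kind": "tests", "passed": 3, "total": 4},  # partial test pass
    {"kind": "validation", "ok": False},      # form rejected
]
dense_rewards = [intermediate_reward(s) for s in rollout_signals]
```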
Verifiable Rewards: The RLVR Revolution
The 2025 shift toward verifiable rewards represents perhaps the most significant development in agent RL. RLVR (Reinforcement Learning with Verifiable Rewards) provides clear-cut, binary ground truth signals that are immune to gaming.
What Makes Rewards "Verifiable"?
Verifiable rewards are simple functions providing binary correctness signals—typically "1" (correct) or "0" (incorrect)—based on predefined criteria. Unlike neural reward models used in traditional RLHF, verifiable rewards offer direct, bias-free connection to ground truth.
For mathematics, verification is straightforward: does the answer match the ground truth? For code, verification involves execution: does the code pass test cases? For factual questions, verification checks against known correct answers.
The key advantage is non-gameability. Neural reward models can be exploited—models learn to produce responses that score highly without actually being good. Verifiable rewards can't be exploited in the same way: either the answer is correct or it isn't.
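A verifiable reward for math answers can be as small as an exact-arithmetic equality check. This sketch handles plain numbers and simple fractions only; production verifiers also normalize formats such as LaTeX before comparing.

```python
from fractions import Fraction

def math_reward(prediction: str, ground_truth: str) -> int:
    """Binary verifiable reward: 1 if the answers are numerically equal,
    else 0. Exact rational arithmetic avoids float round-off; malformed
    predictions simply score 0 rather than raising."""
    try:
        return int(Fraction(prediction) == Fraction(ground_truth))
    except (ValueError, ZeroDivisionError):
        return 0
```

Because the check is a pure function of the answer, there is nothing for the policy to exploit: either the prediction equals the ground truth or it does not.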
DeepSeek R1 and GRPO
DeepSeek's R1 model demonstrated RLVR's power dramatically. R1-Zero was trained without any supervised training data using GRPO (Group Relative Policy Optimization), and it self-evolved to solve complex problems through complex chain-of-thought reasoning.
The approach combined rule-based math/code rewards with preference rewards for open-ended tasks, achieving remarkable results at a fraction of traditional RLHF costs. Token-length regularization ("TLDR") dynamically shrinks chains-of-thought without hurting accuracy, addressing a common failure mode where models become unnecessarily verbose.
The Efficiency vs. Capability Debate
Not everyone agrees that RLVR improves fundamental reasoning capability. Tsinghua research from April 2025 titled "Reasoning LLMs Are Just Efficient Samplers" found that RLVR-trained models generate reasoning paths already in the base model's distribution.
Their analysis breaks down RLVR gains as:
- Majority: Search compression (pass@k → pass@1 efficiency)—the model gets better at finding the right answer on the first try
- Minority: Capability expansion (pass@k ceiling lift)—the model actually gains new problem-solving abilities
This doesn't invalidate RLVR—search compression is valuable. But it suggests that RLVR primarily makes models faster at finding solutions they could already find, rather than enabling solutions they couldn't find before.
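The pass@k framing behind this decomposition uses the standard unbiased estimator from code-generation evaluation: with n samples of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). The numbers in the usage line are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n (with c correct) solves the problem."""
    if n - c < k:
        return 1.0  # too few failures left to fill a k-sample with misses
    return 1.0 - comb(n - c, k) / comb(n, k)

# 'Search compression' means pass@1 rises toward pass@k after RLVR,
# while the pass@k ceiling itself moves little.
base_pass1 = pass_at_k(100, 20, 1)
base_pass8 = pass_at_k(100, 20, 8)
```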
Expanding RLVR to Diverse Domains
Research on extending RLVR shows the approach generalizes beyond math and code. By fine-tuning 7B base models using various RL algorithms and soft reward verifiers, researchers achieved up to 8.0% accuracy improvements on diverse, free-form reasoning tasks—surpassing models like Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B.
The key is creating verifiable rewards for new domains. For scientific tasks, verification might involve checking whether conclusions are supported by cited evidence. For planning tasks, verification might involve simulating whether a plan achieves stated goals. The framework is general; the challenge is defining appropriate verification for each domain.
Production Considerations
Moving online RL from research to production introduces practical challenges that academic papers often gloss over.
Training Stability
Online RL can be unstable, especially for long-horizon agent tasks. The policy distribution shifts during training, which can cause reward model predictions to become unreliable. Several strategies help:
Conservative updates: Techniques like PPO clip updates to prevent large policy changes. Agent Lightning's LightningRL and AgentGym-RL's ScalingInter-RL both implement careful update strategies that prioritize stability over speed.
Curriculum pacing: Starting with shorter horizons and gradually extending them (as in ScalingInter-RL) helps maintain stability. The agent builds capabilities incrementally rather than facing the hardest tasks immediately.
Asynchronous training caution: While async training improves throughput, it can affect stability. OpenRLHF documentation recommends prioritizing synchronous training when stability is critical.
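The "conservative update" mechanism PPO uses is the clipped surrogate objective, sketched here for a single action: once the policy ratio moves outside [1−ε, 1+ε], the objective earns no additional credit, so there is no gradient incentive for large policy jumps.

```python
def ppo_clip_objective(ratio: float, advantage: float,
                       eps: float = 0.2) -> float:
    """Per-action PPO clipped surrogate:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    Taking the min makes the bound pessimistic in both directions."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return min(ratio * advantage, clipped)

# A large ratio with positive advantage is capped at the clip boundary,
# so the update cannot chase it further.
capped = ppo_clip_objective(ratio=2.0, advantage=1.0)
```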
Infrastructure Requirements
Agent RL requires infrastructure that traditional LLM training doesn't. Agents interact with external environments—web browsers, code execution sandboxes, APIs, simulations. These environments must be:
- Parallelizable: Training requires many concurrent agent instances to collect experience efficiently
- Reproducible: For debugging and evaluation, you need to replay specific trajectories
- Safe: Agents shouldn't have unrestricted access to production systems during training
AgentGym-RL's server-client architecture with unified HTTP protocols addresses these needs. Environments run as services that agents connect to, enabling parallelization while maintaining isolation.
Reward Engineering
Designing rewards for agent tasks requires careful thought:
Avoid reward hacking: Even verifiable rewards can be gamed if verification is imperfect. A code agent might produce code that passes tests but fails on edge cases not covered by tests.
Balance density and noise: Denser rewards (from PRMs or AIR) help learning but introduce noise. If intermediate rewards don't align with final task success, they can mislead training.
Consider irreversibility: Agent actions often have real effects. Reward functions should account for this—preferring reversible approaches over irreversible ones, even if both accomplish the goal.
Evaluation Challenges
Evaluating online-trained agents is harder than evaluating static models:
Distribution shift: The agent performs best on distributions similar to its training. Evaluation on held-out environments may underestimate capabilities on the training distribution or vice versa.
Temporal dynamics: Agent capabilities may improve rapidly during training then plateau. Point-in-time evaluation may miss these dynamics.
Multi-agent effects: For agents trained with self-play, evaluation against held-out opponents may not reflect training performance.
Agent-RewardBench, presented at ACL 2025, provides a unified benchmark for reward modeling across perception, planning, and safety in multimodal agents. Standardized benchmarks like this are essential for comparing approaches fairly.
The Road Ahead
Several trends will likely shape online RL for agents over the next few years.
Scaling Verifiable Rewards
Recent analysis suggests that scaling RLVR could be equally pivotal as scaling pre-training. As modern LLMs approach the limits of language token exposure, learning from experience may represent the next capability leap.
The challenge is creating verifiable rewards for more domains. Math and code are "easy" because correctness is well-defined. Extending to open-ended tasks—creative writing, strategic advice, nuanced analysis—requires new verification approaches, possibly involving AI judges with their own verification mechanisms.
Multi-Modal Agent Learning
Current frameworks focus on text-based environments, but agents increasingly operate in multi-modal settings. Web agents need to understand screenshots. Embodied agents process sensor data. Research agents interpret figures and diagrams.
Multi-modal verification is harder than text verification. Determining whether a generated image meets a specification, or whether an agent correctly interpreted a chart, requires capabilities that current verification approaches lack.
Continual Learning Without Catastrophic Forgetting
Online learning naturally accumulates capabilities over time, but neural networks suffer from catastrophic forgetting—new learning can overwrite old capabilities. Agents need architectures and training methods that enable continuous improvement without degrading existing skills.
Methods like elastic weight consolidation, replay buffers, and modular architectures help, but the problem isn't solved. An agent that trains on web navigation, then code generation, then customer service should retain all three capabilities—current methods struggle with this.
Emergent Capabilities from Scale
Perhaps most exciting is the possibility that online RL at sufficient scale will produce emergent capabilities that don't appear at smaller scales. Self-play already produces emergent strategies in games. What emergent behaviors might appear in general-purpose agents trained with online RL on diverse tasks?
This is speculative, but the history of deep learning suggests that scale often enables qualitative shifts in capability. Online RL provides a path to scale that SFT and offline RLHF cannot match—there's no limit to how much experience an agent can accumulate.
Sources
- AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making - arXiv
- Agent Lightning: Train ANY AI Agents with Reinforcement Learning - Microsoft Research
- WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum - arXiv
- SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning - arXiv
- Process Reward Models for LLM Agents - arXiv
- ThinkPRM: Process Reward Models That Think - arXiv
- SWEET-RL: Step-Wise Reinforcement Learning for Multi-Turn Agents - MarkTechPost
- Reinforcement Learning with Verifiable Rewards - Label Studio
- How to Train Scientific Agents with RL - NVIDIA
- OpenRLHF: Async and Agent RL - GitHub
- Emergent Tool Use from Multi-Agent Autocurricula - OpenReview
- 2025 LLM Year in Review - Karpathy
- Cutting-Edge Advancements in RLHF 2023-2025 - Medium