RL Algorithms for LLM Training: PPO, GRPO, GSPO, and Beyond
A comprehensive guide to reinforcement learning algorithms for LLM alignment—PPO, GRPO, GSPO, REINFORCE++, DPO, and their variants. Understanding the tradeoffs that power modern AI assistants.
The RL Algorithm Landscape
Reinforcement learning has become the key differentiator in LLM capabilities. From ChatGPT's PPO-based RLHF to DeepSeek R1's GRPO to Qwen3's GSPO, the choice of RL algorithm significantly impacts training stability, efficiency, and final model quality.
Why RL algorithms matter for LLMs: After pretraining (predicting next tokens) and SFT (following instructions), RL is what makes models helpful, harmless, and honest. But RL for LLMs is uniquely challenging: you're optimizing sequences of 1000+ tokens, rewards are sparse (one signal for the whole response), and policy updates must be stable across billions of parameters. The algorithm you choose determines whether training converges smoothly or collapses catastrophically.
The stability-efficiency tradeoff: PPO is stable but requires 4 model copies in memory—impossible for 70B+ models without massive GPU clusters. DPO eliminates RL entirely but can't match PPO's final quality. GRPO halves memory but introduces instability. GSPO and REINFORCE++ fix GRPO's issues while keeping memory benefits. Each algorithm represents a different point on this tradeoff curve.
This post provides a comprehensive guide to the RL algorithms used in modern LLM training, their tradeoffs, and when to use each.
The Evolution of LLM RL
PPO (2017) → The gold standard, but complex and expensive
↓
DPO (2023) → Simplified to preference learning, no RL needed
↓
GRPO (2024) → Eliminated critic model, group-based advantages
↓
GSPO (2025) → Fixed GRPO instability with sequence-level optimization
↓
REINFORCE++ (2025) → Global normalization, most stable
From research: "By formalizing LLM post-training as a token-level MDP, algorithms such as REINFORCE, ReMax, RLOO, PPO, GRPO, and Dr. GRPO are variations of the same core principle: estimating unbiased gradients of the expected return while mitigating variance through carefully designed baselines."
PPO: The Foundation
Overview
Proximal Policy Optimization remains the most widely used algorithm for RLHF:
From Cameron Wolfe: "Due to its simplicity and effectiveness, PPO is widely used across domains and has become the go-to choice for aligning language models via RLHF."
The RLHF Pipeline with PPO
- Pre-train the LLM on internet text
- SFT on demonstration data
- Train reward model on human preferences
- RL fine-tune with PPO to maximize reward
Components Required
PPO requires four models in memory:
| Model | Purpose | Memory |
|---|---|---|
| Policy | The LLM being trained | Full model |
| Reference | Frozen SFT model for KL penalty | Full model |
| Critic | Estimates value function | Full model |
| Reward | Scores responses | Full model |
From research: "The memory overhead is high because we keep four copies of the LLM in memory: the policy, the reference policy, the critic, and the reward model."
The PPO Objective
Understanding the clipping mechanism: PPO's innovation is the clipped objective—it limits how much the policy can change in one update. If an action looks much better than expected (high advantage), vanilla policy gradients would push hard to increase its probability, potentially destabilizing training. PPO says "wait—if this action was already unlikely, don't make it too likely too fast." The clip prevents overshooting.
L_PPO = E[min(r_t × A_t, clip(r_t, 1-ε, 1+ε) × A_t)]
- β × KL(π || π_ref)
+ c × H(π)
Where:
- r_t = π(a|s) / π_old(a|s) [importance ratio]
- A_t = advantage estimate from critic
- ε = clip range (typically 0.2)
- β = KL penalty coefficient
- H(π) = entropy bonus
Breaking down each term:
- Clipped policy loss: The min(...) ensures the objective never incentivizes moving the policy ratio beyond [1-ε, 1+ε]. This is the stability mechanism.
- KL penalty: Keeps the new policy close to the reference (SFT) model, preventing reward hacking where the model finds degenerate high-reward outputs.
- Entropy bonus: Encourages exploration by penalizing overly deterministic policies. Without it, the model might collapse to always generating one "safe" response.
From research: "One of the key ideas behind PPO is that it limits how much the policy is allowed to change during each update step. This is done using a clipped loss function, which helps prevent the model from making overly large updates that could destabilize training."
PPO Advantages
- Most stable: Decades of research, well-understood
- Best performance: When tuned correctly, achieves highest quality
- Industry proven: Used by OpenAI, Anthropic for production models
PPO Disadvantages
- High memory: 4 full model copies
- Complex tuning: Many hyperparameters
- Slow training: 138% slower than REINFORCE++ in reported comparisons
From research: "While PPO achieves significant advantages in accuracy and reward, it was 138% slower than REINFORCE++ in training speed."
GRPO: DeepSeek's Innovation
The Key Insight
GRPO eliminates the critic model by estimating advantages from groups of responses to the same prompt:
From DeepSeek: "To save training costs of RL, they adopted Group Relative Policy Optimization (GRPO), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead."
How GRPO Works
import statistics

def grpo_advantage(rewards, eps=1e-8):
    """
    Given the rewards for a group of responses sampled from the same prompt,
    compute each response's advantage relative to the group statistics.
    The group mean acts as the baseline, so no critic model is needed.
    """
    group_mean = statistics.mean(rewards)
    group_std = statistics.pstdev(rewards)  # population std over the group
    advantages = []
    for reward in rewards:
        adv = (reward - group_mean) / (group_std + eps)
        advantages.append(adv)
    return advantages
GRPO vs PPO
| Aspect | PPO | GRPO |
|---|---|---|
| Critic model | Required | Not required |
| Memory | 4 models | 3 models |
| Baseline | Learned value function | Group statistics |
| Stability | High | Moderate |
| Use case | General RLHF | Reasoning models |
GRPO's Success with DeepSeek R1
DeepSeek used GRPO to train R1-Zero with pure RL:
From DeepSeek: "We directly applied reinforcement learning to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought for solving complex problems."
Results: "The pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, improves to 86.7%."
GRPO's Problems
However, GRPO has significant stability issues:
From research: "GRPO exhibits the weakest performance among the three RL algorithms evaluated."
From Qwen team: "Existing RL algorithms (such as GRPO) exhibit severe instability issues during long training and lead to irreversible model collapse, hindering further performance improvements with increased compute."
Three biases in GRPO (identified by Dr. GRPO paper):
- Baseline bias: Using a biased baseline without correcting the scaling factor
- Response-level length bias: "For correct answers (with positive advantages), this bias incentivizes shorter responses; for incorrect answers (with negative advantages), this bias results in longer responses."
- Question-level difficulty bias: "Questions within one batch can vary significantly in type, domain, and difficulty, leading to question-specific gradient estimation bias."
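As a rough illustration of the remedy proposed in the Dr. GRPO paper (my paraphrase, not code from the paper): dropping the division by the group standard deviation removes the question-level difficulty bias, and normalizing the token loss by a fixed constant rather than by each response's own length removes the length bias.

def dr_grpo_advantage(rewards):
    """Dr. GRPO-style advantage for one prompt's group of sampled responses.

    Center on the group mean but do NOT divide by the group std, so questions
    are not re-weighted by how spread out their rewards happen to be.
    (The length bias is addressed separately, in the loss: token losses are
    aggregated with a fixed normalizer instead of each response's own length.)
    """
    group_mean = sum(rewards) / len(rewards)
    return [r - group_mean for r in rewards]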
GSPO: Qwen's Solution
The Problem with Token-Level Optimization
From Qwen: "The instability of GRPO stems from the fundamental misapplication and invalidation of importance sampling weights in its algorithmic design. This introduces high-variance training noise that progressively accumulates with increased response length and is further amplified by the clipping mechanism, ultimately precipitating model collapse."
GSPO's Sequence-Level Approach
GSPO (Group Sequence Policy Optimization) operates at the sequence level instead of token level:
From Qwen: "Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization."
Key differences:
- Importance ratios computed over whole sequences
- Clipping at sequence level, not token level
- All tokens in a sequence treated equally during backprop
From research: "Optimization happens at the sequence level, not the token level. Importance ratios are calculated over the whole output. Clipping is also done at the level of full sequences."
GSPO Benefits
From Qwen:
- Better Training Stability: "GSPO has inherently resolved the stability challenges in the RL training of large Mixture-of-Experts (MoE) models, eliminating the need for complex stabilization strategies."
- Higher Efficiency: "GSPO demonstrates significantly higher training efficiency than GRPO, achieving better performance under the same training cost."
- Infrastructure-Friendly: "Due to sequence-level optimization, GSPO is fundamentally more tolerant to precision discrepancies."
Real-World Impact
From Qwen: "These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models."
REINFORCE++
Overview
REINFORCE++ achieves PPO-like stability without a value network:
From research: "REINFORCE++ achieves PPO-like training stability and efficiency without relying on a value network, by incorporating clipped policy updates, KL divergence penalties, and advantage normalization."
Key Innovation: Global Normalization
From research: "Rather than estimating advantages independently for each prompt (as in RLOO, GRPO), REINFORCE++ uses global advantage normalization. This mechanism provides better stability, particularly across heterogeneous prompts and noisy reward functions."
Performance Comparison
| Metric | PPO | GRPO | REINFORCE++ |
|---|---|---|---|
| Stability | Highest | Lowest | High |
| Speed | Slowest | Fast | Fastest |
| Memory | Highest | Lower | Lower |
| Quality | Best | Variable | Good |
From research: "Logic-RL and PRIME demonstrate that REINFORCE++ is more stable in training compared to GRPO and faster than PPO."
Efficiency data: "On Llama3 8B, REINFORCE++ reduced RLHF training time (70k samples, H100 GPU) from 60 hours (PPO) to 42 hours."
DAPO: ByteDance's Open-Source Solution
Overview
DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is ByteDance's fully open-source RL system:
From research: "DAPO is an algorithm that fully open-sources a state-of-the-art large-scale RL system achieving 50 points on AIME 2024 using Qwen2.5-32B base model."
Four Key Techniques
From the DAPO paper:
- Clip-higher: "Increases the upper bound of the PPO clipping range to encourage exploration and prevent entropy collapse during training."
- Dynamic sampling: "Improves training efficiency by filtering out prompts where all sampled responses are either always correct or always wrong."
- Token-level policy gradient loss: "Moves from sample-level to token-level loss calculation so that longer responses can have more influence on the gradient update."
- Overlong reward shaping: "Adds a soft penalty for responses that get truncated for being too long, which reduces reward noise and helps stabilize training."
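Of these, dynamic sampling is the simplest to picture in code. A minimal sketch, assuming binary per-response correctness scores and helper names of my own choosing: groups that are all-correct or all-wrong yield zero group-relative advantage, so they are filtered out of the batch.

def keep_prompt(group_rewards):
    # Drop prompts where every sampled response is correct or every one is
    # wrong: their group-relative advantages are all zero, so they consume
    # compute without providing any learning signal.
    return 0 < sum(group_rewards) < len(group_rewards)

def dynamic_sample(batch):
    # batch: list of (prompt, group_rewards) pairs from the rollout phase,
    # with group_rewards as 0/1 correctness scores.
    return [(prompt, rewards) for prompt, rewards in batch
            if keep_prompt(rewards)]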
Significance
From research: "Unlike previous works that withhold training details, DAPO introduces four key techniques that make large-scale LLM RL successful."
Published as a NeurIPS 2025 poster.
RLOO: REINFORCE Leave-One-Out
The Approach
From research: "RLOO eliminates the need for a value function at each time-step by replacing it with the expected return over multiple trajectories sampled on the fly."
For each prompt, RLOO:
- Samples N responses
- For response i, uses average reward of other N-1 responses as baseline
- This provides unbiased gradient estimates without a critic
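A short sketch of the leave-one-out baseline (illustrative; it requires at least two samples per prompt):

def rloo_advantages(rewards):
    # For each of the N sampled responses, the baseline is the mean reward of
    # the other N-1 responses, which keeps the gradient estimate unbiased
    # without training a critic. Requires len(rewards) >= 2.
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]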
Why It Works for LLMs
From research: "Not all RL tasks allow multiple trajectory sampling from the same state. In LLM post-training, the agent has significant control over transitions, enabling multiple trajectory sampling and making RLOO viable."
GRPO's Relationship to RLOO
From research: "GRPO is extremely closely related to other RL algorithms—it's derived from PPO and has a similar advantage computation as RLOO."
DPO and Preference-Based Methods
Why DPO Changed Everything
DPO eliminates RL entirely:
From research: "Direct Preference Optimization (DPO) offers a streamlined alternative to RLHF by optimizing the same objective but bypasses the explicit need for a separate reward model, thereby reducing the computational costs."
The DPO Objective
L_DPO = -log(σ(β × (log π(y_w|x)/π_ref(y_w|x)
- log π(y_l|x)/π_ref(y_l|x))))
Where:
- y_w = winning (preferred) response
- y_l = losing (rejected) response
- π_ref = reference model
- β = temperature
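A minimal sketch of this loss, assuming per-response log-probabilities have already been summed over tokens for both the policy and the reference model (tensor names are illustrative):

import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: scaled log-ratios of policy to reference for each response.
    chosen_reward = beta * (logp_w - ref_logp_w)
    rejected_reward = beta * (logp_l - ref_logp_l)
    # Maximize the margin between the preferred and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()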
DPO Limitations
From research: "DPO encounters several limitations: 1) high dependency on the SFT part, 2) tendency to overfit beyond a single epoch, and 3) inefficient learning and memory utilization."
From research: "Standard DPO is an offline training method, and it has become increasingly clear that it underperforms compared to more advanced online reinforcement learning (RL) techniques like RLHF with PPO."
DPO Variants
IPO (Identity Policy Optimization)
From Argilla: "The IPO paper provides a strong theoretical framework explaining the basis of RLHF and DPO, highlighting major shortcomings of these approaches. To avoid overfitting and weak regularization, IPO adds a regularization term."
Purpose: "IPO was developed to address the DPO overfitting issue."
KTO (Kahneman-Tversky Optimization)
From research: "KTO directly maximizes the utility of generations instead of maximizing the log-likelihood of the preferences. KTO only requires a binary signal of whether output is desirable or not, which is a kind of data easier to obtain than preferences."
Key advantage: "Without doing SFT first, DPO-aligned models tend to ramble and hallucinate entire conversations. KTO does not suffer from this phenomenon."
Performance: "KTO is good or better than DPO at all scales. For Llama models, KTO alone matches the performance of SFT and DPO combined."
ORPO (Odds Ratio Preference Optimization)
From Argilla: "ORPO creates a new objective by using an odds ratio-based loss to penalize undesirable responses along with conventional negative log-likelihood loss. It only relies on the base model as the preference alignment is performed during the SFT."
Benefits: "ORPO offers efficiency (requires significantly less computational resources than RLHF), stability (more stable training dynamics), and faster training time than the full alignment pipeline."
SimPO (Simple Preference Optimization)
From Princeton NLP: "SimPO is a simpler and more effective preference optimization algorithm than DPO that doesn't require a reference model."
Key innovation: "Using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model."
Results: "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard."
Published at NeurIPS 2024.
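A sketch of that implicit reward and the resulting loss, based on my reading of the SimPO paper; the target-margin term gamma, the default values, and the variable names are illustrative.

import torch.nn.functional as F

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    # Implicit reward: length-averaged log-probability of each response,
    # scaled by beta. No reference model is involved.
    reward_w = beta * logp_w / len_w
    reward_l = beta * logp_l / len_l
    # The preferred response must beat the rejected one by a margin gamma.
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()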
Online vs Offline DPO
From research: "Offline DPO is empirically inferior to online alignment methods. To explore the online algorithm of DPO, iterative and online DPO have been implemented."
Online DPO variants:
- Self-Rewarding: Model generates both responses and preference labels
- OAIF: Uses external LLM for preference labeling
- sDPO: Step-wise iteration through preference data partitions
Algorithm Selection Guide
When to Use Each Algorithm
| Algorithm | Best For | Avoid When |
|---|---|---|
| PPO | Maximum quality, production models | Limited compute/memory |
| GRPO | Reasoning tasks, DeepSeek-style | Long training, MoE models |
| GSPO | MoE models, long training runs | Your training framework doesn't yet support it |
| REINFORCE++ | Balance of speed and stability | Need maximum quality |
| DPO | Quick alignment, limited compute | Need online learning |
| KTO | Only have binary feedback | Have preference pairs |
| SimPO | Reference-free training | Need explicit reward model |
Decision Tree
Need maximum quality?
├── Yes → PPO (if compute allows) or GSPO
└── No
├── Have preference pairs?
│ ├── Yes → DPO/SimPO (offline) or Online DPO
│ └── No (only good/bad labels) → KTO
└── Need online RL?
├── Yes
│ ├── Training MoE? → GSPO
│ ├── Need stability? → REINFORCE++
│ └── Following DeepSeek approach? → GRPO
└── No → DPO variants
Practical Recommendations
For reasoning models: From research: "The connection between RLHF and reasoning comes from how the DeepSeek team applied a similar RL-based approach (with GRPO) to train the reasoning capabilities of their R1 and R1-Zero models."
Consider: GRPO → GSPO → REINFORCE++
For general alignment: Start with DPO/SimPO, graduate to PPO/GSPO for maximum quality.
For MoE models: From Qwen: "GSPO has inherently resolved the stability challenges in the RL training of large MoE models."
KL divergence considerations: From research: "Recent works like DAPO and Dr. GRPO have shown that the KL divergence term is not necessary for LLM reasoning tasks."
Implementation Resources
Frameworks
OpenRLHF: From GitHub: "OpenRLHF is an easy-to-use, scalable and high-performance RLHF Framework based on Ray supporting PPO, GRPO, REINFORCE++, TIS, vLLM, Ray, Dynamic Sampling, and Async Agentic RL."
From research: "The CMU Advanced Natural Language Processing Spring 2025 course uses OpenRLHF as the RLHF framework teaching case."
TRL (Transformers Reinforcement Learning): Supports SFT, RM, PPO, DPO, IPO, KTO, and ORPO through Hugging Face.
Hyperparameter Guidance
PPO:
ppo_config:
learning_rate: 1e-6
kl_coefficient: 0.01-0.1
clip_range: 0.2
batch_size: 512+
value_coefficient: 0.5-1.0
DPO/SimPO:
dpo_config:
learning_rate: 5e-7 to 5e-6
beta: 0.1-0.5
epochs: 1-3
batch_size: 128
GRPO:
grpo_config:
group_size: 8-16
clip_range: 0.2
learning_rate: 1e-6
kl_coefficient: 0.0 # Often not needed for reasoning
The Future of LLM RL
Trends
- Critic-free methods: GRPO, GSPO, REINFORCE++ show critic models aren't necessary
- Sequence-level optimization: GSPO's success suggests token-level may not be optimal
- KL-free training: Research shows KL penalty often unnecessary for reasoning
- Open-source parity: DAPO, OpenRLHF enabling reproduction of frontier results
Open Questions
From research: "Research has shown that the choice of RL algorithm between GRPO and RLOO, and different KL coefficients do not affect model collapse significantly."
This suggests other factors (data quality, reward design, training dynamics) may matter more than algorithm choice for many use cases.
Conclusion
The RL algorithm landscape for LLMs has evolved rapidly:
- PPO remains the gold standard for maximum quality when compute allows
- GRPO enabled reasoning breakthroughs but has stability issues
- GSPO fixes GRPO's problems and powers Qwen3
- REINFORCE++ offers the best balance of speed and stability
- DPO variants simplify alignment for compute-constrained settings
Choose your algorithm based on your constraints: compute budget, stability requirements, whether you need online learning, and target model architecture.
Related Articles
GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.
RLVR: Reinforcement Learning with Verifiable Rewards
Understanding Reinforcement Learning with Verifiable Rewards (RLVR)—the technique behind DeepSeek R1's reasoning capabilities, process reward models, and when to use verifiable vs human feedback.
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.