
RL Algorithms for LLM Training: PPO, GRPO, GSPO, and Beyond

A comprehensive guide to reinforcement learning algorithms for LLM alignment—PPO, GRPO, GSPO, REINFORCE++, DPO, and their variants. Understanding the tradeoffs that power modern AI assistants.


The RL Algorithm Landscape

Reinforcement learning has become the key differentiator in LLM capabilities. From ChatGPT's PPO-based RLHF to DeepSeek R1's GRPO to Qwen3's GSPO, the choice of RL algorithm significantly impacts training stability, efficiency, and final model quality.

Why RL algorithms matter for LLMs: After pretraining (predicting next tokens) and SFT (following instructions), RL is what makes models helpful, harmless, and honest. But RL for LLMs is uniquely challenging: you're optimizing sequences of 1000+ tokens, rewards are sparse (one signal for the whole response), and policy updates must be stable across billions of parameters. The algorithm you choose determines whether training converges smoothly or collapses catastrophically.

The stability-efficiency tradeoff: PPO is stable but requires 4 model copies in memory—impossible for 70B+ models without massive GPU clusters. DPO eliminates RL entirely but can't match PPO's final quality. GRPO halves memory but introduces instability. GSPO and REINFORCE++ fix GRPO's issues while keeping memory benefits. Each algorithm represents a different point on this tradeoff curve.

This post provides a comprehensive guide to the RL algorithms used in modern LLM training, their tradeoffs, and when to use each.

The Evolution of LLM RL

Code
PPO (2017)      → The gold standard, but complex and expensive
    ↓
DPO (2023)      → Simplified to preference learning, no RL needed
    ↓
GRPO (2024)     → Eliminated critic model, group-based advantages
    ↓
GSPO (2025)     → Fixed GRPO instability with sequence-level optimization
    ↓
REINFORCE++ (2025) → Global normalization, most stable

From research: "By formalizing LLM post-training as a token-level MDP, algorithms such as REINFORCE, ReMax, RLOO, PPO, GRPO, and Dr. GRPO are variations of the same core principle: estimating unbiased gradients of the expected return while mitigating variance through carefully designed baselines."

PPO: The Foundation

Overview

Proximal Policy Optimization remains the most widely used algorithm for RLHF:

From Cameron Wolfe: "Due to its simplicity and effectiveness, PPO is widely used across domains and has become the go-to choice for aligning language models via RLHF."

The RLHF Pipeline with PPO

  1. Pre-train the LLM on internet text
  2. SFT on demonstration data
  3. Train reward model on human preferences
  4. RL fine-tune with PPO to maximize reward

Components Required

PPO requires four models in memory:

| Model | Purpose | Memory |
|---|---|---|
| Policy | The LLM being trained | Full model |
| Reference | Frozen SFT model for KL penalty | Full model |
| Critic | Estimates value function | Full model |
| Reward | Scores responses | Full model |

From research: "The memory overhead is high because we keep four copies of the LLM in memory: the policy, the reference policy, the critic, and the reward model."

The PPO Objective

Understanding the clipping mechanism: PPO's innovation is the clipped objective—it limits how much the policy can change in one update. If an action looks much better than expected (high advantage), vanilla policy gradients would push hard to increase its probability, potentially destabilizing training. PPO says "wait—if this action was already unlikely, don't make it too likely too fast." The clip prevents overshooting.

Code
L_PPO = E[min(r_t × A_t, clip(r_t, 1-ε, 1+ε) × A_t)]
      - β × KL(π || π_ref)
      + c × H(π)

Where:
- r_t = π(a|s) / π_old(a|s)  [importance ratio]
- A_t = advantage estimate from critic
- ε = clip range (typically 0.2)
- β = KL penalty coefficient
- H(π) = entropy bonus

Breaking down each term:

  • Clipped policy loss: The min(...) ensures the objective never incentivizes moving the policy ratio beyond [1-ε, 1+ε]. This is the stability mechanism.
  • KL penalty: Keeps the new policy close to the reference (SFT) model, preventing reward hacking where the model finds degenerate high-reward outputs.
  • Entropy bonus: Encourages exploration by penalizing overly deterministic policies. Without it, the model might collapse to always generating one "safe" response.

From research: "One of the key ideas behind PPO is that it limits how much the policy is allowed to change during each update step. This is done using a clipped loss function, which helps prevent the model from making overly large updates that could destabilize training."
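
To make the clipping concrete, here is a minimal PyTorch-style sketch of just the clipped policy term (the function name and tensor shapes are illustrative; the KL penalty and entropy bonus from the objective above would be added separately):

Python
import torch

def ppo_clipped_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO policy loss over a batch of token log-probabilities."""
    # r_t = pi(a|s) / pi_old(a|s), computed in log space for stability
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # min(...) removes any incentive to push the ratio outside [1-eps, 1+eps]
    return -torch.min(unclipped, clipped).mean()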

PPO Advantages

  • Most stable: Decades of research, well-understood
  • Best performance: When tuned correctly, achieves highest quality
  • Industry proven: Used by OpenAI, Anthropic for production models

PPO Disadvantages

  • High memory: 4 full model copies
  • Complex tuning: Many hyperparameters
  • Slow training: reported as 138% slower than REINFORCE++

From research: "While PPO achieves significant advantages in accuracy and reward, it was 138% slower than REINFORCE++ in training speed."

GRPO: DeepSeek's Innovation

The Key Insight

GRPO eliminates the critic model by estimating advantages from groups of responses to the same prompt:

From DeepSeek: "To save training costs of RL, they adopted Group Relative Policy Optimization (GRPO), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead."

How GRPO Works

Python
import numpy as np

def grpo_advantage(rewards, eps=1e-8):
    """
    For each prompt, sample a group of responses and score them.
    Each response's advantage is its reward normalized by the group's
    mean and standard deviation -- no learned critic is required.
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    group_mean = rewards.mean()
    group_std = rewards.std()

    # Advantage of each response relative to its own group
    advantages = (rewards - group_mean) / (group_std + eps)
    return advantages.tolist()
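
For example, with binary verifier rewards over a group of four sampled responses (values illustrative), correct answers receive positive advantages and incorrect ones negative advantages:

Python
# Two correct and two incorrect responses to the same prompt
print(grpo_advantage([1.0, 0.0, 1.0, 0.0]))
# -> approximately [1.0, -1.0, 1.0, -1.0]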

GRPO vs PPO

| Aspect | PPO | GRPO |
|---|---|---|
| Critic model | Required | Not required |
| Memory | 4 models | 3 models |
| Baseline | Learned value function | Group statistics |
| Stability | High | Moderate |
| Use case | General RLHF | Reasoning models |

GRPO's Success with DeepSeek R1

DeepSeek used GRPO to train R1-Zero with pure RL:

From DeepSeek: "We directly applied reinforcement learning to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought for solving complex problems."

Results: "The pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, improves to 86.7%."

GRPO's Problems

However, GRPO has significant stability issues:

From research: "GRPO exhibits the weakest performance among the three RL algorithms evaluated."

From Qwen team: "Existing RL algorithms (such as GRPO) exhibit severe instability issues during long training and lead to irreversible model collapse, hindering further performance improvements with increased compute."

Three biases in GRPO (identified by Dr. GRPO paper):

  1. Baseline bias: Using a biased baseline without correcting the scaling factor

  2. Response-level length bias: "For correct answers (with positive advantages), this bias incentivizes shorter responses; for incorrect answers (with negative advantages), this bias results in longer responses."

  3. Question-level difficulty bias: "Questions within one batch can vary significantly in type, domain, and difficulty, leading to question-specific gradient estimation bias."

GSPO: Qwen's Solution

The Problem with Token-Level Optimization

From Qwen: "The instability of GRPO stems from the fundamental misapplication and invalidation of importance sampling weights in its algorithmic design. This introduces high-variance training noise that progressively accumulates with increased response length and is further amplified by the clipping mechanism, ultimately precipitating model collapse."

GSPO's Sequence-Level Approach

GSPO (Group Sequence Policy Optimization) operates at the sequence level instead of token level:

From Qwen: "Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization."

Key differences:

  • Importance ratios computed over whole sequences
  • Clipping at sequence level, not token level
  • All tokens in a sequence treated equally during backprop

From research: "Optimization happens at the sequence level, not the token level. Importance ratios are calculated over the whole output. Clipping is also done at the level of full sequences."

GSPO Benefits

From Qwen:

  1. Better Training Stability: "GSPO has inherently resolved the stability challenges in the RL training of large Mixture-of-Experts (MoE) models, eliminating the need for complex stabilization strategies."

  2. Higher Efficiency: "GSPO demonstrates significantly higher training efficiency than GRPO, achieving better performance under the same training cost."

  3. Infrastructure-Friendly: "Due to sequence-level optimization, GSPO is fundamentally more tolerant to precision discrepancies."

Real-World Impact

From Qwen: "These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models."

REINFORCE++

Overview

REINFORCE++ achieves PPO-like stability without a value network:

From research: "REINFORCE++ achieves PPO-like training stability and efficiency without relying on a value network, by incorporating clipped policy updates, KL divergence penalties, and advantage normalization."

Key Innovation: Global Normalization

From research: "Rather than estimating advantages independently for each prompt (as in RLOO, GRPO), REINFORCE++ uses global advantage normalization. This mechanism provides better stability, particularly across heterogeneous prompts and noisy reward functions."

Performance Comparison

| Metric | PPO | GRPO | REINFORCE++ |
|---|---|---|---|
| Stability | Highest | Lowest | High |
| Speed | Slowest | Fast | Fastest |
| Memory | Highest | Lower | Lower |
| Quality | Best | Variable | Good |

From research: "Logic-RL and PRIME demonstrate that REINFORCE++ is more stable in training compared to GRPO and faster than PPO."

Efficiency data: "On Llama3 8B, REINFORCE++ reduced RLHF training time (70k samples, H100 GPU) from 60 hours (PPO) to 42 hours."

DAPO: ByteDance's Open-Source Solution

Overview

DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is ByteDance's fully open-source RL system:

From research: "DAPO is an algorithm that fully open-sources a state-of-the-art large-scale RL system achieving 50 points on AIME 2024 using Qwen2.5-32B base model."

Four Key Techniques

From the DAPO paper:

  1. Clip-higher: "Increases the upper bound of the PPO clipping range to encourage exploration and prevent entropy collapse during training."

  2. Dynamic sampling: "Improves training efficiency by filtering out prompts where all sampled responses are either always correct or always wrong."

  3. Token-level policy gradient loss: "Moves from sample-level to token-level loss calculation so that longer responses can have more influence on the gradient update."

  4. Overlong reward shaping: "Adds a soft penalty for responses that get truncated for being too long, which reduces reward noise and helps stabilize training."
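
A rough sketch of the dynamic-sampling idea from point 2 (a hypothetical helper, not the DAPO implementation): prompts whose sampled responses are all correct or all wrong contribute no group-relative advantage signal, so they are dropped before the update.

Python
def filter_mixed_prompts(prompt_rewards):
    """Keep prompts whose response groups have mixed 0/1 outcomes."""
    kept = {}
    for prompt, rewards in prompt_rewards.items():
        # All-correct or all-wrong groups yield zero advantage everywhere
        if 0 < sum(rewards) < len(rewards):
            kept[prompt] = rewards
    return kept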

Significance

From research: "Unlike previous works that withhold training details, DAPO introduces four key techniques that make large-scale LLM RL successful."

Published as NeurIPS 2025 poster.

RLOO: REINFORCE Leave-One-Out

The Approach

From research: "RLOO eliminates the need for a value function at each time-step by replacing it with the expected return over multiple trajectories sampled on the fly."

For each prompt, RLOO:

  1. Samples N responses
  2. For response i, uses average reward of other N-1 responses as baseline
  3. This provides unbiased gradient estimates without a critic
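
A minimal sketch of that leave-one-out baseline (function name illustrative):

Python
import numpy as np

def rloo_advantages(rewards):
    """Advantage of response i = reward_i minus mean reward of the other N-1."""
    rewards = np.asarray(rewards, dtype=np.float32)
    n = len(rewards)
    baselines = (rewards.sum() - rewards) / (n - 1)
    return rewards - baselines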

Why It Works for LLMs

From research: "Not all RL tasks allow multiple trajectory sampling from the same state. In LLM post-training, the agent has significant control over transitions, enabling multiple trajectory sampling and making RLOO viable."

GRPO's Relationship to RLOO

From research: "GRPO is extremely closely related to other RL algorithms—it's derived from PPO and has a similar advantage computation as RLOO."

DPO and Preference-Based Methods

Why DPO Changed Everything

DPO eliminates RL entirely:

From research: "Direct Preference Optimization (DPO) offers a streamlined alternative to RLHF by optimizing the same objective but bypasses the explicit need for a separate reward model, thereby reducing the computational costs."

The DPO Objective

Code
L_DPO = -log σ( β × [ log(π(y_w|x) / π_ref(y_w|x))
                    - log(π(y_l|x) / π_ref(y_l|x)) ] )

Where:
- y_w = winning (preferred) response
- y_l = losing (rejected) response
- π_ref = reference model
- β = temperature
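
A minimal PyTorch-style sketch of this loss, given summed sequence log-probabilities under the policy and the frozen reference model (argument names are illustrative):

Python
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from log-probs of winning (w) and losing (l) responses."""
    # Implicit rewards are beta-scaled log-ratios against the reference model
    logits = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log(sigmoid(logits)), in a numerically stable form
    return -F.logsigmoid(logits).mean()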

DPO Limitations

From research: "DPO encounters several limitations: 1) high dependency on the SFT part, 2) tendency to overfit beyond a single epoch, and 3) inefficient learning and memory utilization."

From research: "Standard DPO is an offline training method, and it has become increasingly clear that it underperforms compared to more advanced online reinforcement learning (RL) techniques like RLHF with PPO."

DPO Variants

IPO (Identity Policy Optimization)

From Argilla: "The IPO paper provides a strong theoretical framework explaining the basis of RLHF and DPO, highlighting major shortcomings of these approaches. To avoid overfitting and weak regularization, IPO adds a regularization term."

Purpose: "IPO was developed to address the DPO overfitting issue."

KTO (Kahneman-Tversky Optimization)

From research: "KTO directly maximizes the utility of generations instead of maximizing the log-likelihood of the preferences. KTO only requires a binary signal of whether output is desirable or not, which is a kind of data easier to obtain than preferences."

Key advantage: "Without doing SFT first, DPO-aligned models tend to ramble and hallucinate entire conversations. KTO does not suffer from this phenomenon."

Performance: "KTO is good or better than DPO at all scales. For Llama models, KTO alone matches the performance of SFT and DPO combined."

ORPO (Odds Ratio Preference Optimization)

From Argilla: "ORPO creates a new objective by using an odds ratio-based loss to penalize undesirable responses along with conventional negative log-likelihood loss. It only relies on the base model as the preference alignment is performed during the SFT."

Benefits: "ORPO offers efficiency (requires significantly less computational resources than RLHF), stability (more stable training dynamics), and faster training time than the full alignment pipeline."

SimPO (Simple Preference Optimization)

From Princeton NLP: "SimPO is a simpler and more effective preference optimization algorithm than DPO that doesn't require a reference model."

Key innovation: "Using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model."
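
As a sketch, that implicit reward is just the length-averaged log-probability of the response under the policy (the full SimPO loss additionally scales this by β and applies a target reward margin):

Python
def simpo_implicit_reward(token_logprobs):
    """Average log-probability of a generated sequence under the policy."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)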

Results: "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard."

Published at NeurIPS 2024.

Online vs Offline DPO

From research: "Offline DPO is empirically inferior to online alignment methods. To explore the online algorithm of DPO, iterative and online DPO have been implemented."

Online DPO variants:

  • Self-Rewarding: Model generates both responses and preference labels
  • OAIF: Uses external LLM for preference labeling
  • sDPO: Step-wise iteration through preference data partitions

Algorithm Selection Guide

When to Use Each Algorithm

| Algorithm | Best For | Avoid When |
|---|---|---|
| PPO | Maximum quality, production models | Limited compute/memory |
| GRPO | Reasoning tasks, DeepSeek-style | Long training, MoE models |
| GSPO | MoE models, long training runs | Not yet widely available |
| REINFORCE++ | Balance of speed and stability | Need maximum quality |
| DPO | Quick alignment, limited compute | Need online learning |
| KTO | Only have binary feedback | Have preference pairs |
| SimPO | Reference-free training | Need explicit reward model |

Decision Tree

Code
Need maximum quality?
├── Yes → PPO (if compute allows) or GSPO
└── No
    ├── Have preference pairs?
    │   ├── Yes → DPO/SimPO (offline) or Online DPO
    │   └── No (only good/bad labels) → KTO
    └── Need online RL?
        ├── Yes
        │   ├── Training MoE? → GSPO
        │   ├── Need stability? → REINFORCE++
        │   └── Following DeepSeek approach? → GRPO
        └── No → DPO variants

Practical Recommendations

For reasoning models: From research: "The connection between RLHF and reasoning comes from how the DeepSeek team applied a similar RL-based approach (with GRPO) to train the reasoning capabilities of their R1 and R1-Zero models."

Consider: GRPO → GSPO → REINFORCE++

For general alignment: Start with DPO/SimPO, graduate to PPO/GSPO for maximum quality.

For MoE models: From Qwen: "GSPO has inherently resolved the stability challenges in the RL training of large MoE models."

KL divergence considerations: From research: "Recent works like DAPO and Dr. GRPO have shown that the KL divergence term is not necessary for LLM reasoning tasks."

Implementation Resources

Frameworks

OpenRLHF: From GitHub: "OpenRLHF is an easy-to-use, scalable and high-performance RLHF Framework based on Ray supporting PPO, GRPO, REINFORCE++, TIS, vLLM, Ray, Dynamic Sampling, and Async Agentic RL."

From research: "The CMU Advanced Natural Language Processing Spring 2025 course uses OpenRLHF as the RLHF framework teaching case."

TRL (Transformers Reinforcement Learning): Supports SFT, RM, PPO, DPO, IPO, KTO, and ORPO through Hugging Face.

Hyperparameter Guidance

PPO:

YAML
ppo_config:
  learning_rate: 1e-6
  kl_coefficient: 0.01-0.1
  clip_range: 0.2
  batch_size: 512+
  value_coefficient: 0.5-1.0

DPO/SimPO:

YAML
dpo_config:
  learning_rate: 5e-7 to 5e-6
  beta: 0.1-0.5
  epochs: 1-3
  batch_size: 128

GRPO:

YAML
grpo_config:
  group_size: 8-16
  clip_range: 0.2
  learning_rate: 1e-6
  kl_coefficient: 0.0  # Often not needed for reasoning

The Future of LLM RL

  1. Critic-free methods: GRPO, GSPO, REINFORCE++ show critic models aren't necessary
  2. Sequence-level optimization: GSPO's success suggests token-level may not be optimal
  3. KL-free training: Research shows KL penalty often unnecessary for reasoning
  4. Open-source parity: DAPO, OpenRLHF enabling reproduction of frontier results

Open Questions

From research: "Research has shown that the choice of RL algorithm between GRPO and RLOO, and different KL coefficients do not affect model collapse significantly."

This suggests other factors (data quality, reward design, training dynamics) may matter more than algorithm choice for many use cases.

Conclusion

The RL algorithm landscape for LLMs has evolved rapidly:

  1. PPO remains the gold standard for maximum quality when compute allows
  2. GRPO enabled reasoning breakthroughs but has stability issues
  3. GSPO fixes GRPO's problems and powers Qwen3
  4. REINFORCE++ offers the best balance of speed and stability
  5. DPO variants simplify alignment for compute-constrained settings

Choose your algorithm based on your constraints: compute budget, stability requirements, whether you need online learning, and target model architecture.

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
