RL Algorithms for LLM Training: PPO, GRPO, GSPO, and Beyond
A comprehensive guide to reinforcement learning algorithms for LLM alignment—PPO, GRPO, GSPO, REINFORCE++, DPO, and their variants. Understanding the tradeoffs that power modern AI assistants.
The RL Algorithm Landscape
Reinforcement learning has become the key differentiator in LLM capabilities. From ChatGPT's PPO-based RLHF to DeepSeek R1's GRPO to Qwen3's GSPO, the choice of RL algorithm significantly impacts training stability, efficiency, and final model quality.
Why RL algorithms matter for LLMs: After pretraining (predicting next tokens) and SFT (following instructions), RL is what makes models helpful, harmless, and honest. But RL for LLMs is uniquely challenging: you're optimizing sequences of 1000+ tokens, rewards are sparse (one signal for the whole response), and policy updates must be stable across billions of parameters. The algorithm you choose determines whether training converges smoothly or collapses catastrophically.
The stability-efficiency tradeoff: PPO is stable but requires 4 model copies in memory—impossible for 70B+ models without massive GPU clusters. DPO eliminates RL entirely but can't match PPO's final quality. GRPO halves memory but introduces instability. GSPO and REINFORCE++ fix GRPO's issues while keeping memory benefits. Each algorithm represents a different point on this tradeoff curve.
This post provides a comprehensive guide to the RL algorithms used in modern LLM training, their tradeoffs, and when to use each.
The Evolution of LLM RL
PPO (2017) → The gold standard, but complex and expensive
↓
DPO (2023) → Simplified to preference learning, no RL needed
↓
GRPO (2024) → Eliminated critic model, group-based advantages
↓
GSPO (2025) → Fixed GRPO instability with sequence-level optimization
↓
REINFORCE++ (2025) → Global normalization, most stable
From research: "By formalizing LLM post-training as a token-level MDP, algorithms such as REINFORCE, ReMax, RLOO, PPO, GRPO, and Dr. GRPO are variations of the same core principle: estimating unbiased gradients of the expected return while mitigating variance through carefully designed baselines."
PPO: The Foundation
Overview
Proximal Policy Optimization remains the most widely used algorithm for RLHF:
From Cameron Wolfe: "Due to its simplicity and effectiveness, PPO is widely used across domains and has become the go-to choice for aligning language models via RLHF."
The RLHF Pipeline with PPO
- Pre-train the LLM on internet text
- SFT on demonstration data
- Train reward model on human preferences
- RL fine-tune with PPO to maximize reward
Components Required
PPO requires four models in memory:
| Model | Purpose | Memory |
|---|---|---|
| Policy | The LLM being trained | Full model |
| Reference | Frozen SFT model for KL penalty | Full model |
| Critic | Estimates value function | Full model |
| Reward | Scores responses | Full model |
From research: "The memory overhead is high because we keep four copies of the LLM in memory: the policy, the reference policy, the critic, and the reward model."
The PPO Objective
Understanding the clipping mechanism: PPO's innovation is the clipped objective—it limits how much the policy can change in one update. If an action looks much better than expected (high advantage), vanilla policy gradients would push hard to increase its probability, potentially destabilizing training. PPO says "wait—if this action was already unlikely, don't make it too likely too fast." The clip prevents overshooting.
L_PPO = E[min(r_t × A_t, clip(r_t, 1-ε, 1+ε) × A_t)]
- β × KL(π || π_ref)
+ c × H(π)
Where:
- r_t = π(a|s) / π_old(a|s) [importance ratio]
- A_t = advantage estimate from critic
- ε = clip range (typically 0.2)
- β = KL penalty coefficient
- H(π) = entropy bonus
Breaking down each term:
- Clipped policy loss: The min(...) ensures the objective never incentivizes moving the policy ratio beyond [1-ε, 1+ε]. This is the stability mechanism.
- KL penalty: Keeps the new policy close to the reference (SFT) model, preventing reward hacking where the model finds degenerate high-reward outputs.
- Entropy bonus: Encourages exploration by penalizing overly deterministic policies. Without it, the model might collapse to always generating one "safe" response.
From research: "One of the key ideas behind PPO is that it limits how much the policy is allowed to change during each update step. This is done using a clipped loss function, which helps prevent the model from making overly large updates that could destabilize training."
PPO Advantages
- Most stable: Decades of research, well-understood
- Best performance: When tuned correctly, achieves highest quality
- Industry proven: Used by OpenAI, Anthropic for production models
PPO Disadvantages
- High memory: 4 full model copies
- Complex tuning: Many hyperparameters
- Slow training: 138% slower than REINFORCE++ in reported comparisons
From research: "While PPO achieves significant advantages in accuracy and reward, it was 138% slower than REINFORCE++ in training speed."
GRPO: DeepSeek's Innovation
The Key Insight
GRPO eliminates the critic model by estimating advantages from groups of responses to the same prompt:
From DeepSeek: "To save training costs of RL, they adopted Group Relative Policy Optimization (GRPO), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead."
How GRPO Works
import statistics

def grpo_advantage(rewards, eps=1e-8):
    """
    Given the rewards for a group of responses sampled from the same prompt,
    compute each response's advantage relative to the group statistics.
    The group mean acts as the baseline, so no critic model is needed.
    """
    group_mean = statistics.mean(rewards)
    group_std = statistics.pstdev(rewards)  # population std over the group
    advantages = []
    for reward in rewards:
        adv = (reward - group_mean) / (group_std + eps)
        advantages.append(adv)
    return advantages
GRPO vs PPO
| Aspect | PPO | GRPO |
|---|---|---|
| Critic model | Required | Not required |
| Memory | 4 models | 3 models |
| Baseline | Learned value function | Group statistics |
| Stability | High | Moderate |
| Use case | General RLHF | Reasoning models |
GRPO's Success with DeepSeek R1
DeepSeek used GRPO to train R1-Zero with pure RL:
From DeepSeek: "We directly applied reinforcement learning to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought for solving complex problems."
Results: "The pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, improves to 86.7%."
GRPO's Problems
However, GRPO has significant stability issues:
From research: "GRPO exhibits the weakest performance among the three RL algorithms evaluated."
From Qwen team: "Existing RL algorithms (such as GRPO) exhibit severe instability issues during long training and lead to irreversible model collapse, hindering further performance improvements with increased compute."
Three biases in GRPO (identified by Dr. GRPO paper):
- Baseline bias: Using a biased baseline without correcting the scaling factor
- Response-level length bias: "For correct answers (with positive advantages), this bias incentivizes shorter responses; for incorrect answers (with negative advantages), this bias results in longer responses."
- Question-level difficulty bias: "Questions within one batch can vary significantly in type, domain, and difficulty, leading to question-specific gradient estimation bias."
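As a rough illustration of the remedy proposed in the Dr. GRPO paper (my paraphrase, not code from the paper): dropping the division by the group standard deviation removes the question-level difficulty bias, and normalizing the token loss by a fixed constant rather than by each response's own length removes the length bias.

def dr_grpo_advantage(rewards):
    """Dr. GRPO-style advantage for one prompt's group of sampled responses.

    Center on the group mean but do NOT divide by the group std, so questions
    are not re-weighted by how spread out their rewards happen to be.
    (The length bias is addressed separately, in the loss: token losses are
    aggregated with a fixed normalizer instead of each response's own length.)
    """
    group_mean = sum(rewards) / len(rewards)
    return [r - group_mean for r in rewards]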
GSPO: Qwen's Solution
The Problem with Token-Level Optimization
From Qwen: "The instability of GRPO stems from the fundamental misapplication and invalidation of importance sampling weights in its algorithmic design. This introduces high-variance training noise that progressively accumulates with increased response length and is further amplified by the clipping mechanism, ultimately precipitating model collapse."
GSPO's Sequence-Level Approach
GSPO (Group Sequence Policy Optimization) operates at the sequence level instead of token level:
From Qwen: "Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization."
Key differences:
- Importance ratios computed over whole sequences
- Clipping at sequence level, not token level
- All tokens in a sequence treated equally during backprop
From research: "Optimization happens at the sequence level, not the token level. Importance ratios are calculated over the whole output. Clipping is also done at the level of full sequences."
GSPO Benefits
From Qwen:
- Better Training Stability: "GSPO has inherently resolved the stability challenges in the RL training of large Mixture-of-Experts (MoE) models, eliminating the need for complex stabilization strategies."
- Higher Efficiency: "GSPO demonstrates significantly higher training efficiency than GRPO, achieving better performance under the same training cost."
- Infrastructure-Friendly: "Due to sequence-level optimization, GSPO is fundamentally more tolerant to precision discrepancies."
Real-World Impact
From Qwen: "These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models."
REINFORCE++
Overview
REINFORCE++ achieves PPO-like stability without a value network:
From research: "REINFORCE++ achieves PPO-like training stability and efficiency without relying on a value network, by incorporating clipped policy updates, KL divergence penalties, and advantage normalization."
Key Innovation: Global Normalization
From research: "Rather than estimating advantages independently for each prompt (as in RLOO, GRPO), REINFORCE++ uses global advantage normalization. This mechanism provides better stability, particularly across heterogeneous prompts and noisy reward functions."
Performance Comparison
| Metric | PPO | GRPO | REINFORCE++ |
|---|---|---|---|
| Stability | Highest | Lowest | High |
| Speed | Slowest | Fast | Fastest |
| Memory | Highest | Lower | Lower |
| Quality | Best | Variable | Good |
From research: "Logic-RL and PRIME demonstrate that REINFORCE++ is more stable in training compared to GRPO and faster than PPO."
Efficiency data: "On Llama3 8B, REINFORCE++ reduced RLHF training time (70k samples, H100 GPU) from 60 hours (PPO) to 42 hours."
DAPO: ByteDance's Open-Source Solution
Overview
DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is ByteDance's fully open-source RL system:
From research: "DAPO is an algorithm that fully open-sources a state-of-the-art large-scale RL system achieving 50 points on AIME 2024 using Qwen2.5-32B base model."
Four Key Techniques
From the DAPO paper:
- Clip-higher: "Increases the upper bound of the PPO clipping range to encourage exploration and prevent entropy collapse during training."
- Dynamic sampling: "Improves training efficiency by filtering out prompts where all sampled responses are either always correct or always wrong."
- Token-level policy gradient loss: "Moves from sample-level to token-level loss calculation so that longer responses can have more influence on the gradient update."
- Overlong reward shaping: "Adds a soft penalty for responses that get truncated for being too long, which reduces reward noise and helps stabilize training."
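Of these, dynamic sampling is the simplest to picture in code. A minimal sketch, assuming binary per-response correctness scores and helper names of my own choosing: groups that are all-correct or all-wrong yield zero group-relative advantage, so they are filtered out of the batch.

def keep_prompt(group_rewards):
    # Drop prompts where every sampled response is correct or every one is
    # wrong: their group-relative advantages are all zero, so they consume
    # compute without providing any learning signal.
    return 0 < sum(group_rewards) < len(group_rewards)

def dynamic_sample(batch):
    # batch: list of (prompt, group_rewards) pairs from the rollout phase,
    # with group_rewards as 0/1 correctness scores.
    return [(prompt, rewards) for prompt, rewards in batch
            if keep_prompt(rewards)]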
Significance
From research: "Unlike previous works that withhold training details, DAPO introduces four key techniques that make large-scale LLM RL successful."
Published as a NeurIPS 2025 poster.
RLOO: REINFORCE Leave-One-Out
The Approach
From research: "RLOO eliminates the need for a value function at each time-step by replacing it with the expected return over multiple trajectories sampled on the fly."
For each prompt, RLOO:
- Samples N responses
- For response i, uses average reward of other N-1 responses as baseline
- This provides unbiased gradient estimates without a critic
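A short sketch of the leave-one-out baseline (illustrative; it requires at least two samples per prompt):

def rloo_advantages(rewards):
    # For each of the N sampled responses, the baseline is the mean reward of
    # the other N-1 responses, which keeps the gradient estimate unbiased
    # without training a critic. Requires len(rewards) >= 2.
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]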
Why It Works for LLMs
From research: "Not all RL tasks allow multiple trajectory sampling from the same state. In LLM post-training, the agent has significant control over transitions, enabling multiple trajectory sampling and making RLOO viable."
GRPO's Relationship to RLOO
From research: "GRPO is extremely closely related to other RL algorithms—it's derived from PPO and has a similar advantage computation as RLOO."
DPO and Preference-Based Methods
Why DPO Changed Everything
DPO eliminates RL entirely:
From research: "Direct Preference Optimization (DPO) offers a streamlined alternative to RLHF by optimizing the same objective but bypasses the explicit need for a separate reward model, thereby reducing the computational costs."
The DPO Objective
L_DPO = -log(σ(β × (log π(y_w|x)/π_ref(y_w|x)
- log π(y_l|x)/π_ref(y_l|x))))
Where:
- y_w = winning (preferred) response
- y_l = losing (rejected) response
- π_ref = reference model
- β = temperature
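A minimal sketch of this loss, assuming per-response log-probabilities have already been summed over tokens for both the policy and the reference model (tensor names are illustrative):

import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: scaled log-ratios of policy to reference for each response.
    chosen_reward = beta * (logp_w - ref_logp_w)
    rejected_reward = beta * (logp_l - ref_logp_l)
    # Maximize the margin between the preferred and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()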
DPO Limitations
From research: "DPO encounters several limitations: 1) high dependency on the SFT part, 2) tendency to overfit beyond a single epoch, and 3) inefficient learning and memory utilization."
From research: "Standard DPO is an offline training method, and it has become increasingly clear that it underperforms compared to more advanced online reinforcement learning (RL) techniques like RLHF with PPO."
DPO Variants
IPO (Identity Policy Optimization)
From Argilla: "The IPO paper provides a strong theoretical framework explaining the basis of RLHF and DPO, highlighting major shortcomings of these approaches. To avoid overfitting and weak regularization, IPO adds a regularization term."
Purpose: "IPO was developed to address the DPO overfitting issue."
KTO (Kahneman-Tversky Optimization)
From research: "KTO directly maximizes the utility of generations instead of maximizing the log-likelihood of the preferences. KTO only requires a binary signal of whether output is desirable or not, which is a kind of data easier to obtain than preferences."
Key advantage: "Without doing SFT first, DPO-aligned models tend to ramble and hallucinate entire conversations. KTO does not suffer from this phenomenon."
Performance: "KTO is good or better than DPO at all scales. For Llama models, KTO alone matches the performance of SFT and DPO combined."
ORPO (Odds Ratio Preference Optimization)
From Argilla: "ORPO creates a new objective by using an odds ratio-based loss to penalize undesirable responses along with conventional negative log-likelihood loss. It only relies on the base model as the preference alignment is performed during the SFT."
Benefits: "ORPO offers efficiency (requires significantly less computational resources than RLHF), stability (more stable training dynamics), and faster training time than the full alignment pipeline."
SimPO (Simple Preference Optimization)
From Princeton NLP: "SimPO is a simpler and more effective preference optimization algorithm than DPO that doesn't require a reference model."
Key innovation: "Using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model."
Results: "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard."
Published at NeurIPS 2024.
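A sketch of that implicit reward and the resulting loss, based on my reading of the SimPO paper; the target-margin term gamma, the default values, and the variable names are illustrative.

import torch.nn.functional as F

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    # Implicit reward: length-averaged log-probability of each response,
    # scaled by beta. No reference model is involved.
    reward_w = beta * logp_w / len_w
    reward_l = beta * logp_l / len_l
    # The preferred response must beat the rejected one by a margin gamma.
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()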
Online vs Offline DPO
From research: "Offline DPO is empirically inferior to online alignment methods. To explore the online algorithm of DPO, iterative and online DPO have been implemented."
Online DPO variants:
- Self-Rewarding: Model generates both responses and preference labels
- OAIF: Uses external LLM for preference labeling
- sDPO: Step-wise iteration through preference data partitions
Algorithm Selection Guide
When to Use Each Algorithm
| Algorithm | Best For | Avoid When |
|---|---|---|
| PPO | Maximum quality, production models | Limited compute/memory |
| GRPO | Reasoning tasks, DeepSeek-style | Long training, MoE models |
| GSPO | MoE models, long training runs | Your training framework doesn't yet support it |
| REINFORCE++ | Balance of speed and stability | Need maximum quality |
| DPO | Quick alignment, limited compute | Need online learning |
| KTO | Only have binary feedback | Have preference pairs |
| SimPO | Reference-free training | Need explicit reward model |
Decision Tree
Need maximum quality?
├── Yes → PPO (if compute allows) or GSPO
└── No
├── Have preference pairs?
│ ├── Yes → DPO/SimPO (offline) or Online DPO
│ └── No (only good/bad labels) → KTO
└── Need online RL?
├── Yes
│ ├── Training MoE? → GSPO
│ ├── Need stability? → REINFORCE++
│ └── Following DeepSeek approach? → GRPO
└── No → DPO variants
Practical Recommendations
For reasoning models: From research: "The connection between RLHF and reasoning comes from how the DeepSeek team applied a similar RL-based approach (with GRPO) to train the reasoning capabilities of their R1 and R1-Zero models."
Consider: GRPO → GSPO → REINFORCE++
For general alignment: Start with DPO/SimPO, graduate to PPO/GSPO for maximum quality.
For MoE models: From Qwen: "GSPO has inherently resolved the stability challenges in the RL training of large MoE models."
KL divergence considerations: From research: "Recent works like DAPO and Dr. GRPO have shown that the KL divergence term is not necessary for LLM reasoning tasks."
Implementation Resources
Frameworks
OpenRLHF: From GitHub: "OpenRLHF is an easy-to-use, scalable and high-performance RLHF Framework based on Ray supporting PPO, GRPO, REINFORCE++, TIS, vLLM, Ray, Dynamic Sampling, and Async Agentic RL."
From research: "The CMU Advanced Natural Language Processing Spring 2025 course uses OpenRLHF as the RLHF framework teaching case."
TRL (Transformers Reinforcement Learning): Supports SFT, RM, PPO, DPO, IPO, KTO, and ORPO through Hugging Face.
Hyperparameter Guidance
PPO:
ppo_config:
learning_rate: 1e-6
kl_coefficient: 0.01-0.1
clip_range: 0.2
batch_size: 512+
value_coefficient: 0.5-1.0
DPO/SimPO:
dpo_config:
learning_rate: 5e-7 to 5e-6
beta: 0.1-0.5
epochs: 1-3
batch_size: 128
GRPO:
grpo_config:
group_size: 8-16
clip_range: 0.2
learning_rate: 1e-6
kl_coefficient: 0.0 # Often not needed for reasoning
The Future of LLM RL
Trends
- Critic-free methods: GRPO, GSPO, REINFORCE++ show critic models aren't necessary
- Sequence-level optimization: GSPO's success suggests token-level may not be optimal
- KL-free training: Research shows KL penalty often unnecessary for reasoning
- Open-source parity: DAPO, OpenRLHF enabling reproduction of frontier results
Open Questions
From research: "Research has shown that the choice of RL algorithm between GRPO and RLOO, and different KL coefficients do not affect model collapse significantly."
This suggests other factors (data quality, reward design, training dynamics) may matter more than algorithm choice for many use cases.
Conclusion
The RL algorithm landscape for LLMs has evolved rapidly:
- PPO remains the gold standard for maximum quality when compute allows
- GRPO enabled reasoning breakthroughs but has stability issues
- GSPO fixes GRPO's problems and powers Qwen3
- REINFORCE++ offers the best balance of speed and stability
- DPO variants simplify alignment for compute-constrained settings
Choose your algorithm based on your constraints: compute budget, stability requirements, whether you need online learning, and target model architecture.
Related Articles
GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.
RLVR: Reinforcement Learning with Verifiable Rewards
Understanding Reinforcement Learning with Verifiable Rewards (RLVR)—the technique behind DeepSeek R1's reasoning capabilities, process reward models, and when to use verifiable vs human feedback.
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.