GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.
The Evolution of Preference Optimization
The journey from RLHF to simpler alternatives:
PPO (RLHF) → Complex, effective, hard to tune
↓
DPO → Simpler, no reward model, competitive
↓
IPO → Addresses DPO overfitting
↓
GRPO → Group-based, no reference model needed, efficient
GRPO (Group Relative Policy Optimization) represents the latest evolution—offering the benefits of preference optimization without the complexity of PPO or the reference model requirement of DPO.
What Makes GRPO Different
The Key Insight
Traditional methods compare responses to a reference policy or reward model. GRPO takes a different approach: compare responses within a group.
DPO approach:
For each preference pair (chosen, rejected):
- Compare policy probability vs. reference model probability
- Optimize to increase chosen, decrease rejected, relative to reference
GRPO approach:
For each prompt, generate multiple responses:
- Score all responses (using reward model or other signal)
- Normalize scores within the group
- Optimize policy relative to group statistics
Why Group-Based?
Eliminates reference model: DPO requires keeping a frozen reference model in memory. GRPO compares within the current batch, eliminating this requirement.
Better credit assignment: Instead of binary chosen/rejected, GRPO uses relative rankings within groups, providing richer training signal.
More stable optimization: Group normalization reduces variance and makes training more stable.
The GRPO Algorithm
Step-by-Step
The GRPO training loop is conceptually simple, though implementation requires care around memory management and numerical stability.
- Sample prompts from training distribution
- Generate multiple responses per prompt (group size G, typically 4-16)
- Score responses using reward model or other metric
- Compute group statistics (mean, std) for each prompt's responses
- Normalize scores within groups → advantages
- Update policy to increase probability of high-advantage responses
Why this works intuitively: By generating multiple responses to the same prompt and comparing them, we create a natural ranking. The model learns "for this type of question, responses with these characteristics score higher." The group normalization means we're always asking "is this response better or worse than what we typically generate for this prompt?"—a relative question rather than an absolute one.
The computational tradeoff: Generating G responses per prompt means G times the inference cost. But we extract G training signals from each prompt, making training more sample-efficient. For expensive reward models (like LLM-as-judge), this amortizes the reward computation cost across multiple policy updates.
Mathematical Formulation
Advantage computation:

$$A_i = \frac{r_i - \mu_g}{\sigma_g}$$

where $A_i$ is the advantage of response $i$, $r_i$ is the reward, $\mu_g$ is the mean reward of group $g$, and $\sigma_g$ is the standard deviation.

Policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\, A_i \, \nabla_\theta \log \pi_\theta(y_i \mid x) \,\big]$$

With clipping (similar to PPO):

$$L^{\text{CLIP}}(\theta) = \mathbb{E}\Big[ \min\big( r_i(\theta)\, A_i,\ \operatorname{clip}(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_i \big) \Big]$$

where $r_i(\theta) = \pi_\theta(y_i \mid x) / \pi_{\theta_\text{old}}(y_i \mid x)$ is the probability ratio and $\epsilon$ is the clip range (e.g., 0.2).

KL regularization (optional):

$$L(\theta) = L^{\text{CLIP}}(\theta) - \beta \, D_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right)$$
Understanding the math in plain language:
The advantage is simply "how much better is this response than average for this group?" Positive advantage means better than average; negative means worse. The standard deviation normalization ensures advantages are on a consistent scale across different prompts (some prompts might have high variance in response quality, others low).
The policy gradient says: increase the probability of generating responses with positive advantage, decrease the probability of responses with negative advantage. The gradient scales with the advantage magnitude—much better responses get stronger positive updates.
The clipping (from PPO) prevents any single update from changing the policy too drastically. If the probability ratio r gets too far from 1, we clip it. This adds stability: we don't let one surprisingly good or bad example dominate training.
The KL regularization is optional but often helpful. It penalizes the policy for drifting too far from its starting point, preventing mode collapse and maintaining diversity. The β parameter controls the strength of this regularization.
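A tiny worked example (with made-up rewards for one group of four responses) makes the normalization concrete:

```python
import numpy as np

# Hypothetical rewards for one group of four responses to the same prompt
rewards = np.array([0.2, 0.9, 0.5, 0.4])

mean, std = rewards.mean(), rewards.std()
advantages = (rewards - mean) / (std + 1e-8)

print(advantages.round(2))  # roughly [-1.18  1.57  0.   -0.39]
```

The best response gets a strong positive advantage, the average one sits near zero, and the worst gets pushed down, regardless of the absolute reward scale for this prompt.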
Pseudocode
import numpy as np

def grpo_update(policy, prompts, reward_model, optimizer, group_size=8):
    all_advantages = []
    all_responses = []
    all_prompts = []

    for prompt in prompts:
        # Generate a group of responses by sampling the current policy
        responses = [policy.generate(prompt) for _ in range(group_size)]

        # Score each response
        rewards = [reward_model(prompt, r) for r in responses]

        # Compute group-normalized advantages
        mean_reward = np.mean(rewards)
        std_reward = np.std(rewards) + 1e-8  # avoid division by zero
        advantages = [(r - mean_reward) / std_reward for r in rewards]

        all_advantages.extend(advantages)
        all_responses.extend(responses)
        all_prompts.extend([prompt] * group_size)

    # Policy gradient update with a PPO-style clipped objective
    loss = compute_clipped_policy_loss(
        policy, all_prompts, all_responses, all_advantages
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Walking through the pseudocode:
The outer loop iterates through prompts. For each prompt, we generate group_size different responses by sampling from the current policy. This is where the "group" comes from—multiple attempts at the same question.
The reward model scores each response. This could be a learned reward model, an LLM judge, or even a simple rule-based metric (like correctness for math problems). The key requirement: the reward should be a scalar that indicates response quality.
The group normalization ((r - mean_reward) / std_reward) converts raw rewards to advantages. The + 1e-8 prevents division by zero when all responses have identical rewards (rare but possible).
We accumulate data across all prompts before the gradient update. This is important: we want each update to see diverse prompts, not just variations of a single one.
The compute_clipped_policy_loss function (not shown) implements the PPO-style clipped objective. It computes the log probability of each response under the current policy, compares to a stored old probability, and applies clipping and advantage weighting.
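For readers who want that piece spelled out, here is a minimal sketch of what compute_clipped_policy_loss could look like in PyTorch. The sequence_log_prob helper and the handling of old log probabilities are assumptions for illustration, not part of any particular library:

```python
import torch

def compute_clipped_policy_loss(policy, prompts, responses, advantages,
                                old_log_probs=None, clip_range=0.2):
    """PPO-style clipped surrogate over a flat batch of sampled responses."""
    # Sequence log-probability of each response under the current policy.
    # sequence_log_prob is an assumed helper that sums token log-probs.
    new_log_probs = torch.stack(
        [sequence_log_prob(policy, p, r) for p, r in zip(prompts, responses)]
    )

    if old_log_probs is None:
        # Fully on-policy case (one update per generation): the ratio is 1,
        # but detaching keeps the gradient of log pi_theta flowing.
        old_log_probs = new_log_probs.detach()
    advantages = torch.as_tensor(advantages, dtype=new_log_probs.dtype,
                                 device=new_log_probs.device)

    # Probability ratio r_i(theta) = pi_theta(y_i|x) / pi_theta_old(y_i|x)
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Take the more pessimistic of the clipped / unclipped terms, then negate
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()
```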
Comparing GRPO to Alternatives
GRPO vs. PPO (RLHF)
| Aspect | PPO | GRPO |
|---|---|---|
| Reward model | Required | Required (or a rule-based/verifiable reward) |
| Value model | Required | Not required |
| Reference model | Required for KL | Optional |
| Memory usage | High (3+ models) | Lower (2 models) |
| Stability | Requires careful tuning | More stable |
| Sample efficiency | Lower | Higher (multiple per prompt) |
GRPO vs. DPO
| Aspect | DPO | GRPO |
|---|---|---|
| Data format | Pairwise preferences | Responses + scores |
| Reference model | Required | Not required |
| Online/Offline | Offline (fixed data) | Online (generates during training) |
| Signal richness | Binary (better/worse) | Continuous (relative advantages) |
| Training | Simpler | Slightly more complex |
GRPO vs. REINFORCE
GRPO can be seen as a variance-reduced REINFORCE:
REINFORCE:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\, r(y) \, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]$$

GRPO:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \frac{r(y_i) - \mu_g}{\sigma_g} \, \nabla_\theta \log \pi_\theta(y_i \mid x) \right]$$
The group normalization acts as an adaptive baseline, reducing variance.
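A quick numeric illustration of that baseline effect, with made-up rewards: because every reward is positive, REINFORCE pushes up the probability of all four samples, while the group-normalized GRPO weights push down the below-average ones.

```python
import numpy as np

rewards = np.array([0.6, 0.7, 0.65, 0.9])  # hypothetical scores, all positive

reinforce_weights = rewards  # REINFORCE: every sample gets pushed up
grpo_weights = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(grpo_weights.round(2))  # roughly [-0.99 -0.11 -0.55  1.65]
```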
Implementation Details
Group Size Selection
Larger groups = lower variance, higher compute:
| Group Size | Variance | Compute | Recommendation |
|---|---|---|---|
| 2 | High | Low | Not recommended |
| 4 | Medium | Medium | Minimum viable |
| 8 | Low | High | Good default |
| 16 | Very low | Very high | Maximum quality |
Practical choice: Start with 8, reduce if compute-limited.
Reward Model Integration
GRPO needs scores for each response:
Option 1: Trained reward model
reward = reward_model(prompt, response) # Scalar output
Option 2: LLM-as-judge
reward = llm_judge(prompt, response, rubric) # Returns score
Option 3: Rule-based rewards (see the sketch after these options)
reward = compute_rule_rewards(response) # Length, format, etc.
Option 4: Outcome-based rewards
reward = verify_outcome(response) # Correct/incorrect for reasoning
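As a concrete sketch of the rule-based option, here is a toy reward with made-up checks and weights; the specific rules are illustrative only:

```python
import re

def compute_rule_rewards(response: str) -> float:
    """Toy rule-based reward: format and length checks only."""
    reward = 0.0
    if re.search(r"\\boxed\{.+\}", response):  # contains a boxed final answer
        reward += 0.5
    if len(response.split()) <= 512:           # stays within a length budget
        reward += 0.3
    if response.strip().endswith("."):         # ends with a complete sentence
        reward += 0.2
    return reward
```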
Handling Long Responses
Generation is expensive. Strategies:
Early stopping: Stop generation if reward model signals low quality.
Response caching: Reuse good responses across training steps.
Importance sampling: Weight older responses by probability ratio.
Memory Optimization
GRPO generates multiple responses, increasing memory:
Gradient accumulation:
optimizer.zero_grad()
for micro_batch in batch.split(micro_batch_size):
    loss = compute_loss(micro_batch)
    (loss / num_micro_batches).backward()  # accumulate scaled gradients
optimizer.step()
Response chunking: Generate and score in chunks, don't store all simultaneously.
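A sketch of response chunking, under the assumption that generation and scoring can be interleaved; generate_groups and score_responses are the same assumed helpers used in the training loop below:

```python
def iter_scored_chunks(policy, reward_model, prompts, group_size=8, chunk_size=4):
    """Yield (prompts, responses, rewards) a few prompts at a time, so the
    full batch of generations never has to sit in memory at once."""
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start:start + chunk_size]
        responses = generate_groups(policy, chunk, group_size)     # assumed helper
        rewards = score_responses(reward_model, chunk, responses)  # assumed helper
        yield chunk, responses, rewards
```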
Training Recipe
Hyperparameters
grpo_config:
  group_size: 8
  clip_range: 0.2
  learning_rate: 1e-6
  batch_size: 512            # Total responses per update
  kl_coefficient: 0.0        # Often not needed
  max_response_length: 1024
  temperature: 0.8           # For diversity in generation
  epochs: 2                  # Typically 1-3
Training Loop
for epoch in range(num_epochs):
    for batch in dataloader:
        prompts = batch['prompts']

        # Generate groups and score them (no gradients needed here)
        with torch.no_grad():
            responses = generate_groups(policy, prompts, group_size)
            rewards = score_responses(reward_model, prompts, responses)

        # Compute group-normalized advantages (a sketch of this helper follows the loop)
        advantages = compute_group_advantages(rewards, group_size)

        # Policy update
        loss = grpo_loss(policy, prompts, responses, advantages)
        optimizer.zero_grad()
        loss.backward()

        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()

        # Logging
        log_metrics(loss, advantages, rewards)
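For completeness, a minimal compute_group_advantages consistent with the loop above, assuming rewards arrive as a flat list ordered group by group:

```python
import numpy as np

def compute_group_advantages(rewards, group_size):
    """Normalize rewards within each consecutive group of `group_size` responses."""
    rewards = np.asarray(rewards, dtype=np.float32).reshape(-1, group_size)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    return ((rewards - mean) / std).reshape(-1)
```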
Monitoring
Track during training (a logging sketch follows this list):
- Mean reward: Should increase
- Reward variance: Should decrease as policy improves
- Advantage distribution: Should center around 0
- Policy entropy: Watch for collapse
- KL from initial policy: Monitor drift
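One way to compute these, assuming the per-response rewards, advantages, and sequence log-probabilities are available as tensors (the function and key names here are illustrative):

```python
import torch

def compute_monitoring_metrics(rewards, advantages, new_log_probs, init_log_probs):
    """Summary statistics for the metrics listed above.

    new_log_probs / init_log_probs: per-response sequence log-probabilities
    under the current policy and the initial (pre-GRPO) policy, for responses
    sampled from the current policy.
    """
    return {
        "reward/mean": rewards.mean().item(),
        "reward/std": rewards.std().item(),
        "advantage/mean": advantages.mean().item(),  # should sit near zero
        "advantage/std": advantages.std().item(),
        # Average negative log-probability of sampled responses: a rough
        # proxy for policy entropy (watch for it collapsing toward zero)
        "policy/neg_log_prob": (-new_log_probs).mean().item(),
        # Monte Carlo estimate of KL(current || initial) over the samples
        "policy/kl_from_init": (new_log_probs - init_log_probs).mean().item(),
    }
```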
DeepSeek's GRPO Implementation
DeepSeek used GRPO to train their reasoning models:
Key Innovations
Outcome-based rewards for reasoning:
def reasoning_reward(prompt, response):
    # Extract final answer
    answer = extract_answer(response)
    # Compare to ground truth
    correct = verify_answer(answer, prompt.expected_answer)
    return 1.0 if correct else 0.0
Process rewards (optional):
def process_reward(prompt, response):
    steps = extract_reasoning_steps(response)
    if not steps:
        return 0.0  # no identifiable steps to score
    step_scores = [verify_step(step) for step in steps]
    return sum(step_scores) / len(step_scores)
Long CoT training: Allow very long responses for complex reasoning and reward only final-answer correctness.
Training Efficiency
DeepSeek reported significant efficiency gains:
- Fewer models in memory (no value function)
- Better sample efficiency (group comparisons)
- More stable training (group normalization)
Advanced Techniques
Iterative GRPO
Run GRPO multiple times with improving reward signals, as sketched in code below:
Iteration 1: Train with basic reward model
Iteration 2: Train reward model on iteration 1 outputs, retrain policy
Iteration 3: Repeat...
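Sketched as a loop, with run_grpo and retrain_reward_model as placeholder names for the two stages (not functions from any specific library):

```python
def iterative_grpo(policy, reward_model, prompts, num_iterations=3):
    for iteration in range(num_iterations):
        # Stage 1: train the policy against the current reward signal
        policy = run_grpo(policy, reward_model, prompts)  # placeholder GRPO run

        # Stage 2: sample fresh outputs from the improved policy and use them
        # (with new labels) to refresh the reward model for the next round
        samples = [policy.generate(p) for p in prompts]
        reward_model = retrain_reward_model(reward_model, prompts, samples)  # placeholder
    return policy, reward_model
```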
Multi-Objective GRPO
Balance multiple reward signals:
def multi_objective_reward(prompt, response):
    helpfulness = helpfulness_model(prompt, response)
    safety = safety_model(prompt, response)
    quality = quality_model(prompt, response)
    # Weighted combination
    return 0.5 * helpfulness + 0.3 * safety + 0.2 * quality
Curriculum Learning
Start with easier prompts, progress to harder:
def get_curriculum_prompts(epoch, all_prompts):
    difficulty_scores = [estimate_difficulty(p) for p in all_prompts]
    threshold = min(1.0, 0.3 + 0.1 * epoch)  # Increase difficulty over time
    return [p for p, d in zip(all_prompts, difficulty_scores) if d <= threshold]
Best-of-N Distillation
Alternative to online GRPO:
- Generate N responses per prompt
- Select best using reward model
- Fine-tune on best responses (SFT)
- Repeat
Simpler than GRPO, but it does not optimize a policy-gradient objective directly.
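A compact sketch of one best-of-N round; sft_finetune stands in for an ordinary supervised fine-tuning step:

```python
def best_of_n_round(policy, reward_model, prompts, n=8):
    sft_pairs = []
    for prompt in prompts:
        # Sample N candidates and keep the one the reward model scores highest
        candidates = [policy.generate(prompt) for _ in range(n)]
        scores = [reward_model(prompt, c) for c in candidates]
        sft_pairs.append((prompt, candidates[scores.index(max(scores))]))

    # Ordinary supervised fine-tuning on the selected responses
    return sft_finetune(policy, sft_pairs)  # placeholder SFT step
```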
When to Use GRPO
Good Fit
- Reasoning tasks: Where correctness is verifiable
- Limited memory: Can't afford PPO's multiple models
- Need online learning: Want to generate during training
- Have reward model: Or can compute rewards automatically
Less Suitable
- Only have preference pairs: DPO is simpler
- Need reference model anyway: PPO might work equally well
- Very limited compute: Best-of-N distillation is simpler
Practical Tips
Reward Model Quality
GRPO is only as good as your rewards:
- Train a strong reward model first
- Validate on held-out preferences
- Watch for reward hacking
Generation Diversity
Groups need diverse responses:
- Use temperature > 0 (0.7-1.0)
- Consider top-p sampling
- Vary prompting slightly
Gradient Stability
Large groups can cause gradient issues:
- Clip gradients (max_norm=1.0)
- Use learning rate warmup
- Monitor for NaN/Inf
Evaluation
Don't just track training metrics:
- Evaluate on held-out prompts
- Check capability retention
- Human evaluation periodically
Conclusion
GRPO offers a compelling middle ground in preference optimization—simpler than PPO, more flexible than DPO, and well-suited for tasks with computable rewards. DeepSeek's success demonstrates its effectiveness for reasoning tasks.
The key insight—comparing within groups rather than to a reference—eliminates infrastructure complexity while maintaining training signal quality. For teams looking to move beyond DPO without the full complexity of PPO, GRPO is worth serious consideration.