GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.
The Evolution of Preference Optimization
The journey from RLHF to simpler alternatives:
PPO (RLHF) → Complex, effective, hard to tune
↓
DPO → Simpler, no reward model, competitive
↓
IPO → Addresses DPO overfitting
↓
GRPO → Group-based, no reference model needed, efficient
GRPO (Group Relative Policy Optimization) represents the latest evolution—offering the benefits of preference optimization without the complexity of PPO or the reference model requirement of DPO.
What Makes GRPO Different
The Key Insight
Traditional methods compare responses to a reference policy or reward model. GRPO takes a different approach: compare responses within a group.
DPO approach:
For each preference pair (chosen, rejected):
- Compare policy probability vs. reference model probability
- Optimize to increase chosen, decrease rejected, relative to reference
GRPO approach:
For each prompt, generate multiple responses:
- Score all responses (using reward model or other signal)
- Normalize scores within the group
- Optimize policy relative to group statistics
Why Group-Based?
Eliminates reference model: DPO requires keeping a frozen reference model in memory. GRPO compares within the current batch, eliminating this requirement.
Better credit assignment: Instead of binary chosen/rejected, GRPO uses relative rankings within groups, providing richer training signal.
More stable optimization: Group normalization reduces variance and makes training more stable.
The GRPO Algorithm
Step-by-Step
The GRPO training loop is conceptually simple, though implementation requires care around memory management and numerical stability.
- Sample prompts from training distribution
- Generate multiple responses per prompt (group size G, typically 4-16)
- Score responses using reward model or other metric
- Compute group statistics (mean, std) for each prompt's responses
- Normalize scores within groups → advantages
- Update policy to increase probability of high-advantage responses
Why this works intuitively: By generating multiple responses to the same prompt and comparing them, we create a natural ranking. The model learns "for this type of question, responses with these characteristics score higher." The group normalization means we're always asking "is this response better or worse than what we typically generate for this prompt?"—a relative question rather than an absolute one.
The computational tradeoff: Generating G responses per prompt means G times the inference cost. But we extract G training signals from each prompt, making training more sample-efficient. For expensive reward models (like LLM-as-judge), this amortizes the reward computation cost across multiple policy updates.
Mathematical Formulation
Advantage computation:

$$A_i = \frac{r_i - \mu_g}{\sigma_g}$$

where $A_i$ is the advantage of response $i$, $r_i$ is the reward, $\mu_g$ is the mean reward of group $g$, and $\sigma_g$ is the standard deviation.

Policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\, A_i \, \nabla_\theta \log \pi_\theta(y_i \mid x) \,\big]$$

With clipping (similar to PPO):

$$L^{\text{CLIP}}(\theta) = \mathbb{E}\Big[ \min\big( r_i(\theta)\, A_i,\ \operatorname{clip}(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_i \big) \Big]$$

where $r_i(\theta) = \pi_\theta(y_i \mid x) / \pi_{\theta_\text{old}}(y_i \mid x)$ is the probability ratio and $\epsilon$ is the clip range (e.g., 0.2).

KL regularization (optional):

$$L(\theta) = L^{\text{CLIP}}(\theta) - \beta \, D_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right)$$
Understanding the math in plain language:
The advantage is simply "how much better is this response than average for this group?" Positive advantage means better than average; negative means worse. The standard deviation normalization ensures advantages are on a consistent scale across different prompts (some prompts might have high variance in response quality, others low).
The policy gradient says: increase the probability of generating responses with positive advantage, decrease the probability of responses with negative advantage. The gradient scales with the advantage magnitude—much better responses get stronger positive updates.
The clipping (from PPO) prevents any single update from changing the policy too drastically. If the probability ratio r gets too far from 1, we clip it. This adds stability: we don't let one surprisingly good or bad example dominate training.
The KL regularization is optional but often helpful. It penalizes the policy for drifting too far from its starting point, preventing mode collapse and maintaining diversity. The β parameter controls the strength of this regularization.
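A tiny worked example (with made-up rewards for one group of four responses) makes the normalization concrete:

```python
import numpy as np

# Hypothetical rewards for one group of four responses to the same prompt
rewards = np.array([0.2, 0.9, 0.5, 0.4])

mean, std = rewards.mean(), rewards.std()
advantages = (rewards - mean) / (std + 1e-8)

print(advantages.round(2))  # roughly [-1.18  1.57  0.   -0.39]
```

The best response gets a strong positive advantage, the average one sits near zero, and the worst gets pushed down, regardless of the absolute reward scale for this prompt.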
Pseudocode
import numpy as np

def grpo_update(policy, prompts, reward_model, optimizer, group_size=8):
    all_advantages = []
    all_responses = []
    all_prompts = []

    for prompt in prompts:
        # Generate a group of responses by sampling the current policy
        responses = [policy.generate(prompt) for _ in range(group_size)]

        # Score each response
        rewards = [reward_model(prompt, r) for r in responses]

        # Compute group-normalized advantages
        mean_reward = np.mean(rewards)
        std_reward = np.std(rewards) + 1e-8  # avoid division by zero
        advantages = [(r - mean_reward) / std_reward for r in rewards]

        all_advantages.extend(advantages)
        all_responses.extend(responses)
        all_prompts.extend([prompt] * group_size)

    # Policy gradient update with a PPO-style clipped objective
    loss = compute_clipped_policy_loss(
        policy, all_prompts, all_responses, all_advantages
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Walking through the pseudocode:
The outer loop iterates through prompts. For each prompt, we generate group_size different responses by sampling from the current policy. This is where the "group" comes from—multiple attempts at the same question.
The reward model scores each response. This could be a learned reward model, an LLM judge, or even a simple rule-based metric (like correctness for math problems). The key requirement: the reward should be a scalar that indicates response quality.
The group normalization ((r - mean_reward) / std_reward) converts raw rewards to advantages. The + 1e-8 prevents division by zero when all responses have identical rewards (rare but possible).
We accumulate data across all prompts before the gradient update. This is important: we want each update to see diverse prompts, not just variations of a single one.
The compute_clipped_policy_loss function (not shown) implements the PPO-style clipped objective. It computes the log probability of each response under the current policy, compares to a stored old probability, and applies clipping and advantage weighting.
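For readers who want that piece spelled out, here is a minimal sketch of what compute_clipped_policy_loss could look like in PyTorch. The sequence_log_prob helper and the handling of old log probabilities are assumptions for illustration, not part of any particular library:

```python
import torch

def compute_clipped_policy_loss(policy, prompts, responses, advantages,
                                old_log_probs=None, clip_range=0.2):
    """PPO-style clipped surrogate over a flat batch of sampled responses."""
    # Sequence log-probability of each response under the current policy.
    # sequence_log_prob is an assumed helper that sums token log-probs.
    new_log_probs = torch.stack(
        [sequence_log_prob(policy, p, r) for p, r in zip(prompts, responses)]
    )

    if old_log_probs is None:
        # Fully on-policy case (one update per generation): the ratio is 1,
        # but detaching keeps the gradient of log pi_theta flowing.
        old_log_probs = new_log_probs.detach()
    advantages = torch.as_tensor(advantages, dtype=new_log_probs.dtype,
                                 device=new_log_probs.device)

    # Probability ratio r_i(theta) = pi_theta(y_i|x) / pi_theta_old(y_i|x)
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Take the more pessimistic of the clipped / unclipped terms, then negate
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()
```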
Comparing GRPO to Alternatives
GRPO vs. PPO (RLHF)
| Aspect | PPO | GRPO |
|---|---|---|
| Reward model | Required | Required (or a rule-based/verifiable reward) |
| Value model | Required | Not required |
| Reference model | Required for KL | Optional |
| Memory usage | High (3+ models) | Lower (2 models) |
| Stability | Requires careful tuning | More stable |
| Sample efficiency | Lower | Higher (multiple per prompt) |
GRPO vs. DPO
| Aspect | DPO | GRPO |
|---|---|---|
| Data format | Pairwise preferences | Responses + scores |
| Reference model | Required | Not required |
| Online/Offline | Offline (fixed data) | Online (generates during training) |
| Signal richness | Binary (better/worse) | Continuous (relative advantages) |
| Training | Simpler | Slightly more complex |
GRPO vs. REINFORCE
GRPO can be seen as a variance-reduced REINFORCE:
REINFORCE:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\, r(y) \, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]$$

GRPO:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \frac{r(y_i) - \mu_g}{\sigma_g} \, \nabla_\theta \log \pi_\theta(y_i \mid x) \right]$$
The group normalization acts as an adaptive baseline, reducing variance.
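A quick numeric illustration of that baseline effect, with made-up rewards: because every reward is positive, REINFORCE pushes up the probability of all four samples, while the group-normalized GRPO weights push down the below-average ones.

```python
import numpy as np

rewards = np.array([0.6, 0.7, 0.65, 0.9])  # hypothetical scores, all positive

reinforce_weights = rewards  # REINFORCE: every sample gets pushed up
grpo_weights = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(grpo_weights.round(2))  # roughly [-0.99 -0.11 -0.55  1.65]
```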
Implementation Details
Group Size Selection
Larger groups = lower variance, higher compute:
| Group Size | Variance | Compute | Recommendation |
|---|---|---|---|
| 2 | High | Low | Not recommended |
| 4 | Medium | Medium | Minimum viable |
| 8 | Low | High | Good default |
| 16 | Very low | Very high | Maximum quality |
Practical choice: Start with 8, reduce if compute-limited.
Reward Model Integration
GRPO needs scores for each response:
Option 1: Trained reward model
reward = reward_model(prompt, response) # Scalar output
Option 2: LLM-as-judge
reward = llm_judge(prompt, response, rubric) # Returns score
Option 3: Rule-based rewards (see the sketch after these options)
reward = compute_rule_rewards(response) # Length, format, etc.
Option 4: Outcome-based rewards
reward = verify_outcome(response) # Correct/incorrect for reasoning
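As a concrete sketch of the rule-based option, here is a toy reward with made-up checks and weights; the specific rules are illustrative only:

```python
import re

def compute_rule_rewards(response: str) -> float:
    """Toy rule-based reward: format and length checks only."""
    reward = 0.0
    if re.search(r"\\boxed\{.+\}", response):  # contains a boxed final answer
        reward += 0.5
    if len(response.split()) <= 512:           # stays within a length budget
        reward += 0.3
    if response.strip().endswith("."):         # ends with a complete sentence
        reward += 0.2
    return reward
```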
Handling Long Responses
Generation is expensive. Strategies:
Early stopping: Stop generation if reward model signals low quality.
Response caching: Reuse good responses across training steps.
Importance sampling: Weight older responses by probability ratio.
Memory Optimization
GRPO generates multiple responses, increasing memory:
Gradient accumulation:
optimizer.zero_grad()
for micro_batch in batch.split(micro_batch_size):
    loss = compute_loss(micro_batch)
    (loss / num_micro_batches).backward()  # accumulate scaled gradients
optimizer.step()
Response chunking: Generate and score in chunks, don't store all simultaneously.
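A sketch of response chunking, under the assumption that generation and scoring can be interleaved; generate_groups and score_responses are the same assumed helpers used in the training loop below:

```python
def iter_scored_chunks(policy, reward_model, prompts, group_size=8, chunk_size=4):
    """Yield (prompts, responses, rewards) a few prompts at a time, so the
    full batch of generations never has to sit in memory at once."""
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start:start + chunk_size]
        responses = generate_groups(policy, chunk, group_size)     # assumed helper
        rewards = score_responses(reward_model, chunk, responses)  # assumed helper
        yield chunk, responses, rewards
```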
Training Recipe
Hyperparameters
grpo_config:
  group_size: 8
  clip_range: 0.2
  learning_rate: 1e-6
  batch_size: 512            # Total responses per update
  kl_coefficient: 0.0        # Often not needed
  max_response_length: 1024
  temperature: 0.8           # For diversity in generation
  epochs: 2                  # Typically 1-3
Training Loop
for epoch in range(num_epochs):
    for batch in dataloader:
        prompts = batch['prompts']

        # Generate groups and score them (no gradients needed here)
        with torch.no_grad():
            responses = generate_groups(policy, prompts, group_size)
            rewards = score_responses(reward_model, prompts, responses)

        # Compute group-normalized advantages (a sketch of this helper follows the loop)
        advantages = compute_group_advantages(rewards, group_size)

        # Policy update
        loss = grpo_loss(policy, prompts, responses, advantages)
        optimizer.zero_grad()
        loss.backward()

        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()

        # Logging
        log_metrics(loss, advantages, rewards)
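For completeness, a minimal compute_group_advantages consistent with the loop above, assuming rewards arrive as a flat list ordered group by group:

```python
import numpy as np

def compute_group_advantages(rewards, group_size):
    """Normalize rewards within each consecutive group of `group_size` responses."""
    rewards = np.asarray(rewards, dtype=np.float32).reshape(-1, group_size)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    return ((rewards - mean) / std).reshape(-1)
```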
Monitoring
Track during training (a logging sketch follows this list):
- Mean reward: Should increase
- Reward variance: Should decrease as policy improves
- Advantage distribution: Should center around 0
- Policy entropy: Watch for collapse
- KL from initial policy: Monitor drift
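One way to compute these, assuming the per-response rewards, advantages, and sequence log-probabilities are available as tensors (the function and key names here are illustrative):

```python
import torch

def compute_monitoring_metrics(rewards, advantages, new_log_probs, init_log_probs):
    """Summary statistics for the metrics listed above.

    new_log_probs / init_log_probs: per-response sequence log-probabilities
    under the current policy and the initial (pre-GRPO) policy, for responses
    sampled from the current policy.
    """
    return {
        "reward/mean": rewards.mean().item(),
        "reward/std": rewards.std().item(),
        "advantage/mean": advantages.mean().item(),  # should sit near zero
        "advantage/std": advantages.std().item(),
        # Average negative log-probability of sampled responses: a rough
        # proxy for policy entropy (watch for it collapsing toward zero)
        "policy/neg_log_prob": (-new_log_probs).mean().item(),
        # Monte Carlo estimate of KL(current || initial) over the samples
        "policy/kl_from_init": (new_log_probs - init_log_probs).mean().item(),
    }
```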
DeepSeek's GRPO Implementation
DeepSeek used GRPO to train their reasoning models:
Key Innovations
Outcome-based rewards for reasoning:
def reasoning_reward(prompt, response):
    # Extract final answer
    answer = extract_answer(response)
    # Compare to ground truth
    correct = verify_answer(answer, prompt.expected_answer)
    return 1.0 if correct else 0.0
Process rewards (optional):
def process_reward(prompt, response):
    steps = extract_reasoning_steps(response)
    if not steps:
        return 0.0  # no identifiable steps to score
    step_scores = [verify_step(step) for step in steps]
    return sum(step_scores) / len(step_scores)
Long CoT training: Allow very long responses for complex reasoning and reward only final-answer correctness.
Training Efficiency
DeepSeek reported significant efficiency gains:
- Fewer models in memory (no value function)
- Better sample efficiency (group comparisons)
- More stable training (group normalization)
Advanced Techniques
Iterative GRPO
Run GRPO multiple times with improving reward signals, as sketched in code below:
Iteration 1: Train with basic reward model
Iteration 2: Train reward model on iteration 1 outputs, retrain policy
Iteration 3: Repeat...
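Sketched as a loop, with run_grpo and retrain_reward_model as placeholder names for the two stages (not functions from any specific library):

```python
def iterative_grpo(policy, reward_model, prompts, num_iterations=3):
    for iteration in range(num_iterations):
        # Stage 1: train the policy against the current reward signal
        policy = run_grpo(policy, reward_model, prompts)  # placeholder GRPO run

        # Stage 2: sample fresh outputs from the improved policy and use them
        # (with new labels) to refresh the reward model for the next round
        samples = [policy.generate(p) for p in prompts]
        reward_model = retrain_reward_model(reward_model, prompts, samples)  # placeholder
    return policy, reward_model
```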
Multi-Objective GRPO
Balance multiple reward signals:
def multi_objective_reward(prompt, response):
    helpfulness = helpfulness_model(prompt, response)
    safety = safety_model(prompt, response)
    quality = quality_model(prompt, response)
    # Weighted combination
    return 0.5 * helpfulness + 0.3 * safety + 0.2 * quality
Curriculum Learning
Start with easier prompts, progress to harder:
def get_curriculum_prompts(epoch, all_prompts):
    difficulty_scores = [estimate_difficulty(p) for p in all_prompts]
    threshold = min(1.0, 0.3 + 0.1 * epoch)  # Increase difficulty over time
    return [p for p, d in zip(all_prompts, difficulty_scores) if d <= threshold]
Best-of-N Distillation
Alternative to online GRPO:
- Generate N responses per prompt
- Select best using reward model
- Fine-tune on best responses (SFT)
- Repeat
Simpler than GRPO, but it does not optimize a policy-gradient objective directly.
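A compact sketch of one best-of-N round; sft_finetune stands in for an ordinary supervised fine-tuning step:

```python
def best_of_n_round(policy, reward_model, prompts, n=8):
    sft_pairs = []
    for prompt in prompts:
        # Sample N candidates and keep the one the reward model scores highest
        candidates = [policy.generate(prompt) for _ in range(n)]
        scores = [reward_model(prompt, c) for c in candidates]
        sft_pairs.append((prompt, candidates[scores.index(max(scores))]))

    # Ordinary supervised fine-tuning on the selected responses
    return sft_finetune(policy, sft_pairs)  # placeholder SFT step
```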
When to Use GRPO
Good Fit
- Reasoning tasks: Where correctness is verifiable
- Limited memory: Can't afford PPO's multiple models
- Need online learning: Want to generate during training
- Have reward model: Or can compute rewards automatically
Less Suitable
- Only have preference pairs: DPO is simpler
- Need reference model anyway: PPO might work equally well
- Very limited compute: Best-of-N distillation is simpler
Practical Tips
Reward Model Quality
GRPO is only as good as your rewards:
- Train a strong reward model first
- Validate on held-out preferences
- Watch for reward hacking
Generation Diversity
Groups need diverse responses:
- Use temperature > 0 (0.7-1.0)
- Consider top-p sampling
- Vary prompting slightly
Gradient Stability
Large groups can cause gradient issues:
- Clip gradients (max_norm=1.0)
- Use learning rate warmup
- Monitor for NaN/Inf
Evaluation
Don't just track training metrics:
- Evaluate on held-out prompts
- Check capability retention
- Human evaluation periodically
Conclusion
GRPO offers a compelling middle ground in preference optimization—simpler than PPO, more flexible than DPO, and well-suited for tasks with computable rewards. DeepSeek's success demonstrates its effectiveness for reasoning tasks.
The key insight—comparing within groups rather than to a reference—eliminates infrastructure complexity while maintaining training signal quality. For teams looking to move beyond DPO without the full complexity of PPO, GRPO is worth serious consideration.