
GRPO: Group Relative Policy Optimization Explained

Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.


The Evolution of Preference Optimization

The journey from RLHF to simpler alternatives:

Code
PPO (RLHF)     →  Complex, effective, hard to tune
    ↓
DPO            →  Simpler, no reward model, competitive
    ↓
IPO            →  Addresses DPO overfitting
    ↓
GRPO           →  Group-based, no reference model needed, efficient

GRPO (Group Relative Policy Optimization) represents the latest evolution—offering the benefits of preference optimization without the complexity of PPO or the reference model requirement of DPO.

What Makes GRPO Different

The Key Insight

Traditional methods compare responses to a reference policy or reward model. GRPO takes a different approach: compare responses within a group.

DPO approach:

Code
For each preference pair (chosen, rejected):
- Compare policy probability vs. reference model probability
- Optimize to increase chosen, decrease rejected, relative to reference

GRPO approach:

Code
For each prompt, generate multiple responses:
- Score all responses (using reward model or other signal)
- Normalize scores within the group
- Optimize policy relative to group statistics

Why Group-Based?

Eliminates reference model: DPO requires keeping a frozen reference model in memory. GRPO compares within the current batch, eliminating this requirement.

Better credit assignment: Instead of binary chosen/rejected, GRPO uses relative rankings within groups, providing richer training signal.

More stable optimization: Group normalization reduces variance and makes training more stable.

The GRPO Algorithm

Step-by-Step

The GRPO training loop is conceptually simple, though implementation requires care around memory management and numerical stability.

  1. Sample prompts from training distribution
  2. Generate multiple responses per prompt (group size G, typically 4-16)
  3. Score responses using reward model or other metric
  4. Compute group statistics (mean, std) for each prompt's responses
  5. Normalize scores within groups → advantages
  6. Update policy to increase probability of high-advantage responses

Why this works intuitively: By generating multiple responses to the same prompt and comparing them, we create a natural ranking. The model learns "for this type of question, responses with these characteristics score higher." The group normalization means we're always asking "is this response better or worse than what we typically generate for this prompt?"—a relative question rather than an absolute one.

The computational tradeoff: Generating G responses per prompt means G times the inference cost. But we extract G training signals from each prompt, making training more sample-efficient. For expensive reward models (like LLM-as-judge), this amortizes the reward computation cost across multiple policy updates.

Mathematical Formulation

Advantage computation: $A(y) = \frac{r(y) - \mu_g}{\sigma_g}$

where $A(y)$ is the advantage of response $y$, $r(y)$ is its reward, $\mu_g$ is the mean reward of group $g$, and $\sigma_g$ is the group's standard deviation.

Policy gradient: $\nabla \mathcal{L} = \mathbb{E}\left[A(y) \cdot \nabla \log \pi(y|x)\right]$

With clipping (similar to PPO): $\mathcal{L}_{\text{policy}} = \mathbb{E}\left[\min\left(\rho \cdot A,\ \text{clip}(\rho, 1-\epsilon, 1+\epsilon) \cdot A\right)\right]$

where $\rho = \frac{\pi(y|x)}{\pi_{\text{old}}(y|x)}$ is the probability ratio and $\epsilon$ is the clip range (e.g., 0.2).

KL regularization (optional): $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{policy}} - \beta \cdot D_{\text{KL}}(\pi \| \pi_{\text{old}})$

Understanding the math in plain language:

The advantage is simply "how much better is this response than average for this group?" Positive advantage means better than average; negative means worse. The standard deviation normalization ensures advantages are on a consistent scale across different prompts (some prompts might have high variance in response quality, others low).
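
To make this concrete, here is a tiny worked example with made-up reward values (not taken from the article):

Python
import numpy as np

# Hypothetical rewards for one prompt's group of 4 responses
rewards = np.array([0.9, 0.4, 0.6, 0.1])

mu = rewards.mean()            # 0.5
sigma = rewards.std() + 1e-8   # ~0.2915

advantages = (rewards - mu) / sigma
print(advantages.round(2))     # [ 1.37 -0.34  0.34 -1.37]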

The policy gradient says: increase the probability of generating responses with positive advantage, decrease the probability of responses with negative advantage. The gradient scales with the advantage magnitude—much better responses get stronger positive updates.

The clipping (from PPO) prevents any single update from changing the policy too drastically. If the probability ratio ρ gets too far from 1, we clip it. This adds stability: we don't let one surprisingly good or bad example dominate training.

The KL regularization is optional but often helpful. It penalizes the policy for drifting too far from its starting point, preventing mode collapse and maintaining diversity. The β parameter controls the strength of this regularization.

Pseudocode

Python
import numpy as np

def grpo_update(policy, prompts, reward_model, optimizer, group_size=8):
    all_advantages = []
    all_responses = []
    all_prompts = []

    for prompt in prompts:
        # Generate a group of responses by sampling from the current policy
        responses = [policy.generate(prompt) for _ in range(group_size)]

        # Score each response with the reward model
        rewards = [reward_model(prompt, r) for r in responses]

        # Compute group-normalized advantages (the 1e-8 avoids division by zero)
        mean_reward = np.mean(rewards)
        std_reward = np.std(rewards) + 1e-8
        advantages = [(r - mean_reward) / std_reward for r in rewards]

        all_advantages.extend(advantages)
        all_responses.extend(responses)
        all_prompts.extend([prompt] * group_size)

    # Policy gradient update with the PPO-style clipped objective
    loss = compute_clipped_policy_loss(
        policy, all_prompts, all_responses, all_advantages
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Walking through the pseudocode:

The outer loop iterates through prompts. For each prompt, we generate group_size different responses by sampling from the current policy. This is where the "group" comes from—multiple attempts at the same question.

The reward model scores each response. This could be a learned reward model, an LLM judge, or even a simple rule-based metric (like correctness for math problems). The key requirement: the reward should be a scalar that indicates response quality.

The group normalization ((r - mean_reward) / std_reward) converts raw rewards to advantages. The + 1e-8 prevents division by zero when all responses have identical rewards (rare but possible).

We accumulate data across all prompts before the gradient update. This is important: we want each update to see diverse prompts, not just variations of a single one.

The compute_clipped_policy_loss function (not shown) implements the PPO-style clipped objective. It computes the log probability of each response under the current policy, compares to a stored old probability, and applies clipping and advantage weighting.
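
For readers who want to see what that function could look like, here is a minimal PyTorch sketch. It assumes the policy wrapper exposes a log_prob(prompts, responses) helper returning one summed log-probability per response, and that old_log_probs were recorded at generation time (an argument the pseudocode above doesn't pass explicitly); both are assumptions for illustration, not part of the article's pseudocode.

Python
import torch

def compute_clipped_policy_loss(policy, prompts, responses, advantages,
                                old_log_probs, clip_range=0.2):
    # Sequence log-probabilities under the current policy; assumes the policy
    # wrapper exposes a log_prob() helper that sums token log-probs per response.
    new_log_probs = policy.log_prob(prompts, responses)

    advantages = torch.as_tensor(advantages, dtype=new_log_probs.dtype,
                                 device=new_log_probs.device)
    old_log_probs = torch.as_tensor(old_log_probs, dtype=new_log_probs.dtype,
                                    device=new_log_probs.device)

    # Probability ratio rho = pi(y|x) / pi_old(y|x), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs)

    # PPO-style clipped surrogate: take the more pessimistic of the two terms,
    # then negate because we minimize a loss
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()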

Comparing GRPO to Alternatives

GRPO vs. PPO (RLHF)

| Aspect | PPO | GRPO |
|---|---|---|
| Reward model | Required | Required |
| Value model | Required | Not required |
| Reference model | Required for KL | Optional |
| Memory usage | High (3+ models) | Lower (2 models) |
| Stability | Requires careful tuning | More stable |
| Sample efficiency | Lower | Higher (multiple per prompt) |

GRPO vs. DPO

| Aspect | DPO | GRPO |
|---|---|---|
| Data format | Pairwise preferences | Responses + scores |
| Reference model | Required | Not required |
| Online/Offline | Offline (fixed data) | Online (generates during training) |
| Signal richness | Binary (better/worse) | Continuous (relative advantages) |
| Training | Simpler | Slightly more complex |

GRPO vs. REINFORCE

GRPO can be seen as a variance-reduced REINFORCE:

REINFORCE: $\nabla \mathcal{L} = \mathbb{E}\left[(r - b) \cdot \nabla \log \pi(y|x)\right]$, where $b$ is a fixed or learned baseline.

GRPO: $\nabla \mathcal{L} = \mathbb{E}\left[\frac{r - \mu_g}{\sigma_g} \cdot \nabla \log \pi(y|x)\right]$

The group normalization acts as an adaptive baseline, reducing variance.

Implementation Details

Group Size Selection

Larger groups = lower variance, higher compute:

| Group Size | Variance | Compute | Recommendation |
|---|---|---|---|
| 2 | High | Low | Not recommended |
| 4 | Medium | Medium | Minimum viable |
| 8 | Low | High | Good default |
| 16 | Very low | Very high | Maximum quality |

Practical choice: Start with 8, reduce if compute-limited.

Reward Model Integration

GRPO needs scores for each response:

Option 1: Trained reward model

Python
reward = reward_model(prompt, response)  # Scalar output

Option 2: LLM-as-judge

Python
reward = llm_judge(prompt, response, rubric)  # Returns score

Option 3: Rule-based rewards

Python
reward = compute_rule_rewards(response)  # Length, format, etc.

Option 4: Outcome-based rewards

Python
reward = verify_outcome(response)  # Correct/incorrect for reasoning

Handling Long Responses

Generation is expensive. Strategies:

Early stopping: Stop generation if reward model signals low quality.

Response caching: Reuse good responses across training steps.

Importance sampling: Weight older responses by probability ratio.
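
A minimal sketch of the importance-sampling idea, assuming each cached item stores the log-probability it was generated with and the policy exposes a log_prob helper (both assumptions for illustration):

Python
import torch

def reweighted_loss_from_cache(policy, cached_items):
    losses = []
    for item in cached_items:  # each item: prompt, response, advantage, old_log_prob
        new_log_prob = policy.log_prob(item["prompt"], item["response"])
        # Importance weight: likelihood under the current policy relative
        # to the (older) policy that generated the response
        ratio = torch.exp(new_log_prob - item["old_log_prob"])
        losses.append(-(ratio * item["advantage"]))
    return torch.stack(losses).mean()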

Memory Optimization

GRPO generates multiple responses, increasing memory:

Gradient accumulation:

Python
micro_batches = batch.split(micro_batch_size)
for micro_batch in micro_batches:
    loss = compute_loss(micro_batch)
    # Scale so the accumulated gradient averages over the full batch
    (loss / len(micro_batches)).backward()
optimizer.step()
optimizer.zero_grad()

Response chunking: Generate and score in chunks, don't store all simultaneously.
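
A sketch of the chunked variant, using the same generate_groups and score_responses helper names that appear in the training loop below (the helpers themselves are assumed):

Python
def generate_and_score_in_chunks(policy, reward_model, prompts, group_size, chunk_size=4):
    all_responses, all_rewards = [], []
    # Process a few prompts at a time so the full batch of responses
    # never has to sit in memory at once
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start:start + chunk_size]
        responses = generate_groups(policy, chunk, group_size)
        rewards = score_responses(reward_model, chunk, responses)
        all_responses.extend(responses)
        all_rewards.extend(rewards)
    return all_responses, all_rewards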

Training Recipe

Hyperparameters

YAML
grpo_config:
  group_size: 8
  clip_range: 0.2
  learning_rate: 1e-6
  batch_size: 512  # Total responses per update
  kl_coefficient: 0.0  # Often not needed
  max_response_length: 1024
  temperature: 0.8  # For diversity in generation
  epochs: 1-3

Training Loop

Python
for epoch in range(num_epochs):
    for batch in dataloader:
        prompts = batch['prompts']

        # Generate groups
        with torch.no_grad():
            responses = generate_groups(policy, prompts, group_size)
            rewards = score_responses(reward_model, prompts, responses)

        # Compute advantages
        advantages = compute_group_advantages(rewards, group_size)

        # Policy update
        loss = grpo_loss(policy, prompts, responses, advantages)
        optimizer.zero_grad()
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)

        optimizer.step()
        scheduler.step()

        # Logging
        log_metrics(loss, advantages, rewards)

Monitoring

Track during training (a small logging sketch follows this list):

  • Mean reward: Should increase
  • Reward variance: Should decrease as policy improves
  • Advantage distribution: Should center around 0
  • Policy entropy: Watch for collapse
  • KL from initial policy: Monitor drift
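
A minimal sketch of how these statistics might be assembled each step; the entropy and KL values are assumed to come from whatever your policy or trainer already exposes:

Python
import numpy as np

def compute_training_metrics(rewards, advantages, policy_entropy, kl_from_init):
    return {
        "reward/mean": float(np.mean(rewards)),        # should trend upward
        "reward/std": float(np.std(rewards)),          # should shrink as the policy improves
        "advantage/mean": float(np.mean(advantages)),  # stays near 0 by construction
        "policy/entropy": float(policy_entropy),       # watch for collapse toward 0
        "policy/kl_from_init": float(kl_from_init),    # monitor drift from the starting policy
    }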

DeepSeek's GRPO Implementation

DeepSeek used GRPO to train their reasoning models:

Key Innovations

Outcome-based rewards for reasoning:

Python
def reasoning_reward(prompt, response):
    # Extract final answer
    answer = extract_answer(response)
    # Compare to ground truth
    correct = verify_answer(answer, prompt.expected_answer)
    return 1.0 if correct else 0.0

Process rewards (optional):

Python
def process_reward(prompt, response):
    steps = extract_reasoning_steps(response)
    if not steps:
        return 0.0  # no identifiable steps to score
    step_scores = [verify_step(step) for step in steps]
    return sum(step_scores) / len(step_scores)

Long CoT training: Allow very long responses for complex reasoning, reward only the final answer correctness.

Training Efficiency

DeepSeek reported significant efficiency gains:

  • Fewer models in memory (no value function)
  • Better sample efficiency (group comparisons)
  • More stable training (group normalization)

Advanced Techniques

Iterative GRPO

Run GRPO multiple times with improving reward signals:

Code
Iteration 1: Train with basic reward model
Iteration 2: Train reward model on iteration 1 outputs, retrain policy
Iteration 3: Repeat...
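
A rough sketch of that loop; run_grpo, collect_outputs, and train_reward_model are hypothetical stand-ins for your own training and labeling pipeline:

Python
def iterative_grpo(policy, reward_model, prompts, num_iterations=3):
    for _ in range(num_iterations):
        # Improve the policy against the current reward signal
        policy = run_grpo(policy, reward_model, prompts)

        # Gather fresh outputs from the improved policy, label them,
        # and use them to refresh the reward model for the next round
        outputs = collect_outputs(policy, prompts)
        reward_model = train_reward_model(reward_model, outputs)
    return policy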

Multi-Objective GRPO

Balance multiple reward signals:

Python
def multi_objective_reward(prompt, response):
    helpfulness = helpfulness_model(prompt, response)
    safety = safety_model(prompt, response)
    quality = quality_model(prompt, response)

    # Weighted combination
    return 0.5 * helpfulness + 0.3 * safety + 0.2 * quality

Curriculum Learning

Start with easier prompts, progress to harder:

Python
def get_curriculum_prompts(epoch, all_prompts):
    difficulty_scores = [estimate_difficulty(p) for p in all_prompts]
    threshold = min(1.0, 0.3 + 0.1 * epoch)  # Increase over time
    return [p for p, d in zip(all_prompts, difficulty_scores) if d <= threshold]

Best-of-N Distillation

Alternative to online GRPO:

  1. Generate N responses per prompt
  2. Select best using reward model
  3. Fine-tune on best responses (SFT)
  4. Repeat

Simpler than GRPO but doesn't optimize the policy gradient directly.
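
A compact sketch of that loop, with sample_n, reward_model, and sft_finetune as hypothetical helpers:

Python
def best_of_n_distillation(policy, reward_model, prompts, n=8, rounds=2):
    for _ in range(rounds):
        dataset = []
        for prompt in prompts:
            # Generate N candidates and keep only the highest-scoring one
            candidates = sample_n(policy, prompt, n)
            scores = [reward_model(prompt, c) for c in candidates]
            dataset.append((prompt, candidates[scores.index(max(scores))]))

        # Standard supervised fine-tuning on the selected responses
        policy = sft_finetune(policy, dataset)
    return policy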

When to Use GRPO

Good Fit

  • Reasoning tasks: Where correctness is verifiable
  • Limited memory: Can't afford PPO's multiple models
  • Need online learning: Want to generate during training
  • Have reward model: Or can compute rewards automatically

Less Suitable

  • Only have preference pairs: DPO is simpler
  • Need reference model anyway: PPO might work equally well
  • Very limited compute: Best-of-N distillation is simpler

Practical Tips

Reward Model Quality

GRPO is only as good as your rewards:

  • Train a strong reward model first
  • Validate on held-out preferences
  • Watch for reward hacking

Generation Diversity

Groups need diverse responses (see the sampling sketch after this list):

  • Use temperature > 0 (0.7-1.0)
  • Consider top-p sampling
  • Vary prompting slightly
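
For example, with a Hugging Face-style generate call (the sampling parameters below are standard transformers arguments; the model and tokenizer objects are assumed):

Python
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.8,          # 0.7-1.0 keeps the group diverse
    top_p=0.95,               # nucleus sampling trims the low-probability tail
    num_return_sequences=8,   # one call yields the whole group
    max_new_tokens=1024,
)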

Gradient Stability

Large groups can cause gradient issues:

  • Clip gradients (max_norm=1.0)
  • Use learning rate warmup
  • Monitor for NaN/Inf

Evaluation

Don't just track training metrics:

  • Evaluate on held-out prompts
  • Check capability retention
  • Human evaluation periodically

Conclusion

GRPO offers a compelling middle ground in preference optimization—simpler than PPO, more flexible than DPO, and well-suited for tasks with computable rewards. DeepSeek's success demonstrates its effectiveness for reasoning tasks.

The key insight—comparing within groups rather than to a reference—eliminates infrastructure complexity while maintaining training signal quality. For teams looking to move beyond DPO without the full complexity of PPO, GRPO is worth serious consideration.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
