
SFT and RLHF: The Complete Guide to Post-Training LLMs

A deep dive into Supervised Fine-Tuning and Reinforcement Learning from Human Feedback—the techniques that transform base models into useful assistants.


The Post-Training Stack

Pre-trained LLMs are impressive but not immediately useful. They complete text, but they don't follow instructions, refuse harmful requests, or behave helpfully. Post-training transforms these base models into the assistants we use daily.

The modern post-training pipeline:

Code
[Base Model] → Pre-training on internet text
      ↓
[SFT Model] → Supervised Fine-Tuning on demonstrations
      ↓
[RLHF Model] → Reinforcement Learning from Human Feedback
      ↓
[Production Model] → Safety fine-tuning, capability elicitation

This post explains each stage in depth, with practical guidance for implementing these techniques.

Supervised Fine-Tuning (SFT)

What SFT Does

SFT teaches the model to follow instructions by training on examples of desired behavior. Input: instruction + context. Output: ideal response.

The fundamental transformation SFT creates: Base models are trained on raw internet text to predict the next token. They're excellent at continuing text in plausible ways, but they don't understand they should answer questions rather than continue them. A base model sees "What is 2+2?" and thinks "this looks like a math worksheet, I should generate more math questions." SFT rewires this: by training on thousands of instruction→response pairs, the model learns that its job is to respond helpfully, not to continue text.

Why SFT alone creates useful assistants: After SFT, a model can follow instructions, answer questions, write code, and engage in conversation. This is sufficient for many use cases. The original ChatGPT was essentially a GPT-3.5-class base model plus SFT and RLHF on top. SFT is the workhorse transformation that creates usable AI assistants from raw language models.

Before SFT (base model):

Code
Input: "Write a haiku about machine learning"
Output: "Write a haiku about machine learning applications
        in healthcare. Discuss the benefits and challenges..."
        [continues text as if completing a document]

After SFT:

Code
Input: "Write a haiku about machine learning"
Output: "Data flows like streams
        Neural networks learn and grow
        Patterns emerge clear"

SFT Data

Quality data is everything. Before looking at the main types of SFT data, two points on why data matters so much:

Why data quality dominates model quality: The model can only learn behaviors present in the training data. If your SFT data contains verbose, meandering responses, your model will be verbose and meandering. If it contains concise, focused responses, your model will be concise and focused. This seems obvious but has profound implications: the best way to improve your model is usually to improve your data, not to tweak training hyperparameters.

The "demonstration distribution" problem: Your model will learn to produce responses that look like your demonstrations. If all your examples are formal, the model will be formal even when informality is appropriate. If all examples are long, the model will pad short answers. Diversity in your SFT data directly translates to flexibility in your model.

Instruction-following demonstrations:

JSON
{
  "instruction": "Summarize this article in 3 bullet points",
  "input": "[article text]",
  "output": "• Key point one\n• Key point two\n• Key point three"
}

Conversational demonstrations:

JSON
{
  "messages": [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's its population?"},
    {"role": "assistant", "content": "Paris has a population of approximately 2.1 million people in the city proper, and about 12 million in the greater metropolitan area."}
  ]
}

Capability demonstrations:

JSON
{
  "instruction": "Solve this math problem step by step",
  "input": "If a train travels at 60 mph for 2.5 hours, how far does it travel?",
  "output": "Let me solve this step by step:\n\n1. Distance = Speed × Time\n2. Speed = 60 mph\n3. Time = 2.5 hours\n4. Distance = 60 × 2.5 = 150 miles\n\nThe train travels 150 miles."
}

SFT Data Collection

Human annotation: Expert annotators write ideal responses. High quality but expensive ($5-20 per example).

Distillation from stronger models: Use GPT-4 or Claude to generate training data for smaller models. Common and effective.

User interaction data: Filter production conversations for high-quality examples. Requires quality signals.

Synthetic generation: Generate instructions programmatically, use LLM for responses, filter for quality.
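
As a rough illustration of distillation-based data generation, the sketch below produces candidate responses with a stronger model and keeps only those that pass simple quality gates. The generate_fn callable is a placeholder for whatever teacher model or API you use; real pipelines add deduplication and LLM-as-judge filtering on top.

Python
def distill_sft_data(generate_fn, instructions, min_chars=20):
    """Generate candidate responses with a stronger model and apply simple quality gates."""
    sft_examples = []
    for instruction in instructions:
        response = generate_fn(instruction)  # teacher model call (assumed interface)
        # Cheap quality gates; in practice add dedup, judge scoring, and task-type balancing
        if len(response.strip()) >= min_chars:
            sft_examples.append({
                "instruction": instruction,
                "input": "",
                "output": response.strip(),
            })
    return sft_examples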

SFT Training

Standard approach:

  • Learning rate: 1e-5 to 5e-5
  • Epochs: 1-3 (watch for overfitting)
  • Batch size: 32-128 (larger is more stable)
  • Warmup: 3-10% of training steps
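
A minimal sketch of the core SFT loss computation, assuming a Hugging Face-style causal LM and tokenizer: the prompt tokens are masked with -100 so cross-entropy is computed only on the response tokens.

Python
import torch

def sft_loss(model, tokenizer, instruction, response, device="cuda"):
    """Cross-entropy on response tokens only; prompt tokens are masked out of the loss."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt", add_special_tokens=False).input_ids

    input_ids = torch.cat([prompt_ids, response_ids], dim=1).to(device)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore index: no loss on prompt tokens

    outputs = model(input_ids=input_ids, labels=labels)
    return outputs.loss  # next-token cross-entropy over the response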

LoRA approach (parameter-efficient):

  • LoRA rank: 8-64 (higher for more capacity)
  • LoRA alpha: 2× rank
  • Apply to: Q, K, V, O projections at minimum; all linear layers for best results
  • Learning rate: 1e-4 to 5e-4 (higher than full fine-tuning)
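
One way to apply these LoRA settings in practice is with the peft library; a sketch follows, where the target module names are typical for Llama-style models and may differ for other architectures (base_model is assumed to be an already-loaded causal LM).

Python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                        # LoRA rank
    lora_alpha=32,               # 2x rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Q/K/V/O projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters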

Training dynamics:

Code
Epoch 1: Model learns instruction format
Epoch 2: Quality improves, more consistent outputs
Epoch 3+: Risk of overfitting, monitor validation loss

SFT Best Practices

  1. Diverse instructions: Cover many task types, phrasings, complexity levels
  2. Quality over quantity: 1000 excellent examples > 10000 mediocre ones
  3. Response style consistency: All examples should have consistent tone, format, style
  4. Include edge cases: What should the model do with ambiguous instructions?
  5. Balance task types: Don't over-represent any single capability

Reinforcement Learning from Human Feedback (RLHF)

Why RLHF?

SFT has limitations:

  • Can only match the quality of demonstrations
  • Doesn't learn preferences between acceptable responses
  • Harder to encode subtle quality differences

RLHF addresses this by training on preferences—which response is better—rather than single demonstrations.

The key insight behind RLHF: It's easier for humans to compare two responses than to write a perfect response. If you ask someone "write an ideal explanation of quantum computing for a 10-year-old," they'll struggle. But if you show them two explanations and ask "which is better for a 10-year-old?", they can easily judge. RLHF exploits this asymmetry: collect comparisons (easy for humans), train a reward model to predict comparisons (ML), then optimize the LLM to score highly on the reward model (RL).

What RLHF captures that SFT can't: SFT teaches "this is a good response." RLHF teaches "this response is better than that one" across a spectrum of quality. This comparative signal enables the model to learn subtle quality distinctions: more accurate, more helpful, safer, more appropriate tone. The model doesn't just learn what good looks like—it learns to discriminate between degrees of goodness.

The RLHF Pipeline

Code
Step 1: Collect preference data
        [Prompt] → [Response A] vs [Response B] → Human labels "A is better"

Step 2: Train reward model
        Reward model learns to predict which response humans prefer

Step 3: RL fine-tuning
        Policy (LLM) optimizes to produce responses that maximize reward

Preference Data Collection

Pairwise comparisons: Show annotators two responses, ask which is better.

Code
Prompt: "Explain quantum computing simply"

Response A: "Quantum computing uses qubits that can be 0 and 1 simultaneously..."
Response B: "Unlike regular computers that use bits (0 or 1), quantum computers..."

Annotator choice: B (clearer, more accessible)

Ranking: Show multiple responses, rank from best to worst.

Code
Prompt: [same]
Responses: [A, B, C, D]
Ranking: B > D > A > C

Likert ratings: Rate each response independently (1-5 scale).

Code
Response A: Helpfulness 4/5, Accuracy 5/5, Clarity 3/5
Response B: Helpfulness 5/5, Accuracy 4/5, Clarity 5/5
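
Rankings and ratings are usually expanded into pairwise comparisons before reward model training. A small sketch of that conversion for a best-to-worst ranking:

Python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (chosen, rejected) preference pairs."""
    pairs = []
    for better, worse in combinations(ranked_responses, 2):
        # combinations preserves order, so `better` is ranked above `worse`
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs

# The ranking B > D > A > C yields 6 pairs: (B,D), (B,A), (B,C), (D,A), (D,C), (A,C)
pairs = ranking_to_pairs("Explain quantum computing simply", ["B", "D", "A", "C"])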

Reward Model Training

The reward model learns to predict human preferences.

Why the reward model is the bottleneck of RLHF: The reward model defines what "good" means for the RL phase. If the reward model has blind spots, the policy will exploit them. If the reward model prefers longer responses regardless of quality, the policy will learn to be verbose. If the reward model can't distinguish subtle quality differences, the policy won't learn them. The quality ceiling of your RLHF model is set by your reward model.

The reward model as a human preference simulator: Think of it this way: you can't have a human judge every response during RL training (millions of responses). So you train a model to simulate human judgment. The RL phase then optimizes against this simulation. This works when the simulation is accurate, but can fail when the reward model is overconfident about edge cases it's never seen.

Architecture: Usually the same architecture as the LLM, with a scalar output head.

Training objective:

\mathcal{L} = -\log\sigma\bigl(r(y_w) - r(y_l)\bigr)

where $r(y)$ is the reward model's score for response $y$, $\sigma$ is the sigmoid function, $y_w$ is the human-preferred (chosen) response, and $y_l$ is the human-dispreferred (rejected) response.
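
A minimal sketch of this objective in PyTorch, assuming the reward model returns one scalar per (prompt, response) pair:

Python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    """Pairwise Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    r_chosen = reward_model(prompts, chosen)      # shape: (batch_size,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch_size,)
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    accuracy = (r_chosen > r_rejected).float().mean()  # track on held-out comparisons
    return loss, accuracy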

Training considerations:

  • Same base model as policy works well
  • Can use LoRA for efficiency
  • Need 10K-100K comparisons for good reward model
  • Validate on held-out comparisons

RL Optimization with PPO

Why PPO for RLHF?

Proximal Policy Optimization (PPO) became the standard for RLHF because it balances sample efficiency with stability. Unlike simpler policy gradient methods, PPO prevents catastrophically large updates that could destroy the model's capabilities.

The Core PPO Objective:

PPO uses a clipped surrogate objective to limit policy changes:

Code
L_CLIP = E[min(r(θ) × A, clip(r(θ), 1-ε, 1+ε) × A)]

Where:
- r(θ) = π(a|s) / π_old(a|s)  (probability ratio)
- A = advantage estimate
- ε = clip range (typically 0.2)

Intuition: The clipping ensures that if an action's probability changes too much (ratio far from 1), we cap its contribution to the gradient. This prevents the policy from changing too drastically in a single update.

The Full RLHF Objective:

For RLHF, we combine PPO with a KL penalty to stay close to the reference (SFT) model:

Code
Objective = E[reward - β × KL(policy || reference)]

Expanded:
reward_total = reward_model(response) - β × log(π(response) / π_ref(response))

The Four-Model Setup:

RLHF with PPO requires managing four models simultaneously:

Code
1. Policy Model (Actor): The LLM being trained
2. Reference Model: Frozen copy of SFT model for KL computation
3. Reward Model: Scores response quality
4. Value Model (Critic): Estimates expected future reward

Memory requirement: ~4× model size (can optimize with sharing)

PPO Training Loop in Detail

The PPO training loop has four distinct steps that repeat each iteration. Understanding each step is critical because they depend on each other—getting one wrong cascades through the entire training process.

Step 1: Rollout Generation

First, we generate responses from the current policy and record the log probabilities. These log probs are essential—we'll compare them to log probs after updates to compute probability ratios for the PPO objective. Using do_sample=True ensures exploration; deterministic generation would collapse to always picking the highest-probability token.

Python
def generate_rollouts(policy, prompts, generation_config):
    """Generate responses and compute log probabilities."""
    rollouts = []
    for prompt in prompts:
        # Sample response from current policy
        response = policy.generate(
            prompt,
            temperature=generation_config.temperature,
            max_tokens=generation_config.max_tokens,
            do_sample=True
        )

        # Compute log probability of generated response
        log_prob = compute_sequence_log_prob(policy, prompt, response)

        rollouts.append({
            'prompt': prompt,
            'response': response,
            'log_prob': log_prob
        })
    return rollouts
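
The compute_sequence_log_prob helper used in these sketches isn't defined above; a minimal version for a Hugging Face-style causal LM might look like the following, assuming the prompt and response are already tokenized into id tensors (wrap the call in torch.no_grad() for rollouts and reference computations):

Python
import torch
import torch.nn.functional as F

def compute_sequence_log_prob(model, prompt_ids, response_ids):
    """Sum of log probabilities the model assigns to the response tokens given the prompt."""
    input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)   # (1, seq_len)
    logits = model(input_ids).logits                                 # (1, seq_len, vocab)

    # Logits at position t predict the token at position t+1, so shift by one:
    # predictions for the response tokens live at positions [prompt_len-1, seq_len-2]
    response_logits = logits[0, prompt_ids.shape[0] - 1 : -1, :]     # (response_len, vocab)
    log_probs = F.log_softmax(response_logits, dim=-1)
    token_log_probs = log_probs.gather(1, response_ids.unsqueeze(1)).squeeze(1)
    return token_log_probs.sum()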

Step 2: Reward Computation

Now we score each response with the reward model and apply a KL penalty. The KL penalty is crucial—it prevents the policy from drifting too far from the reference model (the SFT checkpoint). Without it, the model would exploit reward model weaknesses, producing responses that score highly but are actually low quality ("reward hacking").

The formula: total_reward = reward_score - β × KL_divergence. Higher β means more conservative training; lower β allows more exploration but risks instability.

Python
def compute_rewards(rollouts, reward_model, ref_model, kl_coef):
    """Score responses and apply KL penalty."""
    for rollout in rollouts:
        # Get reward model score
        rm_score = reward_model(rollout['prompt'], rollout['response'])

        # Compute KL penalty
        ref_log_prob = compute_sequence_log_prob(
            ref_model, rollout['prompt'], rollout['response']
        )
        kl_penalty = rollout['log_prob'] - ref_log_prob

        # Total reward with KL penalty
        rollout['reward'] = rm_score - kl_coef * kl_penalty
        rollout['kl'] = kl_penalty
        rollout['rm_score'] = rm_score

    return rollouts

Step 3: Advantage Estimation (GAE)

The advantage function tells us "how much better was this action than average?" Raw rewards are noisy—GAE (Generalized Advantage Estimation) smooths this by mixing short-term and long-term estimates.

  • Why not just use rewards? High variance makes training unstable.
  • Why not just use value estimates? They're biased by the value model's errors.
  • GAE solution: Blend them with parameter λ. When λ=1, it's just rewards (high variance). When λ=0, it's just value estimates (high bias). λ=0.95 is a common sweet spot.

The backward pass accumulates advantages from the end of the sequence to the beginning—necessary because each token's advantage depends on future rewards.

Python
def compute_advantages(rollouts, value_model, gamma=1.0, lam=0.95):
    """Compute GAE advantages for each token."""
    for rollout in rollouts:
        values = value_model(rollout['prompt'], rollout['response'])
        # Per-token rewards: in practice the KL penalty is applied per token and the
        # reward model score is added at the final token of the response
        rewards = rollout['token_rewards']

        advantages = []
        gae = 0

        # Backward pass through tokens
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]

            delta = rewards[t] + gamma * next_value - values[t]
            gae = delta + gamma * lam * gae
            advantages.insert(0, gae)

        rollout['advantages'] = advantages
        rollout['returns'] = [a + v for a, v in zip(advantages, values)]

    return rollouts

Step 4: PPO Update

The actual policy optimization. This is where PPO's "proximal" nature matters: we compute a probability ratio (new policy prob / old policy prob) and clip it to prevent drastic updates.

The clipping logic: if an action's probability increases too much (ratio > 1+ε), we cap the gradient contribution. Same if it decreases too much (ratio < 1-ε). This keeps updates conservative, preventing the policy from changing so much that our collected rollouts become invalid.

We also update the value model using MSE loss against the computed returns—this improves advantage estimation for future iterations.

Python
def ppo_update(policy, value_model, rollouts, config):
    """Perform PPO update on policy and value model."""
    optimizer = torch.optim.Adam([
        {'params': policy.parameters(), 'lr': config.policy_lr},
        {'params': value_model.parameters(), 'lr': config.value_lr}
    ])

    for epoch in range(config.ppo_epochs):
        for batch in create_minibatches(rollouts, config.batch_size):
            # Current policy log probs
            new_log_probs = compute_sequence_log_prob(
                policy, batch['prompts'], batch['responses']
            )

            # Probability ratio between new and old policy
            # (a full implementation computes this per token so it aligns with the
            # per-token advantages; shown per sequence here for brevity)
            ratio = torch.exp(new_log_probs - batch['old_log_probs'])

            # Clipped surrogate objective
            advantages = batch['advantages']
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - config.clip_range,
                               1 + config.clip_range) * advantages
            policy_loss = -torch.min(surr1, surr2).mean()

            # Value function loss
            values = value_model(batch['prompts'], batch['responses'])
            value_loss = F.mse_loss(values, batch['returns'])

            # Combined loss
            loss = policy_loss + config.value_coef * value_loss

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), config.max_grad_norm)
            optimizer.step()

PPO Hyperparameters Deep Dive

KL Coefficient (β): 0.01-0.2

Code
β = 0.01: Aggressive optimization, policy can drift far from reference
β = 0.1:  Balanced (common starting point)
β = 0.2:  Conservative, stays close to SFT behavior

Adaptive KL: Some implementations adjust β to hit a target KL:
- If KL > target: increase β
- If KL < target: decrease β
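
A sketch of one common adaptive-KL scheme (proportional control toward a target KL), similar in spirit to what several RLHF libraries implement; the exact update rule and default values vary by implementation:

Python
class AdaptiveKLController:
    """Nudges the KL coefficient so the measured KL tracks a target value."""

    def __init__(self, init_kl_coef=0.1, target_kl=6.0, horizon=10000):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, batch_size):
        # Proportional error, clipped so one bad batch can't swing the coefficient wildly
        error = max(min(current_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.kl_coef *= 1.0 + error * batch_size / self.horizon
        return self.kl_coef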

Clip Range (ε): 0.1-0.3

Code
ε = 0.1: Very conservative updates
ε = 0.2: Standard choice
ε = 0.3: More aggressive, use with caution

Learning Rates:

Code
Policy LR: 1e-6 to 5e-6 (much lower than SFT)
Value LR: 1e-5 to 5e-5 (can be higher than policy)

Why so low? We're fine-tuning a capable model, not training from scratch.
Large updates can catastrophically harm capabilities.

Batch and Epoch Settings:

Code
Rollout batch size: 512-2048 prompts per iteration
PPO epochs: 2-4 updates per rollout batch
Minibatch size: 64-256 for gradient updates

More PPO epochs = more sample efficiency but risk of overfitting to rollouts

Generation Settings:

Code
Temperature: 0.7-1.0 during training (exploration)
Max tokens: Task-dependent, typically 256-1024
Top-p: 0.95 (standard)

Memory Optimization Strategies

Challenge: Four models in memory is expensive.

Strategy 1: Weight Sharing

Python
# Share base weights between policy and reference: train the policy as a LoRA
# adapter on top of the frozen SFT weights, and compute reference log probs
# with the adapter disabled (one possible sketch, using the peft library)
from peft import PeftModel

base = load_base_model()                                  # frozen SFT weights
policy = PeftModel.from_pretrained(base, "policy_lora")   # policy = base + LoRA adapter

# Reference model is the same weights with the adapter turned off
with policy.disable_adapter():
    ref_log_prob = compute_sequence_log_prob(policy, prompt, response)

Strategy 2: Sequential Computation

Python
# Don't keep all models in GPU memory simultaneously
def compute_step(prompts):
    # 1. Generate with policy (policy in GPU)
    responses = generate(policy, prompts)
    policy.to('cpu')

    # 2. Score with reward model (reward model in GPU)
    reward_model.to('cuda')
    rewards = score(reward_model, prompts, responses)
    reward_model.to('cpu')

    # 3. Compute reference log probs (reference in GPU)
    # ... and so on

Strategy 3: Gradient Checkpointing

Python
policy.gradient_checkpointing_enable()
# Trades compute for memory - recomputes activations during backward pass

RLHF Challenges and Mitigations

Reward Hacking: The model finds ways to maximize reward that don't align with intent.

Code
Example 1: Length exploitation
- Reward model flaw: Longer responses score slightly higher
- Result: Model becomes unnecessarily verbose
- Fix: Length normalization, penalize excessive length

Example 2: Sycophancy
- Reward model flaw: Agreeable responses score higher
- Result: Model always agrees with user, even when wrong
- Fix: Include adversarial examples where disagreement is correct

Example 3: Formatting exploits
- Reward model flaw: Bullet points and headers score higher
- Result: Every response uses unnecessary formatting
- Fix: Diverse format examples in preference data

Mitigation strategies:

Python
# Reward model ensemble
rewards = [rm(prompt, response) for rm in reward_models]
final_reward = min(rewards)  # Conservative: use minimum

# Reward clipping
reward = torch.clamp(reward, -clip_value, clip_value)

# Auxiliary losses
loss = ppo_loss + aux_weight * auxiliary_loss  # e.g., perplexity on held-out data

Mode Collapse: Model converges to narrow range of "safe" responses.

Signs:

  • Decreasing response diversity
  • High reward but repetitive outputs
  • Low entropy in generation

Mitigations:

Python
# Entropy bonus
entropy = -torch.sum(probs * torch.log(probs + 1e-8), dim=-1)  # small epsilon avoids log(0)
loss = ppo_loss - entropy_coef * entropy.mean()

# Higher temperature during training
generation_config.temperature = 1.0  # Not 0.7

# KL penalty (built into RLHF objective)
# Prevents straying too far from diverse SFT distribution

Catastrophic Forgetting: Model loses capabilities while optimizing for reward.

Monitor these metrics:

  • Performance on held-out benchmarks (MMLU, etc.)
  • Generation quality on diverse prompts
  • Task completion rates outside reward optimization

Mitigations:

Python
# Mix in SFT data
if random.random() < sft_mix_ratio:
    # Do SFT update instead of PPO
    loss = sft_loss(batch)
else:
    loss = ppo_loss(batch)

# Capability-specific evaluation
for capability in ['math', 'coding', 'reasoning']:
    score = evaluate(policy, capability_benchmark)
    if score < threshold:
        # Add capability data to training mix
        pass

RLHF Debugging Checklist

If reward increases but quality decreases:

  • Reward hacking—inspect high-reward samples manually
  • Check if reward model has obvious exploits
  • Add reward clipping or ensemble

If KL explodes:

  • Learning rate too high
  • Increase KL coefficient (β)
  • Check for numerical instabilities

If training is unstable:

  • Reduce learning rate
  • Increase batch size
  • Add gradient clipping
  • Check value function accuracy

If no learning happens:

  • Verify reward model provides meaningful signal
  • Check that advantages have reasonable variance
  • Ensure log probs are computed correctly

Direct Preference Optimization (DPO)

DPO simplifies RLHF by eliminating the explicit reward model and RL training loop. Understanding why it works requires diving into the mathematics of RLHF.

The Mathematical Foundation

The RLHF Objective:

RLHF optimizes a policy to maximize reward while staying close to a reference policy:

Code
max_π E[r(x, y)] - β × KL(π || πref)

This KL constraint prevents the policy from deviating too far from the SFT model, avoiding reward hacking.

The Key Insight:

The optimal policy for this constrained optimization has a closed-form solution:

Code
π*(y|x) = (1/Z(x)) × πref(y|x) × exp(r(x,y) / β)

Where Z(x) is the partition function (normalizer)

This means: the optimal policy is the reference policy reweighted by exponentiated reward.

The Reparameterization:

We can invert this relationship to express reward in terms of policies:

r(x, y) = \beta \cdot \log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \cdot \log Z(x)

Since Z(x) cancels out when computing preference probabilities (it's the same for both responses), we can substitute this into the Bradley-Terry preference model to get the DPO loss directly.
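
Concretely, the Bradley-Terry model gives the probability that $y_w$ is preferred over $y_l$ as $\sigma(r(x, y_w) - r(x, y_l))$; plugging in the reparameterized reward, the $\beta \log Z(x)$ terms cancel:

p(y_w \succ y_l \mid x) = \sigma\left(\beta \log\frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log\frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)

Maximizing the log-likelihood of the observed preferences under this model yields exactly the DPO loss below.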

DPO Loss Function

\mathcal{L} = -\log\sigma\left(\beta \cdot \left(\log\frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log\frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right)

Intuitive interpretation:

  • log(π(y|x)/πref(y|x)) = how much more likely the policy makes response y compared to reference
  • We want this ratio to be higher for chosen responses than rejected ones
  • The sigmoid converts this difference into a probability
  • We minimize negative log-likelihood of the preference

Gradient behavior:

  • When policy already prefers chosen over rejected: small gradient (already good)
  • When policy prefers rejected: large gradient pushing toward chosen
  • The reference policy acts as an anchor preventing drift

DPO Advantages

  1. Simpler pipeline: No reward model training, no RL infrastructure
  2. More stable: Standard supervised learning with well-understood dynamics
  3. Lower memory: Only need policy and reference model (can share weights with LoRA)
  4. Faster iteration: Single training phase, easier hyperparameter tuning
  5. Deterministic: No sampling variance from RL rollouts

DPO Training in Practice

Data format:

JSON
{
  "prompt": "Explain machine learning",
  "chosen": "Machine learning is a subset of AI that enables systems to learn patterns from data without explicit programming. It works by...",
  "rejected": "ML is when computers learn stuff on their own I guess..."
}

Hyperparameters:

  • β (temperature): 0.1-0.5 (lower β = weaker anchor to the reference model, more aggressive optimization)
  • Learning rate: 5e-7 to 5e-6 (lower than SFT)
  • Epochs: 1-3 (watch for overfitting)
  • Batch size: 32-128 (larger batches help stability)

Implementation example:

Python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps,
             beta=0.1):
    """
    Compute DPO loss for a batch of preferences.

    All inputs are log probabilities of shape (batch_size,)
    """
    # Compute log ratios
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)

    # DPO loss: negative log sigmoid of reward difference
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Metrics for monitoring
    chosen_better = (chosen_rewards > rejected_rewards).float().mean()
    reward_margin = (chosen_rewards - rejected_rewards).mean()

    return losses.mean(), {
        'accuracy': chosen_better.item(),
        'reward_margin': reward_margin.item()
    }

Training loop considerations:

Python
def train_dpo_epoch(model, ref_model, dataloader, optimizer, beta):
    model.train()
    ref_model.eval()  # Reference model is frozen

    for batch in dataloader:
        # Get log probs from both models
        with torch.no_grad():
            ref_chosen_logps = get_sequence_logps(ref_model, batch['chosen'])
            ref_rejected_logps = get_sequence_logps(ref_model, batch['rejected'])

        policy_chosen_logps = get_sequence_logps(model, batch['chosen'])
        policy_rejected_logps = get_sequence_logps(model, batch['rejected'])

        loss, metrics = dpo_loss(
            policy_chosen_logps, policy_rejected_logps,
            ref_chosen_logps, ref_rejected_logps,
            beta=beta
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

DPO Variants

The success of DPO sparked numerous variants addressing its limitations:

IPO (Identity Preference Optimization): Addresses DPO's tendency to overfit by using a different loss:

\mathcal{L} = \left(\log\frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log\frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta}\right)^2

IPO targets a fixed margin rather than pushing preferences infinitely apart. Better for noisy preference data.

KTO (Kahneman-Tversky Optimization): Works with unpaired data—you only need examples labeled "good" or "bad", not paired comparisons:

\mathcal{L}_{\text{good}} = 1 - \sigma\left(\beta \cdot \left(\log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} - z_{\text{ref}}\right)\right)

\mathcal{L}_{\text{bad}} = 1 - \sigma\left(\beta \cdot \left(z_{\text{ref}} - \log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}\right)\right)

where $z_{\text{ref}}$ is a reference point. Useful when paired preference data is hard to collect.

ORPO (Odds Ratio Preference Optimization): Combines SFT and preference optimization in a single objective:

\mathcal{L} = -\log\pi(y_w|x) - \lambda \cdot \log\sigma\left(\log\frac{\text{odds}(y_w)}{\text{odds}(y_l)}\right)

No need for a reference model—the SFT component acts as implicit regularization.

SimPO (Simple Preference Optimization): Removes the reference model entirely by using length-normalized rewards:

\mathcal{L} = -\log\sigma\left(\beta \cdot \left(\frac{r(y_w)}{|y_w|} - \frac{r(y_l)}{|y_l|} - \gamma\right)\right)

where $r(y) = \log\pi(y|x)$ is the policy's log-likelihood of the response, $|y|$ is its length, and $\gamma$ is a target margin. Simpler and often competitive with DPO.

Choosing a variant:

| Method | Best for | Avoids |
|---|---|---|
| DPO | Standard preference tuning | – |
| IPO | Noisy preferences | Overfitting |
| KTO | Unpaired good/bad data | Need for paired comparisons |
| ORPO | Combined SFT + preference | Reference model overhead |
| SimPO | Maximum simplicity | Reference model entirely |

Common DPO Pitfalls

1. Reference model drift: If you update the reference model during training, you lose the anchor. Keep it frozen.

2. Length exploitation: Models may learn to prefer longer or shorter responses based on spurious correlations in data:

Python
# Monitor length statistics
chosen_lengths = [len(x['chosen']) for x in dataset]
rejected_lengths = [len(x['rejected']) for x in dataset]
# If significantly different, add length normalization

3. Catastrophic forgetting: Aggressive DPO can harm capabilities learned during SFT:

  • Use lower learning rates than SFT
  • Mix in SFT data (10-20% of batches)
  • Increase β to stay closer to reference

4. Preference data quality: DPO amplifies patterns in your data—including spurious ones:

  • Validate preference data manually
  • Ensure annotator agreement
  • Balance across task types

5. β tuning:

Code
β too low (0.01): Aggressive optimization, may diverge from reference
β too high (1.0): Tightly anchored to the reference, little movement toward preferences
Sweet spot: Usually 0.1-0.3

DPO vs. RLHF Comparison

| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Complexity | High (reward model + RL) | Low (single supervised phase) |
| Stability | Requires careful tuning | Generally stable |
| Compute | 4 models in memory | 2 models in memory |
| Performance | State-of-the-art ceiling | Competitive, sometimes equal |
| Flexibility | Arbitrary reward signals | Pairwise preferences only |
| Iteration speed | Slow (RL rollouts) | Fast (supervised batches) |
| Failure modes | Reward hacking, mode collapse | Length bias, forgetting |

When to use RLHF:

  • Complex, multi-objective reward functions
  • Iterative preference learning with online data
  • Maximum capability extraction worth the engineering cost
  • You have RL infrastructure already

When to use DPO:

  • Standard preference alignment
  • Limited compute or engineering resources
  • Rapid iteration on alignment experiments
  • Good enough performance is acceptable

Constitutional AI and RLAIF

The Labeling Bottleneck

Human labeling is expensive and doesn't scale. Constitutional AI uses AI feedback instead:

RLAIF (RL from AI Feedback)

Replace human labelers with LLMs:

Code
1. Generate response pairs
2. Ask LLM to judge which is better (with principles)
3. Train reward model on AI judgments
4. Run RL as usual

Principle-based judging:

Code
Evaluate which response better follows these principles:
1. Be helpful and informative
2. Be harmless and avoid dangerous content
3. Be honest and acknowledge uncertainty

Response A: [text]
Response B: [text]

Which response better follows the principles?
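
A sketch of how AI judgments might be turned into preference data; call_llm is a placeholder for whatever judge model you use, and the prompt and parsing format are illustrative rather than a specific API. In practice you would also randomize the A/B order to reduce position bias.

Python
JUDGE_TEMPLATE = """Evaluate which response better follows these principles:
1. Be helpful and informative
2. Be harmless and avoid dangerous content
3. Be honest and acknowledge uncertainty

Response A: {a}
Response B: {b}

Answer with exactly one letter, A or B."""

def collect_ai_preference(call_llm, prompt, response_a, response_b):
    """Label one response pair with an AI judge (call_llm is an assumed callable)."""
    verdict = call_llm(JUDGE_TEMPLATE.format(a=response_a, b=response_b)).strip().upper()
    if verdict not in ("A", "B"):
        return None  # discard judgments that can't be parsed
    chosen, rejected = (response_a, response_b) if verdict == "A" else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}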

Constitutional AI

Train model to self-critique and revise:

Stage 1: Supervised self-critique

Code
Initial response: [potentially problematic response]
Critique: "This response could be harmful because..."
Revision: [improved response]

Stage 2: RL with AI feedback Use LLM to generate preference labels based on principles.

Advantages

  • Scales better than human labeling
  • Consistent application of principles
  • Can cover more edge cases
  • Enables rapid iteration

Limitations

  • AI judges have biases
  • May not capture nuanced human preferences
  • Need diverse prompts to avoid mode collapse

Practical Implementation Guide

Choosing Your Approach

Start with SFT if:

  • You have quality demonstration data
  • Task is well-defined
  • Budget is limited

Add RLHF/DPO if:

  • SFT plateau reached
  • Subtle quality improvements needed
  • Have preference data or can collect it

Data Pipeline

Python
class PostTrainingPipeline:
    def prepare_sft_data(self, raw_data):
        # Format as instruction-response pairs
        # Filter for quality
        # Deduplicate
        # Balance across task types
        return formatted_data

    def prepare_preference_data(self, sft_model, prompts):
        # Generate multiple responses per prompt
        # Collect preferences (human or AI)
        # Format as chosen/rejected pairs
        return preference_data

    def train_sft(self, base_model, sft_data):
        # Standard fine-tuning
        return sft_model

    def train_dpo(self, sft_model, preference_data):
        # DPO optimization
        return aligned_model

Evaluation

SFT evaluation:

  • Instruction-following accuracy
  • Response quality (LLM-as-judge)
  • Format compliance
  • Task-specific benchmarks

RLHF/DPO evaluation:

  • Win rate vs. SFT baseline
  • Human preference evaluation
  • Safety evaluations
  • Capability retention (no regression)
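
A small sketch of the win-rate computation, assuming you already have per-prompt judgments (from humans or an LLM judge) comparing the aligned model against the SFT baseline:

Python
def win_rate(judgments):
    """judgments: list of 'win' / 'tie' / 'loss' outcomes for the aligned model vs. the SFT baseline."""
    wins = sum(j == "win" for j in judgments)
    ties = sum(j == "tie" for j in judgments)
    # Common convention: count ties as half a win
    return (wins + 0.5 * ties) / len(judgments)

# Example: win_rate(["win", "win", "tie", "loss"]) == 0.625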

Common Pitfalls

  1. Insufficient SFT before RLHF: Get SFT right first
  2. Low-quality preference data: Garbage in, garbage out
  3. Over-optimization: Model becomes sycophantic or narrow
  4. Forgetting base capabilities: Evaluate broadly, not just on target task
  5. Ignoring safety: Preference optimization can introduce new failure modes

Conclusion

Post-training transforms base models into useful assistants. SFT teaches instruction following, RLHF/DPO aligns outputs with human preferences. The field is evolving rapidly—DPO simplified RLHF, and new methods continue to emerge.

The key is understanding the purpose of each stage and iterating based on evaluation. Start simple (SFT), measure carefully, and add complexity (RLHF/DPO) only when you have evidence it helps.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
