
SFT and RLHF: The Complete Guide to Post-Training LLMs

A deep dive into Supervised Fine-Tuning and Reinforcement Learning from Human Feedback—the techniques that transform base models into useful assistants.


The Post-Training Stack

Pre-trained LLMs are impressive but not immediately useful. They complete text, but they don't follow instructions, refuse harmful requests, or behave helpfully. Post-training transforms these base models into the assistants we use daily.

The modern post-training pipeline:

Code
[Base Model] → Pre-training on internet text
      ↓
[SFT Model] → Supervised Fine-Tuning on demonstrations
      ↓
[RLHF Model] → Reinforcement Learning from Human Feedback
      ↓
[Production Model] → Safety fine-tuning, capability elicitation

This post explains each stage in depth, with practical guidance for implementing these techniques.

Supervised Fine-Tuning (SFT)

What SFT Does

SFT teaches the model to follow instructions by training on examples of desired behavior. Input: instruction + context. Output: ideal response.

The fundamental transformation SFT creates: Base models are trained on raw internet text to predict the next token. They're excellent at continuing text in plausible ways, but they don't understand they should answer questions rather than continue them. A base model sees "What is 2+2?" and thinks "this looks like a math worksheet, I should generate more math questions." SFT rewires this: by training on thousands of instruction→response pairs, the model learns that its job is to respond helpfully, not to continue text.

Why SFT alone creates useful assistants: After SFT, a model can follow instructions, answer questions, write code, and engage in conversation. This is sufficient for many use cases. The original ChatGPT was essentially a GPT-3.5-class base model plus SFT and RLHF on top. SFT is the workhorse transformation that creates usable AI assistants from raw language models.

Before SFT (base model):

Code
Input: "Write a haiku about machine learning"
Output: "Write a haiku about machine learning applications
        in healthcare. Discuss the benefits and challenges..."
        [continues text as if completing a document]

After SFT:

Code
Input: "Write a haiku about machine learning"
Output: "Data flows like streams
        Neural networks learn and grow
        Patterns emerge clear"

SFT Data

Quality data is everything. Before looking at the main types of SFT data, two points on why data matters so much:

Why data quality dominates model quality: The model can only learn behaviors present in the training data. If your SFT data contains verbose, meandering responses, your model will be verbose and meandering. If it contains concise, focused responses, your model will be concise and focused. This seems obvious but has profound implications: the best way to improve your model is usually to improve your data, not to tweak training hyperparameters.

The "demonstration distribution" problem: Your model will learn to produce responses that look like your demonstrations. If all your examples are formal, the model will be formal even when informality is appropriate. If all examples are long, the model will pad short answers. Diversity in your SFT data directly translates to flexibility in your model.

Instruction-following demonstrations:

JSON
{
  "instruction": "Summarize this article in 3 bullet points",
  "input": "[article text]",
  "output": "• Key point one\n• Key point two\n• Key point three"
}

Conversational demonstrations:

JSON
{
  "messages": [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's its population?"},
    {"role": "assistant", "content": "Paris has a population of approximately 2.1 million people in the city proper, and about 12 million in the greater metropolitan area."}
  ]
}

Capability demonstrations:

JSON
{
  "instruction": "Solve this math problem step by step",
  "input": "If a train travels at 60 mph for 2.5 hours, how far does it travel?",
  "output": "Let me solve this step by step:\n\n1. Distance = Speed × Time\n2. Speed = 60 mph\n3. Time = 2.5 hours\n4. Distance = 60 × 2.5 = 150 miles\n\nThe train travels 150 miles."
}

SFT Data Collection

Human annotation: Expert annotators write ideal responses. High quality but expensive ($5-20 per example).

Distillation from stronger models: Use GPT-4 or Claude to generate training data for smaller models. Common and effective.

User interaction data: Filter production conversations for high-quality examples. Requires quality signals.

Synthetic generation: Generate instructions programmatically, use LLM for responses, filter for quality.
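
As a rough illustration of distillation-based data generation, the sketch below produces candidate responses with a stronger model and keeps only those that pass simple quality gates. The generate_fn callable is a placeholder for whatever teacher model or API you use; real pipelines add deduplication and LLM-as-judge filtering on top.

Python
def distill_sft_data(generate_fn, instructions, min_chars=20):
    """Generate candidate responses with a stronger model and apply simple quality gates."""
    sft_examples = []
    for instruction in instructions:
        response = generate_fn(instruction)  # teacher model call (assumed interface)
        # Cheap quality gates; in practice add dedup, judge scoring, and task-type balancing
        if len(response.strip()) >= min_chars:
            sft_examples.append({
                "instruction": instruction,
                "input": "",
                "output": response.strip(),
            })
    return sft_examples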

SFT Training

Standard approach:

  • Learning rate: 1e-5 to 5e-5
  • Epochs: 1-3 (watch for overfitting)
  • Batch size: 32-128 (larger is more stable)
  • Warmup: 3-10% of training steps
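
A minimal sketch of the core SFT loss computation, assuming a Hugging Face-style causal LM and tokenizer: the prompt tokens are masked with -100 so cross-entropy is computed only on the response tokens.

Python
import torch

def sft_loss(model, tokenizer, instruction, response, device="cuda"):
    """Cross-entropy on response tokens only; prompt tokens are masked out of the loss."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt", add_special_tokens=False).input_ids

    input_ids = torch.cat([prompt_ids, response_ids], dim=1).to(device)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore index: no loss on prompt tokens

    outputs = model(input_ids=input_ids, labels=labels)
    return outputs.loss  # next-token cross-entropy over the response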

LoRA approach (parameter-efficient):

  • LoRA rank: 8-64 (higher for more capacity)
  • LoRA alpha: 2× rank
  • Apply to: Q, K, V, O projections at minimum; all linear layers for best results
  • Learning rate: 1e-4 to 5e-4 (higher than full fine-tuning)
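
One way to apply these LoRA settings in practice is with the peft library; a sketch follows, where the target module names are typical for Llama-style models and may differ for other architectures (base_model is assumed to be an already-loaded causal LM).

Python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                        # LoRA rank
    lora_alpha=32,               # 2x rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Q/K/V/O projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters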

Training dynamics:

Code
Epoch 1: Model learns instruction format
Epoch 2: Quality improves, more consistent outputs
Epoch 3+: Risk of overfitting, monitor validation loss

SFT Best Practices

  1. Diverse instructions: Cover many task types, phrasings, complexity levels
  2. Quality over quantity: 1000 excellent examples > 10000 mediocre ones
  3. Response style consistency: All examples should have consistent tone, format, style
  4. Include edge cases: What should the model do with ambiguous instructions?
  5. Balance task types: Don't over-represent any single capability

Reinforcement Learning from Human Feedback (RLHF)

Why RLHF?

SFT has limitations:

  • Can only match the quality of demonstrations
  • Doesn't learn preferences between acceptable responses
  • Harder to encode subtle quality differences

RLHF addresses this by training on preferences—which response is better—rather than single demonstrations.

The key insight behind RLHF: It's easier for humans to compare two responses than to write a perfect response. If you ask someone "write an ideal explanation of quantum computing for a 10-year-old," they'll struggle. But if you show them two explanations and ask "which is better for a 10-year-old?", they can easily judge. RLHF exploits this asymmetry: collect comparisons (easy for humans), train a reward model to predict comparisons (ML), then optimize the LLM to score highly on the reward model (RL).

What RLHF captures that SFT can't: SFT teaches "this is a good response." RLHF teaches "this response is better than that one" across a spectrum of quality. This comparative signal enables the model to learn subtle quality distinctions: more accurate, more helpful, safer, more appropriate tone. The model doesn't just learn what good looks like—it learns to discriminate between degrees of goodness.

The RLHF Pipeline

Code
Step 1: Collect preference data
        [Prompt] → [Response A] vs [Response B] → Human labels "A is better"

Step 2: Train reward model
        Reward model learns to predict which response humans prefer

Step 3: RL fine-tuning
        Policy (LLM) optimizes to produce responses that maximize reward

Preference Data Collection

Pairwise comparisons: Show annotators two responses, ask which is better.

Code
Prompt: "Explain quantum computing simply"

Response A: "Quantum computing uses qubits that can be 0 and 1 simultaneously..."
Response B: "Unlike regular computers that use bits (0 or 1), quantum computers..."

Annotator choice: B (clearer, more accessible)

Ranking: Show multiple responses, rank from best to worst.

Code
Prompt: [same]
Responses: [A, B, C, D]
Ranking: B > D > A > C

Likert ratings: Rate each response independently (1-5 scale).

Code
Response A: Helpfulness 4/5, Accuracy 5/5, Clarity 3/5
Response B: Helpfulness 5/5, Accuracy 4/5, Clarity 5/5
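
Rankings and ratings are usually expanded into pairwise comparisons before reward model training. A small sketch of that conversion for a best-to-worst ranking:

Python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (chosen, rejected) preference pairs."""
    pairs = []
    for better, worse in combinations(ranked_responses, 2):
        # combinations preserves order, so `better` is ranked above `worse`
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs

# The ranking B > D > A > C yields 6 pairs: (B,D), (B,A), (B,C), (D,A), (D,C), (A,C)
pairs = ranking_to_pairs("Explain quantum computing simply", ["B", "D", "A", "C"])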

Reward Model Training

The reward model learns to predict human preferences.

Why the reward model is the bottleneck of RLHF: The reward model defines what "good" means for the RL phase. If the reward model has blind spots, the policy will exploit them. If the reward model prefers longer responses regardless of quality, the policy will learn to be verbose. If the reward model can't distinguish subtle quality differences, the policy won't learn them. The quality ceiling of your RLHF model is set by your reward model.

The reward model as a human preference simulator: Think of it this way: you can't have a human judge every response during RL training (millions of responses). So you train a model to simulate human judgment. The RL phase then optimizes against this simulation. This works when the simulation is accurate, but can fail when the reward model is overconfident about edge cases it's never seen.

Architecture: Usually the same architecture as the LLM, with a scalar output head.

Training objective:

\mathcal{L} = -\log\sigma\bigl(r(y_w) - r(y_l)\bigr)

where $r(y)$ is the reward model's score for response $y$, $\sigma$ is the sigmoid function, $y_w$ is the human-preferred (chosen) response, and $y_l$ is the human-dispreferred (rejected) response.
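
A minimal sketch of this objective in PyTorch, assuming the reward model returns one scalar per (prompt, response) pair:

Python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    """Pairwise Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    r_chosen = reward_model(prompts, chosen)      # shape: (batch_size,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch_size,)
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    accuracy = (r_chosen > r_rejected).float().mean()  # track on held-out comparisons
    return loss, accuracy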

Training considerations:

  • Same base model as policy works well
  • Can use LoRA for efficiency
  • Need 10K-100K comparisons for good reward model
  • Validate on held-out comparisons

RL Optimization with PPO

Why PPO for RLHF?

Proximal Policy Optimization (PPO) became the standard for RLHF because it balances sample efficiency with stability. Unlike simpler policy gradient methods, PPO prevents catastrophically large updates that could destroy the model's capabilities.

The Core PPO Objective:

PPO uses a clipped surrogate objective to limit policy changes:

Code
L_CLIP = E[min(r(θ) × A, clip(r(θ), 1-ε, 1+ε) × A)]

Where:
- r(θ) = π(a|s) / π_old(a|s)  (probability ratio)
- A = advantage estimate
- ε = clip range (typically 0.2)

Intuition: The clipping ensures that if an action's probability changes too much (ratio far from 1), we cap its contribution to the gradient. This prevents the policy from changing too drastically in a single update.

The Full RLHF Objective:

For RLHF, we combine PPO with a KL penalty to stay close to the reference (SFT) model:

Code
Objective = E[reward - β × KL(policy || reference)]

Expanded:
reward_total = reward_model(response) - β × log(π(response) / π_ref(response))

The Four-Model Setup:

RLHF with PPO requires managing four models simultaneously:

Code
1. Policy Model (Actor): The LLM being trained
2. Reference Model: Frozen copy of SFT model for KL computation
3. Reward Model: Scores response quality
4. Value Model (Critic): Estimates expected future reward

Memory requirement: ~4× model size (can optimize with sharing)

PPO Training Loop in Detail

The PPO training loop has four distinct steps that repeat each iteration. Understanding each step is critical because they depend on each other—getting one wrong cascades through the entire training process.

Step 1: Rollout Generation

First, we generate responses from the current policy and record the log probabilities. These log probs are essential—we'll compare them to log probs after updates to compute probability ratios for the PPO objective. Using do_sample=True ensures exploration; deterministic generation would collapse to always picking the highest-probability token.

Python
def generate_rollouts(policy, prompts, generation_config):
    """Generate responses and compute log probabilities."""
    rollouts = []
    for prompt in prompts:
        # Sample response from current policy
        response = policy.generate(
            prompt,
            temperature=generation_config.temperature,
            max_tokens=generation_config.max_tokens,
            do_sample=True
        )

        # Compute log probability of generated response
        log_prob = compute_sequence_log_prob(policy, prompt, response)

        rollouts.append({
            'prompt': prompt,
            'response': response,
            'log_prob': log_prob
        })
    return rollouts
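
The compute_sequence_log_prob helper used in these sketches isn't defined above; a minimal version for a Hugging Face-style causal LM might look like the following, assuming the prompt and response are already tokenized into id tensors (wrap the call in torch.no_grad() for rollouts and reference computations):

Python
import torch
import torch.nn.functional as F

def compute_sequence_log_prob(model, prompt_ids, response_ids):
    """Sum of log probabilities the model assigns to the response tokens given the prompt."""
    input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)   # (1, seq_len)
    logits = model(input_ids).logits                                 # (1, seq_len, vocab)

    # Logits at position t predict the token at position t+1, so shift by one:
    # predictions for the response tokens live at positions [prompt_len-1, seq_len-2]
    response_logits = logits[0, prompt_ids.shape[0] - 1 : -1, :]     # (response_len, vocab)
    log_probs = F.log_softmax(response_logits, dim=-1)
    token_log_probs = log_probs.gather(1, response_ids.unsqueeze(1)).squeeze(1)
    return token_log_probs.sum()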

Step 2: Reward Computation

Now we score each response with the reward model and apply a KL penalty. The KL penalty is crucial—it prevents the policy from drifting too far from the reference model (the SFT checkpoint). Without it, the model would exploit reward model weaknesses, producing responses that score highly but are actually low quality ("reward hacking").

The formula: total_reward = reward_score - β × KL_divergence. Higher β means more conservative training; lower β allows more exploration but risks instability.

Python
def compute_rewards(rollouts, reward_model, ref_model, kl_coef):
    """Score responses and apply KL penalty."""
    for rollout in rollouts:
        # Get reward model score
        rm_score = reward_model(rollout['prompt'], rollout['response'])

        # Compute KL penalty
        ref_log_prob = compute_sequence_log_prob(
            ref_model, rollout['prompt'], rollout['response']
        )
        kl_penalty = rollout['log_prob'] - ref_log_prob

        # Total reward with KL penalty
        rollout['reward'] = rm_score - kl_coef * kl_penalty
        rollout['kl'] = kl_penalty
        rollout['rm_score'] = rm_score

    return rollouts

Step 3: Advantage Estimation (GAE)

The advantage function tells us "how much better was this action than average?" Raw rewards are noisy—GAE (Generalized Advantage Estimation) smooths this by mixing short-term and long-term estimates.

  • Why not just use rewards? High variance makes training unstable.
  • Why not just use value estimates? They're biased by the value model's errors.
  • GAE solution: Blend them with parameter λ. When λ=1, it's just rewards (high variance). When λ=0, it's just value estimates (high bias). λ=0.95 is a common sweet spot.

The backward pass accumulates advantages from the end of the sequence to the beginning—necessary because each token's advantage depends on future rewards.

Python
def compute_advantages(rollouts, value_model, gamma=1.0, lam=0.95):
    """Compute GAE advantages for each token."""
    for rollout in rollouts:
        values = value_model(rollout['prompt'], rollout['response'])
        # Per-token rewards: in practice the KL penalty is applied per token and the
        # reward model score is added at the final token of the response
        rewards = rollout['token_rewards']

        advantages = []
        gae = 0

        # Backward pass through tokens
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]

            delta = rewards[t] + gamma * next_value - values[t]
            gae = delta + gamma * lam * gae
            advantages.insert(0, gae)

        rollout['advantages'] = advantages
        rollout['returns'] = [a + v for a, v in zip(advantages, values)]

    return rollouts

Step 4: PPO Update

The actual policy optimization. This is where PPO's "proximal" nature matters: we compute a probability ratio (new policy prob / old policy prob) and clip it to prevent drastic updates.

The clipping logic: if an action's probability increases too much (ratio > 1+ε), we cap the gradient contribution. Same if it decreases too much (ratio < 1-ε). This keeps updates conservative, preventing the policy from changing so much that our collected rollouts become invalid.

We also update the value model using MSE loss against the computed returns—this improves advantage estimation for future iterations.

Python
def ppo_update(policy, value_model, rollouts, config):
    """Perform PPO update on policy and value model."""
    optimizer = torch.optim.Adam([
        {'params': policy.parameters(), 'lr': config.policy_lr},
        {'params': value_model.parameters(), 'lr': config.value_lr}
    ])

    for epoch in range(config.ppo_epochs):
        for batch in create_minibatches(rollouts, config.batch_size):
            # Current policy log probs
            new_log_probs = compute_sequence_log_prob(
                policy, batch['prompts'], batch['responses']
            )

            # Probability ratio between new and old policy
            # (a full implementation computes this per token so it aligns with the
            # per-token advantages; shown per sequence here for brevity)
            ratio = torch.exp(new_log_probs - batch['old_log_probs'])

            # Clipped surrogate objective
            advantages = batch['advantages']
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - config.clip_range,
                               1 + config.clip_range) * advantages
            policy_loss = -torch.min(surr1, surr2).mean()

            # Value function loss
            values = value_model(batch['prompts'], batch['responses'])
            value_loss = F.mse_loss(values, batch['returns'])

            # Combined loss
            loss = policy_loss + config.value_coef * value_loss

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), config.max_grad_norm)
            optimizer.step()

PPO Hyperparameters Deep Dive

KL Coefficient (β): 0.01-0.2

Code
β = 0.01: Aggressive optimization, policy can drift far from reference
β = 0.1:  Balanced (common starting point)
β = 0.2:  Conservative, stays close to SFT behavior

Adaptive KL: Some implementations adjust β to hit a target KL:
- If KL > target: increase β
- If KL < target: decrease β
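
A sketch of one common adaptive-KL scheme (proportional control toward a target KL), similar in spirit to what several RLHF libraries implement; the exact update rule and default values vary by implementation:

Python
class AdaptiveKLController:
    """Nudges the KL coefficient so the measured KL tracks a target value."""

    def __init__(self, init_kl_coef=0.1, target_kl=6.0, horizon=10000):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, batch_size):
        # Proportional error, clipped so one bad batch can't swing the coefficient wildly
        error = max(min(current_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.kl_coef *= 1.0 + error * batch_size / self.horizon
        return self.kl_coef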

Clip Range (ε): 0.1-0.3

Code
ε = 0.1: Very conservative updates
ε = 0.2: Standard choice
ε = 0.3: More aggressive, use with caution

Learning Rates:

Code
Policy LR: 1e-6 to 5e-6 (much lower than SFT)
Value LR: 1e-5 to 5e-5 (can be higher than policy)

Why so low? We're fine-tuning a capable model, not training from scratch.
Large updates can catastrophically harm capabilities.

Batch and Epoch Settings:

Code
Rollout batch size: 512-2048 prompts per iteration
PPO epochs: 2-4 updates per rollout batch
Minibatch size: 64-256 for gradient updates

More PPO epochs = more sample efficiency but risk of overfitting to rollouts

Generation Settings:

Code
Temperature: 0.7-1.0 during training (exploration)
Max tokens: Task-dependent, typically 256-1024
Top-p: 0.95 (standard)

Memory Optimization Strategies

Challenge: Four models in memory is expensive.

Strategy 1: Weight Sharing

Python
# Share base weights between policy and reference: train the policy as a LoRA
# adapter on top of the frozen SFT weights, and compute reference log probs
# with the adapter disabled (one possible sketch, using the peft library)
from peft import PeftModel

base = load_base_model()                                  # frozen SFT weights
policy = PeftModel.from_pretrained(base, "policy_lora")   # policy = base + LoRA adapter

# Reference model is the same weights with the adapter turned off
with policy.disable_adapter():
    ref_log_prob = compute_sequence_log_prob(policy, prompt, response)

Strategy 2: Sequential Computation

Python
# Don't keep all models in GPU memory simultaneously
def compute_step(prompts):
    # 1. Generate with policy (policy in GPU)
    responses = generate(policy, prompts)
    policy.to('cpu')

    # 2. Score with reward model (reward model in GPU)
    reward_model.to('cuda')
    rewards = score(reward_model, prompts, responses)
    reward_model.to('cpu')

    # 3. Compute reference log probs (reference in GPU)
    # ... and so on

Strategy 3: Gradient Checkpointing

Python
policy.gradient_checkpointing_enable()
# Trades compute for memory - recomputes activations during backward pass

RLHF Challenges and Mitigations

Reward Hacking: The model finds ways to maximize reward that don't align with intent.

Code
Example 1: Length exploitation
- Reward model flaw: Longer responses score slightly higher
- Result: Model becomes unnecessarily verbose
- Fix: Length normalization, penalize excessive length

Example 2: Sycophancy
- Reward model flaw: Agreeable responses score higher
- Result: Model always agrees with user, even when wrong
- Fix: Include adversarial examples where disagreement is correct

Example 3: Formatting exploits
- Reward model flaw: Bullet points and headers score higher
- Result: Every response uses unnecessary formatting
- Fix: Diverse format examples in preference data

Mitigation strategies:

Python
# Reward model ensemble
rewards = [rm(prompt, response) for rm in reward_models]
final_reward = min(rewards)  # Conservative: use minimum

# Reward clipping
reward = torch.clamp(reward, -clip_value, clip_value)

# Auxiliary losses
loss = ppo_loss + aux_weight * auxiliary_loss  # e.g., perplexity on held-out data

Mode Collapse: Model converges to narrow range of "safe" responses.

Signs:

  • Decreasing response diversity
  • High reward but repetitive outputs
  • Low entropy in generation

Mitigations:

Python
# Entropy bonus
entropy = -torch.sum(probs * torch.log(probs + 1e-8), dim=-1)  # small epsilon avoids log(0)
loss = ppo_loss - entropy_coef * entropy.mean()

# Higher temperature during training
generation_config.temperature = 1.0  # Not 0.7

# KL penalty (built into RLHF objective)
# Prevents straying too far from diverse SFT distribution

Catastrophic Forgetting: Model loses capabilities while optimizing for reward.

Monitor these metrics:

  • Performance on held-out benchmarks (MMLU, etc.)
  • Generation quality on diverse prompts
  • Task completion rates outside reward optimization

Mitigations:

Python
# Mix in SFT data
if random.random() < sft_mix_ratio:
    # Do SFT update instead of PPO
    loss = sft_loss(batch)
else:
    loss = ppo_loss(batch)

# Capability-specific evaluation
for capability in ['math', 'coding', 'reasoning']:
    score = evaluate(policy, capability_benchmark)
    if score < threshold:
        # Add capability data to training mix
        pass

RLHF Debugging Checklist

If reward increases but quality decreases:

  • Reward hacking—inspect high-reward samples manually
  • Check if reward model has obvious exploits
  • Add reward clipping or ensemble

If KL explodes:

  • Learning rate too high
  • Increase KL coefficient (β)
  • Check for numerical instabilities

If training is unstable:

  • Reduce learning rate
  • Increase batch size
  • Add gradient clipping
  • Check value function accuracy

If no learning happens:

  • Verify reward model provides meaningful signal
  • Check that advantages have reasonable variance
  • Ensure log probs are computed correctly

Direct Preference Optimization (DPO)

DPO simplifies RLHF by eliminating the explicit reward model and RL training loop. Understanding why it works requires diving into the mathematics of RLHF.

The Mathematical Foundation

The RLHF Objective:

RLHF optimizes a policy to maximize reward while staying close to a reference policy:

Code
max_π E[r(x, y)] - β × KL(π || πref)

This KL constraint prevents the policy from deviating too far from the SFT model, avoiding reward hacking.

The Key Insight:

The optimal policy for this constrained optimization has a closed-form solution:

Code
π*(y|x) = (1/Z(x)) × πref(y|x) × exp(r(x,y) / β)

Where Z(x) is the partition function (normalizer)

This means: the optimal policy is the reference policy reweighted by exponentiated reward.

The Reparameterization:

We can invert this relationship to express reward in terms of policies:

r(x, y) = \beta \cdot \log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \cdot \log Z(x)

Since Z(x) cancels out when computing preference probabilities (it's the same for both responses), we can substitute this into the Bradley-Terry preference model to get the DPO loss directly.
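
Concretely, the Bradley-Terry model gives the probability that $y_w$ is preferred over $y_l$ as $\sigma(r(x, y_w) - r(x, y_l))$; plugging in the reparameterized reward, the $\beta \log Z(x)$ terms cancel:

p(y_w \succ y_l \mid x) = \sigma\left(\beta \log\frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log\frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)

Maximizing the log-likelihood of the observed preferences under this model yields exactly the DPO loss below.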

DPO Loss Function

\mathcal{L} = -\log\sigma\left(\beta \cdot \left(\log\frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log\frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right)

Intuitive interpretation:

  • log(π(y|x)/πref(y|x)) = how much more likely the policy makes response y compared to reference
  • We want this ratio to be higher for chosen responses than rejected ones
  • The sigmoid converts this difference into a probability
  • We minimize negative log-likelihood of the preference

Gradient behavior:

  • When policy already prefers chosen over rejected: small gradient (already good)
  • When policy prefers rejected: large gradient pushing toward chosen
  • The reference policy acts as an anchor preventing drift

DPO Advantages

  1. Simpler pipeline: No reward model training, no RL infrastructure
  2. More stable: Standard supervised learning with well-understood dynamics
  3. Lower memory: Only need policy and reference model (can share weights with LoRA)
  4. Faster iteration: Single training phase, easier hyperparameter tuning
  5. Deterministic: No sampling variance from RL rollouts

DPO Training in Practice

Data format:

JSON
{
  "prompt": "Explain machine learning",
  "chosen": "Machine learning is a subset of AI that enables systems to learn patterns from data without explicit programming. It works by...",
  "rejected": "ML is when computers learn stuff on their own I guess..."
}

Hyperparameters:

  • β (temperature): 0.1-0.5 (lower β = weaker anchor to the reference model, more aggressive optimization)
  • Learning rate: 5e-7 to 5e-6 (lower than SFT)
  • Epochs: 1-3 (watch for overfitting)
  • Batch size: 32-128 (larger batches help stability)

Implementation example:

Python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps,
             beta=0.1):
    """
    Compute DPO loss for a batch of preferences.

    All inputs are log probabilities of shape (batch_size,)
    """
    # Compute log ratios
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)

    # DPO loss: negative log sigmoid of reward difference
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Metrics for monitoring
    chosen_better = (chosen_rewards > rejected_rewards).float().mean()
    reward_margin = (chosen_rewards - rejected_rewards).mean()

    return losses.mean(), {
        'accuracy': chosen_better.item(),
        'reward_margin': reward_margin.item()
    }

Training loop considerations:

Python
def train_dpo_epoch(model, ref_model, dataloader, optimizer, beta):
    model.train()
    ref_model.eval()  # Reference model is frozen

    for batch in dataloader:
        # Get log probs from both models
        with torch.no_grad():
            ref_chosen_logps = get_sequence_logps(ref_model, batch['chosen'])
            ref_rejected_logps = get_sequence_logps(ref_model, batch['rejected'])

        policy_chosen_logps = get_sequence_logps(model, batch['chosen'])
        policy_rejected_logps = get_sequence_logps(model, batch['rejected'])

        loss, metrics = dpo_loss(
            policy_chosen_logps, policy_rejected_logps,
            ref_chosen_logps, ref_rejected_logps,
            beta=beta
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

DPO Variants

The success of DPO sparked numerous variants addressing its limitations:

IPO (Identity Preference Optimization): Addresses DPO's tendency to overfit by using a different loss:

\mathcal{L} = \left(\log\frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log\frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta}\right)^2

IPO targets a fixed margin rather than pushing preferences infinitely apart. Better for noisy preference data.

KTO (Kahneman-Tversky Optimization): Works with unpaired data—you only need examples labeled "good" or "bad", not paired comparisons:

\mathcal{L}_{\text{good}} = 1 - \sigma\left(\beta \cdot \left(\log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} - z_{\text{ref}}\right)\right)

\mathcal{L}_{\text{bad}} = 1 - \sigma\left(\beta \cdot \left(z_{\text{ref}} - \log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}\right)\right)

where $z_{\text{ref}}$ is a reference point. Useful when paired preference data is hard to collect.

ORPO (Odds Ratio Preference Optimization): Combines SFT and preference optimization in a single objective:

\mathcal{L} = -\log\pi(y_w|x) - \lambda \cdot \log\sigma\left(\log\frac{\text{odds}(y_w)}{\text{odds}(y_l)}\right)

No need for a reference model—the SFT component acts as implicit regularization.

SimPO (Simple Preference Optimization): Removes the reference model entirely by using length-normalized rewards:

\mathcal{L} = -\log\sigma\left(\beta \cdot \left(\frac{r(y_w)}{|y_w|} - \frac{r(y_l)}{|y_l|} - \gamma\right)\right)

where $r(y) = \log\pi(y|x)$ is the policy's log-likelihood of the response, $|y|$ is its length, and $\gamma$ is a target margin. Simpler and often competitive with DPO.

Choosing a variant:

| Method | Best for | Avoids |
|---|---|---|
| DPO | Standard preference tuning | – |
| IPO | Noisy preferences | Overfitting |
| KTO | Unpaired good/bad data | Need for paired comparisons |
| ORPO | Combined SFT + preference | Reference model overhead |
| SimPO | Maximum simplicity | Reference model entirely |

Common DPO Pitfalls

1. Reference model drift: If you update the reference model during training, you lose the anchor. Keep it frozen.

2. Length exploitation: Models may learn to prefer longer or shorter responses based on spurious correlations in data:

Python
# Monitor length statistics
chosen_lengths = [len(x['chosen']) for x in dataset]
rejected_lengths = [len(x['rejected']) for x in dataset]
# If significantly different, add length normalization

3. Catastrophic forgetting: Aggressive DPO can harm capabilities learned during SFT:

  • Use lower learning rates than SFT
  • Mix in SFT data (10-20% of batches)
  • Increase β to stay closer to reference

4. Preference data quality: DPO amplifies patterns in your data—including spurious ones:

  • Validate preference data manually
  • Ensure annotator agreement
  • Balance across task types

5. β tuning:

Code
β too low (0.01): Aggressive optimization, may diverge from reference
β too high (1.0): Tightly anchored to the reference, little movement toward preferences
Sweet spot: Usually 0.1-0.3

DPO vs. RLHF Comparison

| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Complexity | High (reward model + RL) | Low (single supervised phase) |
| Stability | Requires careful tuning | Generally stable |
| Compute | 4 models in memory | 2 models in memory |
| Performance | State-of-the-art ceiling | Competitive, sometimes equal |
| Flexibility | Arbitrary reward signals | Pairwise preferences only |
| Iteration speed | Slow (RL rollouts) | Fast (supervised batches) |
| Failure modes | Reward hacking, mode collapse | Length bias, forgetting |

When to use RLHF:

  • Complex, multi-objective reward functions
  • Iterative preference learning with online data
  • Maximum capability extraction worth the engineering cost
  • You have RL infrastructure already

When to use DPO:

  • Standard preference alignment
  • Limited compute or engineering resources
  • Rapid iteration on alignment experiments
  • Good enough performance is acceptable

Constitutional AI and RLAIF

The Labeling Bottleneck

Human labeling is expensive and doesn't scale. Constitutional AI uses AI feedback instead:

RLAIF (RL from AI Feedback)

Replace human labelers with LLMs:

Code
1. Generate response pairs
2. Ask LLM to judge which is better (with principles)
3. Train reward model on AI judgments
4. Run RL as usual

Principle-based judging:

Code
Evaluate which response better follows these principles:
1. Be helpful and informative
2. Be harmless and avoid dangerous content
3. Be honest and acknowledge uncertainty

Response A: [text]
Response B: [text]

Which response better follows the principles?
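
A sketch of how AI judgments might be turned into preference data; call_llm is a placeholder for whatever judge model you use, and the prompt and parsing format are illustrative rather than a specific API. In practice you would also randomize the A/B order to reduce position bias.

Python
JUDGE_TEMPLATE = """Evaluate which response better follows these principles:
1. Be helpful and informative
2. Be harmless and avoid dangerous content
3. Be honest and acknowledge uncertainty

Response A: {a}
Response B: {b}

Answer with exactly one letter, A or B."""

def collect_ai_preference(call_llm, prompt, response_a, response_b):
    """Label one response pair with an AI judge (call_llm is an assumed callable)."""
    verdict = call_llm(JUDGE_TEMPLATE.format(a=response_a, b=response_b)).strip().upper()
    if verdict not in ("A", "B"):
        return None  # discard judgments that can't be parsed
    chosen, rejected = (response_a, response_b) if verdict == "A" else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}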

Constitutional AI

Train model to self-critique and revise:

Stage 1: Supervised self-critique

Code
Initial response: [potentially problematic response]
Critique: "This response could be harmful because..."
Revision: [improved response]

Stage 2: RL with AI feedback Use LLM to generate preference labels based on principles.

Advantages

  • Scales better than human labeling
  • Consistent application of principles
  • Can cover more edge cases
  • Enables rapid iteration

Limitations

  • AI judges have biases
  • May not capture nuanced human preferences
  • Need diverse prompts to avoid mode collapse

Practical Implementation Guide

Choosing Your Approach

Start with SFT if:

  • You have quality demonstration data
  • Task is well-defined
  • Budget is limited

Add RLHF/DPO if:

  • SFT plateau reached
  • Subtle quality improvements needed
  • Have preference data or can collect it

Data Pipeline

Python
class PostTrainingPipeline:
    def prepare_sft_data(self, raw_data):
        # Format as instruction-response pairs
        # Filter for quality
        # Deduplicate
        # Balance across task types
        return formatted_data

    def prepare_preference_data(self, sft_model, prompts):
        # Generate multiple responses per prompt
        # Collect preferences (human or AI)
        # Format as chosen/rejected pairs
        return preference_data

    def train_sft(self, base_model, sft_data):
        # Standard fine-tuning
        return sft_model

    def train_dpo(self, sft_model, preference_data):
        # DPO optimization
        return aligned_model

Evaluation

SFT evaluation:

  • Instruction-following accuracy
  • Response quality (LLM-as-judge)
  • Format compliance
  • Task-specific benchmarks

RLHF/DPO evaluation:

  • Win rate vs. SFT baseline
  • Human preference evaluation
  • Safety evaluations
  • Capability retention (no regression)
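
A small sketch of the win-rate computation, assuming you already have per-prompt judgments (from humans or an LLM judge) comparing the aligned model against the SFT baseline:

Python
def win_rate(judgments):
    """judgments: list of 'win' / 'tie' / 'loss' outcomes for the aligned model vs. the SFT baseline."""
    wins = sum(j == "win" for j in judgments)
    ties = sum(j == "tie" for j in judgments)
    # Common convention: count ties as half a win
    return (wins + 0.5 * ties) / len(judgments)

# Example: win_rate(["win", "win", "tie", "loss"]) == 0.625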

Common Pitfalls

  1. Insufficient SFT before RLHF: Get SFT right first
  2. Low-quality preference data: Garbage in, garbage out
  3. Over-optimization: Model becomes sycophantic or narrow
  4. Forgetting base capabilities: Evaluate broadly, not just on target task
  5. Ignoring safety: Preference optimization can introduce new failure modes

Conclusion

Post-training transforms base models into useful assistants. SFT teaches instruction following, RLHF/DPO aligns outputs with human preferences. The field is evolving rapidly—DPO simplified RLHF, and new methods continue to emerge.

The key is understanding the purpose of each stage and iterating based on evaluation. Start simple (SFT), measure carefully, and add complexity (RLHF/DPO) only when you have evidence it helps.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
