SFT and RLHF: The Complete Guide to Post-Training LLMs
A deep dive into Supervised Fine-Tuning and Reinforcement Learning from Human Feedback—the techniques that transform base models into useful assistants.
The Post-Training Stack
Pre-trained LLMs are impressive but not immediately useful. They complete text, but they don't follow instructions, refuse harmful requests, or behave helpfully. Post-training transforms these base models into the assistants we use daily.
The modern post-training pipeline:
[Base Model] → Pre-training on internet text
↓
[SFT Model] → Supervised Fine-Tuning on demonstrations
↓
[RLHF Model] → Reinforcement Learning from Human Feedback
↓
[Production Model] → Safety fine-tuning, capability elicitation
This post explains each stage in depth, with practical guidance for implementing these techniques.
Supervised Fine-Tuning (SFT)
What SFT Does
SFT teaches the model to follow instructions by training on examples of desired behavior. Input: instruction + context. Output: ideal response.
The fundamental transformation SFT creates: Base models are trained on raw internet text to predict the next token. They're excellent at continuing text in plausible ways, but they don't understand they should answer questions rather than continue them. A base model sees "What is 2+2?" and thinks "this looks like a math worksheet, I should generate more math questions." SFT rewires this: by training on thousands of instruction→response pairs, the model learns that its job is to respond helpfully, not to continue text.
Why SFT alone creates useful assistants: After SFT, a model can follow instructions, answer questions, write code, and engage in conversation. This is sufficient for many use cases. The original ChatGPT was essentially a GPT-3.5-series base model plus SFT and RLHF. SFT is the workhorse transformation that creates usable AI assistants from raw language models.
Before SFT (base model):
Input: "Write a haiku about machine learning"
Output: "Write a haiku about machine learning applications
in healthcare. Discuss the benefits and challenges..."
[continues text as if completing a document]
After SFT:
Input: "Write a haiku about machine learning"
Output: "Data flows like streams
Neural networks learn and grow
Patterns emerge clear"
SFT Data
Quality data is everything. Types of SFT data:
Why data quality dominates model quality: The model can only learn behaviors present in the training data. If your SFT data contains verbose, meandering responses, your model will be verbose and meandering. If it contains concise, focused responses, your model will be concise and focused. This seems obvious but has profound implications: the best way to improve your model is usually to improve your data, not to tweak training hyperparameters.
The "demonstration distribution" problem: Your model will learn to produce responses that look like your demonstrations. If all your examples are formal, the model will be formal even when informality is appropriate. If all examples are long, the model will pad short answers. Diversity in your SFT data directly translates to flexibility in your model.
Instruction-following demonstrations:
{
  "instruction": "Summarize this article in 3 bullet points",
  "input": "[article text]",
  "output": "• Key point one\n• Key point two\n• Key point three"
}
Conversational demonstrations:
{
  "messages": [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's its population?"},
    {"role": "assistant", "content": "Paris has a population of approximately 2.1 million people in the city proper, and about 12 million in the greater metropolitan area."}
  ]
}
Capability demonstrations:
{
  "instruction": "Solve this math problem step by step",
  "input": "If a train travels at 60 mph for 2.5 hours, how far does it travel?",
  "output": "Let me solve this step by step:\n\n1. Distance = Speed × Time\n2. Speed = 60 mph\n3. Time = 2.5 hours\n4. Distance = 60 × 2.5 = 150 miles\n\nThe train travels 150 miles."
}
SFT Data Collection
Human annotation: Expert annotators write ideal responses. High quality but expensive ($5-20 per example).
Distillation from stronger models: Use GPT-4 or Claude to generate training data for smaller models. Common and effective.
User interaction data: Filter production conversations for high-quality examples. Requires quality signals.
Synthetic generation: Generate instructions programmatically, use LLM for responses, filter for quality.
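A minimal sketch of the synthetic-generation-plus-filtering idea. The seed tasks and the call_llm helper are placeholders for whatever client and task pool you use; the single-digit rating filter is one simple quality signal, not the only option.

import random

SEED_TASKS = ["summarize a news article", "explain a scientific concept",
              "write a short Python function", "draft a polite email"]

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client; returns the model's text completion."""
    raise NotImplementedError

def generate_sft_examples(n_examples: int) -> list[dict]:
    examples = []
    for _ in range(n_examples):
        task = random.choice(SEED_TASKS)
        # 1. Turn a seed task description into a concrete user instruction
        instruction = call_llm(f"Write one specific user instruction that asks a model to {task}.")
        # 2. Generate a candidate response
        response = call_llm(instruction)
        # 3. Filter: ask a judge model for a 1-5 quality rating, keep only high scores
        rating = call_llm(
            "Rate the following response to the instruction on a 1-5 scale. Reply with a single digit.\n\n"
            f"Instruction: {instruction}\n\nResponse: {response}"
        )
        if rating.strip().startswith(("4", "5")):
            examples.append({"instruction": instruction, "input": "", "output": response})
    return examples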
SFT Training
Standard approach:
- Learning rate: 1e-5 to 5e-5
- Epochs: 1-3 (watch for overfitting)
- Batch size: 32-128 (larger is more stable)
- Warmup: 3-10% of training steps
LoRA approach (parameter-efficient):
- LoRA rank: 8-64 (higher for more capacity)
- LoRA alpha: 2× rank
- Apply to: Q, K, V, O projections at minimum; all linear layers for best results
- Learning rate: 1e-4 to 5e-4 (higher than full fine-tuning)
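As a concrete illustration of these settings, here is a minimal sketch using the Hugging Face peft library. The model name is a placeholder, and target_modules must match your architecture's projection layer names.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder name

lora_config = LoraConfig(
    r=16,                      # rank: 8-64 depending on capacity needed
    lora_alpha=32,             # alpha = 2x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Q, K, V, O projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters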
Training dynamics:
Epoch 1: Model learns instruction format
Epoch 2: Quality improves, more consistent outputs
Epoch 3+: Risk of overfitting, monitor validation loss
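To make the recipe above concrete, a minimal sketch of a single SFT step with prompt masking, assuming a Hugging Face-style causal LM where label -100 is ignored by the loss. Batching, padding, truncation, and special-token handling are omitted here.

import torch

def sft_step(model, tokenizer, instruction: str, response: str, optimizer):
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    response_ids = tokenizer(response + tokenizer.eos_token, return_tensors="pt").input_ids

    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss

    outputs = model(input_ids=input_ids, labels=labels)  # cross-entropy over response tokens only
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()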
SFT Best Practices
- Diverse instructions: Cover many task types, phrasings, complexity levels
- Quality over quantity: 1000 excellent examples > 10000 mediocre ones
- Response style consistency: All examples should have consistent tone, format, style
- Include edge cases: What should the model do with ambiguous instructions?
- Balance task types: Don't over-represent any single capability
Reinforcement Learning from Human Feedback (RLHF)
Why RLHF?
SFT has limitations:
- Can only match the quality of demonstrations
- Doesn't learn preferences between acceptable responses
- Harder to encode subtle quality differences
RLHF addresses this by training on preferences—which response is better—rather than single demonstrations.
The key insight behind RLHF: It's easier for humans to compare two responses than to write a perfect response. If you ask someone "write an ideal explanation of quantum computing for a 10-year-old," they'll struggle. But if you show them two explanations and ask "which is better for a 10-year-old?", they can easily judge. RLHF exploits this asymmetry: collect comparisons (easy for humans), train a reward model to predict comparisons (ML), then optimize the LLM to score highly on the reward model (RL).
What RLHF captures that SFT can't: SFT teaches "this is a good response." RLHF teaches "this response is better than that one" across a spectrum of quality. This comparative signal enables the model to learn subtle quality distinctions: more accurate, more helpful, safer, more appropriate tone. The model doesn't just learn what good looks like—it learns to discriminate between degrees of goodness.
The RLHF Pipeline
Step 1: Collect preference data
[Prompt] → [Response A] vs [Response B] → Human labels "A is better"
Step 2: Train reward model
Reward model learns to predict which response humans prefer
Step 3: RL fine-tuning
Policy (LLM) optimizes to produce responses that maximize reward
Preference Data Collection
Pairwise comparisons: Show annotators two responses, ask which is better.
Prompt: "Explain quantum computing simply"
Response A: "Quantum computing uses qubits that can be 0 and 1 simultaneously..."
Response B: "Unlike regular computers that use bits (0 or 1), quantum computers..."
Annotator choice: B (clearer, more accessible)
Ranking: Show multiple responses, rank from best to worst.
Prompt: [same]
Responses: [A, B, C, D]
Ranking: B > D > A > C
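Rankings are usually expanded into pairwise comparisons before training. A small sketch of that expansion; the ranking B > D > A > C above yields six chosen/rejected pairs.

from itertools import combinations

def ranking_to_pairs(prompt: str, ranked_responses: list[str]) -> list[dict]:
    """ranked_responses is ordered best-first; every earlier item beats every later item."""
    pairs = []
    for chosen, rejected in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

pairs = ranking_to_pairs("Explain quantum computing simply", ["B", "D", "A", "C"])
# -> (B,D), (B,A), (B,C), (D,A), (D,C), (A,C)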
Likert ratings: Rate each response independently (1-5 scale).
Response A: Helpfulness 4/5, Accuracy 5/5, Clarity 3/5
Response B: Helpfulness 5/5, Accuracy 4/5, Clarity 5/5
Reward Model Training
The reward model learns to predict human preferences.
Why the reward model is the bottleneck of RLHF: The reward model defines what "good" means for the RL phase. If the reward model has blind spots, the policy will exploit them. If the reward model prefers longer responses regardless of quality, the policy will learn to be verbose. If the reward model can't distinguish subtle quality differences, the policy won't learn them. The quality ceiling of your RLHF model is set by your reward model.
The reward model as a human preference simulator: Think of it this way: you can't have a human judge every response during RL training (millions of responses). So you train a model to simulate human judgment. The RL phase then optimizes against this simulation. This works when the simulation is accurate, but can fail when the reward model is overconfident about edge cases it's never seen.
Architecture: Usually the same architecture as the LLM, with a scalar output head.
Training objective (Bradley-Terry loss):
L = -E[ log σ( r(x, y_chosen) - r(x, y_rejected) ) ]
where r(x, y) is the reward model's score for response y given prompt x, σ is the sigmoid function, y_chosen is the human-preferred (chosen) response, and y_rejected is the human-dispreferred (rejected) response.
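A minimal sketch of what this looks like in code: a scalar value head on top of the LM, trained with the loss above. The backbone call assumes a Hugging Face-style model that exposes hidden states; real implementations add padding-aware batching and gradient accumulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(hidden_size, 1)  # scalar reward head

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask,
                               output_hidden_states=True).hidden_states[-1]
        # Score the final non-padded token of each sequence
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)

def reward_model_loss(rewards_chosen, rewards_rejected):
    # -log sigmoid(r_chosen - r_rejected): push preferred responses above rejected ones
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()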
Training considerations:
- Same base model as policy works well
- Can use LoRA for efficiency
- Need 10K-100K comparisons for good reward model
- Validate on held-out comparisons
RL Optimization with PPO
Why PPO for RLHF?
Proximal Policy Optimization (PPO) became the standard for RLHF because it balances sample efficiency with stability. Unlike simpler policy gradient methods, PPO prevents catastrophically large updates that could destroy the model's capabilities.
The Core PPO Objective:
PPO uses a clipped surrogate objective to limit policy changes:
L_CLIP = E[min(r(θ) × A, clip(r(θ), 1-ε, 1+ε) × A)]
Where:
- r(θ) = π_θ(a|s) / π_θ_old(a|s) (probability ratio)
- A = advantage estimate
- ε = clip range (typically 0.2)
Intuition: The clipping ensures that if an action's probability changes too much (ratio far from 1), we cap its contribution to the gradient. This prevents the policy from changing too drastically in a single update.
The Full RLHF Objective:
For RLHF, we combine PPO with a KL penalty to stay close to the reference (SFT) model:
Objective = E[reward - β × KL(policy || reference)]
Expanded:
reward_total = reward_model(response) - β × log(π(response) / π_ref(response))
The Four-Model Setup:
RLHF with PPO requires managing four models simultaneously:
1. Policy Model (Actor): The LLM being trained
2. Reference Model: Frozen copy of SFT model for KL computation
3. Reward Model: Scores response quality
4. Value Model (Critic): Estimates expected future reward
Memory requirement: ~4× model size (can optimize with sharing)
PPO Training Loop in Detail
The PPO training loop has four distinct steps that repeat each iteration. Understanding each step is critical because they depend on each other—getting one wrong cascades through the entire training process.
Step 1: Rollout Generation
First, we generate responses from the current policy and record the log probabilities. These log probs are essential—we'll compare them to log probs after updates to compute probability ratios for the PPO objective. Using do_sample=True ensures exploration; deterministic generation would collapse to always picking the highest-probability token.
def generate_rollouts(policy, prompts, generation_config):
    """Generate responses and compute log probabilities."""
    rollouts = []
    for prompt in prompts:
        # Sample response from current policy
        response = policy.generate(
            prompt,
            temperature=generation_config.temperature,
            max_tokens=generation_config.max_tokens,
            do_sample=True
        )
        # Compute log probability of generated response
        log_prob = compute_sequence_log_prob(policy, prompt, response)
        rollouts.append({
            'prompt': prompt,
            'response': response,
            'log_prob': log_prob
        })
    return rollouts
Step 2: Reward Computation
Now we score each response with the reward model and apply a KL penalty. The KL penalty is crucial—it prevents the policy from drifting too far from the reference model (the SFT checkpoint). Without it, the model would exploit reward model weaknesses, producing responses that score highly but are actually low quality ("reward hacking").
The formula: total_reward = reward_score - β × KL_divergence. Higher β means more conservative training; lower β allows more exploration but risks instability.
def compute_rewards(rollouts, reward_model, ref_model, kl_coef):
    """Score responses and apply KL penalty."""
    for rollout in rollouts:
        # Get reward model score
        rm_score = reward_model(rollout['prompt'], rollout['response'])
        # Compute KL penalty
        ref_log_prob = compute_sequence_log_prob(
            ref_model, rollout['prompt'], rollout['response']
        )
        kl_penalty = rollout['log_prob'] - ref_log_prob
        # Total reward with KL penalty
        rollout['reward'] = rm_score - kl_coef * kl_penalty
        rollout['kl'] = kl_penalty
        rollout['rm_score'] = rm_score
    return rollouts
Step 3: Advantage Estimation (GAE)
The advantage function tells us "how much better was this action than average?" Raw rewards are noisy—GAE (Generalized Advantage Estimation) smooths this by mixing short-term and long-term estimates.
- Why not just use rewards? High variance makes training unstable.
- Why not just use value estimates? They're biased by the value model's errors.
- GAE solution: Blend them with parameter λ. When λ=1, it's just rewards (high variance). When λ=0, it's just value estimates (high bias). λ=0.95 is a common sweet spot.
The backward pass accumulates advantages from the end of the sequence to the beginning—necessary because each token's advantage depends on future rewards.
def compute_advantages(rollouts, value_model, gamma=1.0, lam=0.95):
    """Compute GAE advantages for each token."""
    for rollout in rollouts:
        values = value_model(rollout['prompt'], rollout['response'])
        # Per-token rewards: in practice, the per-token KL penalty with the
        # reward model score added at the final token
        rewards = rollout['token_rewards']
        advantages = []
        gae = 0
        # Backward pass through tokens
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]
            delta = rewards[t] + gamma * next_value - values[t]
            gae = delta + gamma * lam * gae
            advantages.insert(0, gae)
        rollout['advantages'] = advantages
        rollout['returns'] = [a + v for a, v in zip(advantages, values)]
    return rollouts
Step 4: PPO Update
The actual policy optimization. This is where PPO's "proximal" nature matters: we compute a probability ratio (new policy prob / old policy prob) and clip it to prevent drastic updates.
The clipping logic: if an action's probability increases too much (ratio > 1+ε), we cap the gradient contribution. Same if it decreases too much (ratio < 1-ε). This keeps updates conservative, preventing the policy from changing so much that our collected rollouts become invalid.
We also update the value model using MSE loss against the computed returns—this improves advantage estimation for future iterations.
import torch
import torch.nn.functional as F

def ppo_update(policy, value_model, rollouts, config):
    """Perform PPO update on policy and value model."""
    optimizer = torch.optim.Adam([
        {'params': policy.parameters(), 'lr': config.policy_lr},
        {'params': value_model.parameters(), 'lr': config.value_lr}
    ])
    for epoch in range(config.ppo_epochs):
        for batch in create_minibatches(rollouts, config.batch_size):
            # Current policy log probs
            new_log_probs = compute_sequence_log_prob(
                policy, batch['prompts'], batch['responses']
            )
            # Probability ratio
            ratio = torch.exp(new_log_probs - batch['old_log_probs'])
            # Clipped surrogate objective
            advantages = batch['advantages']
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - config.clip_range,
                                1 + config.clip_range) * advantages
            policy_loss = -torch.min(surr1, surr2).mean()
            # Value function loss
            values = value_model(batch['prompts'], batch['responses'])
            value_loss = F.mse_loss(values, batch['returns'])
            # Combined loss
            loss = policy_loss + config.value_coef * value_loss
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), config.max_grad_norm)
            optimizer.step()
PPO Hyperparameters Deep Dive
KL Coefficient (β): 0.01-0.2
β = 0.01: Aggressive optimization, policy can drift far from reference
β = 0.1: Balanced (common starting point)
β = 0.2: Conservative, stays close to SFT behavior
Adaptive KL: Some implementations adjust β to hit a target KL:
- If KL > target: increase β
- If KL < target: decrease β
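A minimal sketch of such a controller, loosely following the proportional scheme used in early RLHF implementations; the default target and horizon values are illustrative, not tuned.

class AdaptiveKLController:
    def __init__(self, init_kl_coef: float = 0.1, target_kl: float = 6.0, horizon: int = 10000):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, batch_size: int) -> float:
        # Proportional error, clipped so one bad batch can't swing beta too far
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.kl_coef *= 1.0 + error * batch_size / self.horizon
        return self.kl_coef

# Usage: after each rollout batch
# kl_coef = controller.update(mean_kl_this_batch, batch_size)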
Clip Range (ε): 0.1-0.3
ε = 0.1: Very conservative updates
ε = 0.2: Standard choice
ε = 0.3: More aggressive, use with caution
Learning Rates:
Policy LR: 1e-6 to 5e-6 (much lower than SFT)
Value LR: 1e-5 to 5e-5 (can be higher than policy)
Why so low? We're fine-tuning a capable model, not training from scratch.
Large updates can catastrophically harm capabilities.
Batch and Epoch Settings:
Rollout batch size: 512-2048 prompts per iteration
PPO epochs: 2-4 updates per rollout batch
Minibatch size: 64-256 for gradient updates
More PPO epochs = more sample efficiency but risk of overfitting to rollouts
Generation Settings:
Temperature: 0.7-1.0 during training (exploration)
Max tokens: Task-dependent, typically 256-1024
Top-p: 0.95 (standard)
Memory Optimization Strategies
Challenge: Four models in memory is expensive.
Strategy 1: Weight Sharing
# Share base weights between policy and reference
# Only the LoRA adapters differ
policy = load_base_model()
policy.load_lora_adapters("policy_lora")
reference = policy # Same base weights
# Just disable LoRA when computing reference log probs
Strategy 2: Sequential Computation
# Don't keep all models in GPU memory simultaneously
def compute_step(prompt_batch):
# 1. Generate with policy (policy in GPU)
responses = generate(policy, prompts)
policy.to('cpu')
# 2. Score with reward model (reward model in GPU)
reward_model.to('cuda')
rewards = score(reward_model, prompts, responses)
reward_model.to('cpu')
# 3. Compute reference log probs (reference in GPU)
# ... and so on
Strategy 3: Gradient Checkpointing
policy.gradient_checkpointing_enable()
# Trades compute for memory - recomputes activations during backward pass
RLHF Challenges and Mitigations
Reward Hacking: The model finds ways to maximize reward that don't align with intent.
Example 1: Length exploitation
- Reward model flaw: Longer responses score slightly higher
- Result: Model becomes unnecessarily verbose
- Fix: Length normalization, penalize excessive length
Example 2: Sycophancy
- Reward model flaw: Agreeable responses score higher
- Result: Model always agrees with user, even when wrong
- Fix: Include adversarial examples where disagreement is correct
Example 3: Formatting exploits
- Reward model flaw: Bullet points and headers score higher
- Result: Every response uses unnecessary formatting
- Fix: Diverse format examples in preference data
Mitigation strategies:
# Reward model ensemble
rewards = [rm(prompt, response) for rm in reward_models]
final_reward = min(rewards) # Conservative: use minimum
# Reward clipping
reward = torch.clamp(reward, -clip_value, clip_value)
# Auxiliary losses
loss = ppo_loss + aux_weight * auxiliary_loss # e.g., perplexity on held-out data
Mode Collapse: Model converges to narrow range of "safe" responses.
Signs:
- Decreasing response diversity
- High reward but repetitive outputs
- Low entropy in generation
Mitigations:
# Entropy bonus
entropy = -torch.sum(probs * torch.log(probs), dim=-1)
loss = ppo_loss - entropy_coef * entropy.mean()
# Higher temperature during training
generation_config.temperature = 1.0 # Not 0.7
# KL penalty (built into RLHF objective)
# Prevents straying too far from diverse SFT distribution
Catastrophic Forgetting: Model loses capabilities while optimizing for reward.
Monitor these metrics:
- Performance on held-out benchmarks (MMLU, etc.)
- Generation quality on diverse prompts
- Task completion rates outside reward optimization
Mitigations:
# Mix in SFT data
if random.random() < sft_mix_ratio:
    # Do SFT update instead of PPO
    loss = sft_loss(batch)
else:
    loss = ppo_loss(batch)

# Capability-specific evaluation
for capability in ['math', 'coding', 'reasoning']:
    score = evaluate(policy, capability_benchmark)
    if score < threshold:
        # Add capability data to training mix
        pass
RLHF Debugging Checklist
If reward increases but quality decreases:
- Reward hacking—inspect high-reward samples manually
- Check if reward model has obvious exploits
- Add reward clipping or ensemble
If KL explodes:
- Learning rate too high
- Increase KL coefficient (β)
- Check for numerical instabilities
If training is unstable:
- Reduce learning rate
- Increase batch size
- Add gradient clipping
- Check value function accuracy
If no learning happens:
- Verify reward model provides meaningful signal
- Check that advantages have reasonable variance
- Ensure log probs are computed correctly
Direct Preference Optimization (DPO)
DPO simplifies RLHF by eliminating the explicit reward model and RL training loop. Understanding why it works requires diving into the mathematics of RLHF.
The Mathematical Foundation
The RLHF Objective:
RLHF optimizes a policy to maximize reward while staying close to a reference policy:
max_π E[r(x, y)] - β × KL(π || πref)
This KL constraint prevents the policy from deviating too far from the SFT model, avoiding reward hacking.
The Key Insight:
The optimal policy for this constrained optimization has a closed-form solution:
π*(y|x) = (1/Z(x)) × πref(y|x) × exp(r(x,y) / β)
Where Z(x) is the partition function (normalizer)
This means: the optimal policy is the reference policy reweighted by exponentiated reward.
The Reparameterization:
We can invert this relationship to express reward in terms of policies:
r(x, y) = β × log(π(y|x) / πref(y|x)) + β × log Z(x)
Since Z(x) cancels out when computing preference probabilities (it's the same for both responses), we can substitute this into the Bradley-Terry preference model to get the DPO loss directly.
DPO Loss Function
L_DPO = -E[ log σ( β × log(π(y_chosen|x) / πref(y_chosen|x)) - β × log(π(y_rejected|x) / πref(y_rejected|x)) ) ]
Intuitive interpretation:
- log(π(y|x) / πref(y|x)) measures how much more likely the policy makes response y compared to the reference
- We want this ratio to be higher for chosen responses than rejected ones
- The sigmoid converts this difference into a probability
- We minimize the negative log-likelihood of the preference
Gradient behavior:
- When policy already prefers chosen over rejected: small gradient (already good)
- When policy prefers rejected: large gradient pushing toward chosen
- The reference policy acts as an anchor preventing drift
DPO Advantages
- Simpler pipeline: No reward model training, no RL infrastructure
- More stable: Standard supervised learning with well-understood dynamics
- Lower memory: Only need policy and reference model (can share weights with LoRA)
- Faster iteration: Single training phase, easier hyperparameter tuning
- Deterministic: No sampling variance from RL rollouts
DPO Training in Practice
Data format:
{
  "prompt": "Explain machine learning",
  "chosen": "Machine learning is a subset of AI that enables systems to learn patterns from data without explicit programming. It works by...",
  "rejected": "ML is when computers learn stuff on their own I guess..."
}
Hyperparameters:
- β: 0.1-0.5 (lower β = weaker pull toward the reference model, more aggressive preference optimization)
- Learning rate: 5e-7 to 5e-6 (lower than SFT)
- Epochs: 1-3 (watch for overfitting)
- Batch size: 32-128 (larger batches help stability)
Implementation example:
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps,
             beta=0.1):
    """
    Compute DPO loss for a batch of preferences.
    All inputs are log probabilities of shape (batch_size,)
    """
    # Compute log ratios
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # DPO loss: negative log sigmoid of reward difference
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    # Metrics for monitoring
    chosen_better = (chosen_rewards > rejected_rewards).float().mean()
    reward_margin = (chosen_rewards - rejected_rewards).mean()
    return losses.mean(), {
        'accuracy': chosen_better.item(),
        'reward_margin': reward_margin.item()
    }
Training loop considerations:
def train_dpo_epoch(model, ref_model, dataloader, optimizer, beta):
    model.train()
    ref_model.eval()  # Reference model is frozen
    for batch in dataloader:
        # Get log probs from both models
        with torch.no_grad():
            ref_chosen_logps = get_sequence_logps(ref_model, batch['chosen'])
            ref_rejected_logps = get_sequence_logps(ref_model, batch['rejected'])
        policy_chosen_logps = get_sequence_logps(model, batch['chosen'])
        policy_rejected_logps = get_sequence_logps(model, batch['rejected'])
        loss, metrics = dpo_loss(
            policy_chosen_logps, policy_rejected_logps,
            ref_chosen_logps, ref_rejected_logps,
            beta=beta
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
DPO Variants
The success of DPO sparked numerous variants addressing its limitations:
IPO (Identity Preference Optimization): Addresses DPO's tendency to overfit by replacing the log-sigmoid loss with a squared loss on the difference of log-ratios, targeting a fixed margin rather than pushing preferences infinitely apart. Better for noisy preference data.
KTO (Kahneman-Tversky Optimization): Works with unpaired data—you only need examples labeled "good" or "bad", not paired comparisons. Its loss weighs each example's log-ratio against a reference point z_ref, inspired by prospect theory. Useful when paired preference data is hard to collect.
ORPO (Odds Ratio Preference Optimization): Combines SFT and preference optimization in a single objective by adding an odds-ratio preference term to the standard SFT loss. No need for a reference model—the SFT component acts as implicit regularization.
SimPO (Simple Preference Optimization): Removes the reference model entirely by using length-normalized sequence log probabilities as implicit rewards, with a target margin γ between chosen and rejected. Simpler and often competitive with DPO.
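As an example of how lightweight these variants can be, a minimal sketch of a SimPO-style loss, assuming you already have summed sequence log probs and token counts for each response; the default β and γ are illustrative, not tuned values.

import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=2.0, gamma=1.0):
    """Length-normalized implicit rewards with a target margin gamma; no reference model."""
    chosen_rewards = beta * chosen_logps / chosen_lens      # average per-token log prob, scaled
    rejected_rewards = beta * rejected_logps / rejected_lens
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()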
Choosing a variant:
| Method | Best for | Avoids |
|---|---|---|
| DPO | Standard preference tuning | - |
| IPO | Noisy preferences | Overfitting |
| KTO | Unpaired good/bad data | Need for comparisons |
| ORPO | Combined SFT + preference | Reference model overhead |
| SimPO | Maximum simplicity | Reference model entirely |
Common DPO Pitfalls
1. Reference model drift: If you update the reference model during training, you lose the anchor. Keep it frozen.
2. Length exploitation: Models may learn to prefer longer or shorter responses based on spurious correlations in data:
# Monitor length statistics
chosen_lengths = [len(x['chosen']) for x in dataset]
rejected_lengths = [len(x['rejected']) for x in dataset]
# If significantly different, add length normalization
3. Catastrophic forgetting: Aggressive DPO can harm capabilities learned during SFT:
- Use lower learning rates than SFT
- Mix in SFT data (10-20% of batches)
- Increase β to stay closer to reference
4. Preference data quality: DPO amplifies patterns in your data—including spurious ones:
- Validate preference data manually
- Ensure annotator agreement
- Balance across task types
5. β tuning:
β too low (0.01): Aggressive optimization, may diverge from reference
β too high (1.0): Weak signal, slow learning
Sweet spot: Usually 0.1-0.3
DPO vs. RLHF Comparison
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Complexity | High (reward model + RL) | Low (single supervised phase) |
| Stability | Requires careful tuning | Generally stable |
| Compute | 4 models in memory | 2 models in memory |
| Performance | State-of-the-art ceiling | Competitive, sometimes equal |
| Flexibility | Arbitrary reward signals | Pairwise preferences only |
| Iteration speed | Slow (RL rollouts) | Fast (supervised batches) |
| Failure modes | Reward hacking, mode collapse | Length bias, forgetting |
When to use RLHF:
- Complex, multi-objective reward functions
- Iterative preference learning with online data
- Maximum capability extraction worth the engineering cost
- You have RL infrastructure already
When to use DPO:
- Standard preference alignment
- Limited compute or engineering resources
- Rapid iteration on alignment experiments
- Good enough performance is acceptable
Constitutional AI and RLAIF
The Labeling Bottleneck
Human labeling is expensive and doesn't scale. Constitutional AI uses AI feedback instead:
RLAIF (RL from AI Feedback)
Replace human labelers with LLMs:
1. Generate response pairs
2. Ask LLM to judge which is better (with principles)
3. Train reward model on AI judgments
4. Run RL as usual
Principle-based judging:
Evaluate which response better follows these principles:
1. Be helpful and informative
2. Be harmless and avoid dangerous content
3. Be honest and acknowledge uncertainty
Response A: [text]
Response B: [text]
Which response better follows the principles?
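A minimal sketch of turning that judging prompt into preference labels. call_llm is a placeholder judge client; randomizing the A/B order is a common precaution against position bias.

import random

PRINCIPLES = (
    "1. Be helpful and informative\n"
    "2. Be harmless and avoid dangerous content\n"
    "3. Be honest and acknowledge uncertainty"
)

def call_llm(prompt: str) -> str:
    """Placeholder for your judge model client."""
    raise NotImplementedError

def ai_preference_label(prompt: str, response_a: str, response_b: str) -> dict:
    # Randomize presentation order so the judge can't learn a positional shortcut
    swapped = random.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    verdict = call_llm(
        f"Evaluate which response better follows these principles:\n{PRINCIPLES}\n\n"
        f"Prompt: {prompt}\n\nResponse A: {first}\n\nResponse B: {second}\n\n"
        "Answer with a single letter, A or B."
    )
    prefers_first = verdict.strip().upper().startswith("A")
    chosen_is_a = prefers_first != swapped  # undo the swap
    return {"prompt": prompt,
            "chosen": response_a if chosen_is_a else response_b,
            "rejected": response_b if chosen_is_a else response_a}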
Constitutional AI
Train model to self-critique and revise:
Stage 1: Supervised self-critique
Initial response: [potentially problematic response]
Critique: "This response could be harmful because..."
Revision: [improved response]
Stage 2: RL with AI feedback. Use an LLM to generate preference labels based on principles, then run RL as usual.
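A minimal sketch of the Stage 1 critique-and-revise loop. The resulting (prompt, revision) pairs become SFT data, and (revision preferred over initial) pairs can feed Stage 2; call_llm is again a placeholder client.

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client."""
    raise NotImplementedError

def critique_and_revise(prompt: str, principle: str) -> dict:
    initial = call_llm(prompt)
    critique = call_llm(
        f"Identify how the following response could violate this principle: {principle}\n\n"
        f"Prompt: {prompt}\nResponse: {initial}"
    )
    revision = call_llm(
        "Rewrite the response to address the critique while staying helpful.\n\n"
        f"Prompt: {prompt}\nResponse: {initial}\nCritique: {critique}"
    )
    return {"prompt": prompt, "initial": initial, "critique": critique, "revision": revision}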
Advantages
- Scales better than human labeling
- Consistent application of principles
- Can cover more edge cases
- Enables rapid iteration
Limitations
- AI judges have biases
- May not capture nuanced human preferences
- Need diverse prompts to avoid mode collapse
Practical Implementation Guide
Choosing Your Approach
Start with SFT if:
- You have quality demonstration data
- Task is well-defined
- Budget is limited
Add RLHF/DPO if:
- SFT plateau reached
- Subtle quality improvements needed
- Have preference data or can collect it
Data Pipeline
class PostTrainingPipeline:
    def prepare_sft_data(self, raw_data):
        # Format as instruction-response pairs
        # Filter for quality
        # Deduplicate
        # Balance across task types
        return formatted_data

    def prepare_preference_data(self, sft_model, prompts):
        # Generate multiple responses per prompt
        # Collect preferences (human or AI)
        # Format as chosen/rejected pairs
        return preference_data

    def train_sft(self, base_model, sft_data):
        # Standard fine-tuning
        return sft_model

    def train_dpo(self, sft_model, preference_data):
        # DPO optimization
        return aligned_model
Evaluation
SFT evaluation:
- Instruction-following accuracy
- Response quality (LLM-as-judge)
- Format compliance
- Task-specific benchmarks
RLHF/DPO evaluation:
- Win rate vs. SFT baseline
- Human preference evaluation
- Safety evaluations
- Capability retention (no regression)
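A small sketch of win-rate computation against the SFT baseline, assuming each evaluation record already carries a judge verdict of "model", "baseline", or "tie"; counting ties as half a win is one common convention.

def win_rate(judgments: list[str]) -> float:
    """Fraction of comparisons won against the baseline, with ties counted as half."""
    wins = sum(1.0 for j in judgments if j == "model")
    ties = sum(0.5 for j in judgments if j == "tie")
    return (wins + ties) / len(judgments)

print(win_rate(["model", "baseline", "tie", "model"]))  # 0.625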
Common Pitfalls
- Insufficient SFT before RLHF: Get SFT right first
- Low-quality preference data: Garbage in, garbage out
- Over-optimization: Model becomes sycophantic or narrow
- Forgetting base capabilities: Evaluate broadly, not just on target task
- Ignoring safety: Preference optimization can introduce new failure modes
Conclusion
Post-training transforms base models into useful assistants. SFT teaches instruction following, RLHF/DPO aligns outputs with human preferences. The field is evolving rapidly—DPO simplified RLHF, and new methods continue to emerge.
The key is understanding the purpose of each stage and iterating based on evaluation. Start simple (SFT), measure carefully, and add complexity (RLHF/DPO) only when you have evidence it helps.
Related Articles
Fine-Tuning vs Prompting: When to Use Each
A practical guide to deciding between fine-tuning and prompt engineering for your LLM application, based on real-world experience with both approaches.
RL Algorithms for LLM Training: PPO, GRPO, GSPO, and Beyond
A comprehensive guide to reinforcement learning algorithms for LLM alignment—PPO, GRPO, GSPO, REINFORCE++, DPO, and their variants. Understanding the tradeoffs that power modern AI assistants.
Synthetic Data Generation for LLM Training
How to generate high-quality synthetic training data using LLMs—from NVIDIA's Nemotron pipeline to quality filtering techniques and avoiding model collapse.