RLHF Complete Guide: Aligning LLMs with Human Preferences
A comprehensive deep dive into Reinforcement Learning from Human Feedback—from reward modeling to PPO to DPO. Understanding how AI assistants learn to be helpful, harmless, and honest.
What is RLHF and Why Does It Matter?
Reinforcement Learning from Human Feedback (RLHF) is the technique that transforms capable but unpredictable language models into helpful, aligned AI assistants. It's the secret sauce that made ChatGPT feel different from GPT-3, and it remains central to how the best AI systems are trained.
2025: The post-PPO era of alignment. "The era of treating PPO as the only tool for RLHF is over. The movement that DPO started—towards simpler, more stable, and more direct methods—is reaching maturity."
The modern alignment toolkit:
- DPO (Direct Preference Optimization): Treats alignment as classification; no reward model needed. Far less computational overhead than PPO and easier to tune.
- GRPO (Group Relative Policy Optimization): Eliminates the critic model by generating a group of answers per prompt and using their relative ranking. Used by DeepSeek-R1 for math and coding reasoning.
- REINFORCE++: A streamlined REINFORCE variant; results from Logic-RL and PRIME suggest it is more stable than GRPO and faster than PPO.
DeepSeek-R1 confirmed the trend: it combines rule-based math/code rewards with preference rewards for open-ended tasks at a fraction of traditional RLHF budgets, and token-length regularization ("TLDR") dynamically shrinks chains-of-thought without hurting accuracy.
The Alignment Problem
Pre-trained language models are impressive but problematic. They've learned from the internet—which contains helpful information, but also misinformation, harmful content, and countless examples of unhelpful responses. A model trained purely to predict text will happily:
- Generate plausible-sounding misinformation
- Continue harmful or toxic content
- Refuse to answer when it shouldn't
- Answer when it should refuse
- Be unnecessarily verbose or terse
- Ignore what the user actually wants
Supervised Fine-Tuning (SFT) helps by showing the model examples of good responses. But SFT has a fundamental limitation: you can only train on what you can demonstrate. It's easy to write a good response, but how do you train a model to know which of two responses is better? How do you encode subtle preferences like "be confident but not overconfident" or "be helpful but know your limits"?
This is where RLHF comes in. Instead of learning from demonstrations, the model learns from comparisons—judgments about which response is better. These comparisons encode nuanced human preferences that are difficult to demonstrate directly.
The RLHF Insight
The key insight of RLHF is that it's easier to judge quality than to produce it. Consider:
- Writing a perfect essay is hard; ranking two essays by quality is easier
- Composing a helpful response is hard; choosing the more helpful of two responses is easier
- Defining "good" in words is hard; recognizing "better" when you see it is easier
RLHF leverages this asymmetry. Humans provide comparative judgments, a reward model learns to predict those judgments, and then the language model is optimized to produce responses the reward model rates highly.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE RLHF PARADIGM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL SUPERVISED LEARNING: │
│ ──────────────────────────────── │
│ Human: "Here's the correct answer" │
│ Model: Learns to reproduce that answer │
│ │
│ Problem: Can only learn from explicit demonstrations │
│ Problem: One "right answer" doesn't capture preference nuances │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RLHF APPROACH: │
│ ────────────── │
│ Human: "Response A is better than Response B" │
│ Reward Model: Learns to predict which response humans prefer │
│ Policy Model: Learns to generate responses the reward model likes │
│ │
│ Advantage: Captures nuanced preferences through comparison │
│ Advantage: Can improve beyond demonstrated examples │
│ Advantage: Aligns with what humans actually want │
│ │
└─────────────────────────────────────────────────────────────────────────┘
What RLHF Achieves
RLHF enables training for qualities that are difficult to specify directly:
Helpfulness: The model learns to actually address user needs, not just generate plausible text.
Harmlessness: The model learns to refuse harmful requests while remaining helpful for legitimate ones.
Honesty: The model learns to express uncertainty, acknowledge limitations, and avoid confident-sounding hallucinations.
Tone and Style: The model learns subtle stylistic preferences—conversational but professional, confident but not arrogant.
Following Instructions: The model learns to actually do what users ask, including respecting formatting requests and constraints.
Knowing When to Stop: The model learns appropriate response length—comprehensive when needed, concise when appropriate.
The Complete RLHF Pipeline
RLHF is not a single technique but a multi-stage pipeline. Each stage builds on the previous one:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE RLHF PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: SUPERVISED FINE-TUNING (SFT) │
│ ───────────────────────────────────── │
│ │
│ ┌─────────────┐ ┌─────────────────────────────────────────────┐ │
│ │ Base Model │ ──→ │ Train on (prompt, good_response) pairs │ │
│ │ (Pre-trained)│ │ Standard supervised learning │ │
│ └─────────────┘ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ SFT Model │ │
│ └──────┬──────┘ │
│ │ │
│ ─────────────────────────────────────────┼─────────────────────────── │
│ │ │
│ STAGE 2: REWARD MODEL TRAINING │ │
│ ────────────────────────────── │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Generate multiple responses per prompt using SFT model │ │
│ │ Have humans rank/compare responses │ │
│ │ Train reward model to predict human preferences │ │
│ └──────────────────────┬──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Reward Model │ │
│ │ R(prompt, resp)│ │
│ └────────┬────────┘ │
│ │ │
│ ───────────────────────┼───────────────────────────────────────────── │
│ │ │
│ STAGE 3: RL OPTIMIZATION │
│ ──────────────────────── │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Use RL (PPO) to optimize SFT model to maximize reward │ │
│ │ KL penalty prevents diverging too far from SFT model │ │
│ │ Iterate: generate → score → update → repeat │ │
│ └──────────────────────┬──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ RLHF Model │ │
│ │ (Final) │ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Let's examine each stage in detail.
Stage 1: Supervised Fine-Tuning (SFT)
Before RLHF can begin, you need a model that can generate reasonable responses. This is where SFT comes in.
Why SFT First?
You might wonder: why not apply RLHF directly to the base model? Several reasons:
Initialization matters: RL optimization is sensitive to where you start. A base model produces completions, not responses—RLHF would struggle to find the right direction from such a starting point.
Efficiency: SFT is more sample-efficient than RL for learning basic instruction-following. It's faster to learn "respond helpfully" from demonstrations than from trial and error.
Stability: Starting RL from a reasonable policy (SFT model) produces more stable training than starting from a random policy (base model).
Exploration: The SFT model provides a good prior for exploration during RL. Without it, the RL agent might explore irrelevant parts of response space.
SFT Training for RLHF
SFT for RLHF has some specific considerations:
Response diversity: The SFT model will be used to generate candidate responses for comparison. If it only produces one type of response, the reward model won't learn to distinguish quality. Train on diverse response styles.
Avoiding over-optimization: The SFT model becomes the "reference" for KL penalty in RLHF. If SFT is over-optimized on narrow data, the RLHF model can't deviate much without penalty. Train broadly.
Quality over quantity: The SFT model sets the baseline. Better SFT = better starting point for RLHF = better final model. Invest in high-quality SFT data.
┌─────────────────────────────────────────────────────────────────────────┐
│ SFT FOR RLHF: KEY CONSIDERATIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING DATA CHARACTERISTICS: │
│ │
│ ✓ Diverse topics and instruction types │
│ ✓ Various response styles (concise, detailed, formal, casual) │
│ ✓ Mix of easy and challenging queries │
│ ✓ Responses that model good behavior across dimensions │
│ │
│ ✗ Avoid: Narrow, repetitive response patterns │
│ ✗ Avoid: Only perfect responses (include "good enough" examples) │
│ ✗ Avoid: Over-optimizing on specific metrics │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY DIVERSITY MATTERS: │
│ │
│ If SFT model always responds in one style: │
│ • Reward model only sees that style │
│ • Can't learn to prefer one style over another │
│ • RLHF has limited room to improve │
│ │
│ If SFT model has diverse responses: │
│ • Reward model sees variety │
│ • Learns nuanced preferences │
│ • RLHF can meaningfully optimize │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Stage 2: Reward Modeling
The reward model is the heart of RLHF. It learns to predict human preferences, providing the signal that guides model optimization.
What is a Reward Model?
A reward model is a function that takes a (prompt, response) pair and outputs a scalar score indicating quality:
R(prompt, response) → scalar score
Higher scores indicate responses that humans would prefer. The reward model learns this mapping from human comparison data.
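As a concrete sketch, a reward model is usually just a causal transformer trunk with a scalar head on top. The PyTorch snippet below is a minimal illustration under the assumption that the backbone returns last_hidden_state; the names RewardModel and value_head are ours, not a specific library's API.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Minimal sketch: a transformer backbone plus a scalar value head."""

    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.backbone = base_model                       # any causal LM trunk (assumed)
        self.value_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        # hidden: (batch, seq_len, hidden_size); assumes backbone exposes last_hidden_state
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # score the last non-padding token of each sequence
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # (batch,) scalar rewards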
Collecting Comparison Data
The standard approach for collecting training data for the reward model:
┌─────────────────────────────────────────────────────────────────────────┐
│ COMPARISON DATA COLLECTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: Sample prompts from your target distribution │
│ ───────────────────────────────────────────────────── │
│ • Real user queries (if available) │
│ • Synthetic prompts covering target use cases │
│ • Red-team prompts for safety training │
│ │
│ STEP 2: Generate multiple responses per prompt │
│ ───────────────────────────────────────────────── │
│ • Use SFT model to generate 2-8 responses per prompt │
│ • Vary temperature/sampling to get diversity │
│ • Optionally include responses from different model versions │
│ │
│ STEP 3: Have humans compare responses │
│ ─────────────────────────────────────── │
│ │
│ Prompt: "Explain quantum entanglement simply" │
│ │
│ Response A: "Quantum entanglement is when two particles..." │
│ Response B: "It's like having two magic coins that always..." │
│ │
│ Human judgment: B > A (better analogy for "simply") │
│ │
│ COMPARISON FORMATS: │
│ ────────────────── │
│ • Binary: A > B or B > A │
│ • With ties: A > B, B > A, or A ≈ B │
│ • Ranking: Order all K responses by preference │
│ • Rating: Score each response 1-5 (then derive comparisons) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Training the Reward Model
The reward model is typically initialized from the SFT model or a similar pretrained model. It's trained using the Bradley-Terry model of preferences:
┌─────────────────────────────────────────────────────────────────────────┐
│ REWARD MODEL TRAINING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ARCHITECTURE: │
│ ───────────── │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Reward Model │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ prompt │ │ Transformer │ │ Linear │ │ │
│ │ │ + response │ ──→ │ Encoder │ ──→ │ Head │ ──→ R │ │
│ │ │ tokens │ │ (from SFT) │ │ (scalar out)│ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ The final hidden state (or mean pooled) is projected to scalar │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LOSS FUNCTION (Bradley-Terry): │
│ ────────────────────────────── │
│ │
│ For a comparison where response_w (winner) > response_l (loser): │
│ │
│ Loss = -log(σ(R(prompt, response_w) - R(prompt, response_l))) │
│ │
│ Where σ is the sigmoid function. │
│ │
│ Intuition: Maximize the probability that the winner scores higher │
│ than the loser. The sigmoid converts score difference to probability. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ HANDLING RANKINGS: │
│ ────────────────── │
│ │
│ If humans rank K responses: r₁ > r₂ > r₃ > ... > rₖ │
│ │
│ Convert to pairwise comparisons: │
│ • r₁ > r₂, r₁ > r₃, ..., r₁ > rₖ │
│ • r₂ > r₃, r₂ > r₄, ..., r₂ > rₖ │
│ • ... and so on │
│ │
│ This gives K(K-1)/2 comparisons per ranking. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
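In code, the Bradley-Terry loss above is essentially a one-liner. A minimal PyTorch sketch, assuming a reward-model interface like the earlier snippet and pre-tokenized winner/loser batches:

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_winner, reward_loser):
    """Pairwise loss: -log sigmoid(R_w - R_l), averaged over the batch.

    reward_winner, reward_loser: (batch,) tensors of reward-model scores.
    """
    return -F.logsigmoid(reward_winner - reward_loser).mean()

# usage sketch (reward_model and the tokenized pairs are assumed to exist):
# r_w = reward_model(winner_ids, winner_mask)
# r_l = reward_model(loser_ids, loser_mask)
# loss = bradley_terry_loss(r_w, r_l)
# loss.backward()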
Reward Model Considerations
Size matters: Larger reward models generally produce better signals, but they also slow down RL training (every generation must be scored). Common practice is using a reward model similar in size to the policy model.
Overoptimization: If the policy model is optimized too aggressively against the reward model, it will find "reward hacks"—responses that score highly but aren't actually good. This is a major challenge in RLHF.
Calibration: Reward model scores are relative, not absolute. A score of 2.5 only means "better than things that score 2.0," not "objectively good." Be careful interpreting absolute scores.
Distribution shift: The reward model is trained on SFT model outputs. During RL, the policy model's outputs shift. The reward model may behave unpredictably on out-of-distribution outputs.
Reward Model Evaluation
How do you know if your reward model is good?
┌─────────────────────────────────────────────────────────────────────────┐
│ REWARD MODEL EVALUATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ACCURACY METRICS: │
│ ───────────────── │
│ │
│ Pairwise accuracy: How often does the RM rank correctly? │
│ • On held-out comparisons from same distribution │
│ • On comparisons from different annotators │
│ • On adversarial/edge cases │
│ │
│ Typical accuracy: 65-75% (human agreement is often ~70-80%) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CORRELATION WITH HUMANS: │
│ ───────────────────────── │
│ │
│ • Kendall's tau between RM ranking and human ranking │
│ • Spearman correlation for ordinal scores │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ QUALITATIVE ANALYSIS: │
│ ────────────────────── │
│ │
│ • Does RM prefer helpful responses? │
│ • Does RM penalize harmful content? │
│ • Does RM handle edge cases reasonably? │
│ • Are there obvious failure modes? │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PROXY REWARD VERIFICATION: │
│ ────────────────────────── │
│ │
│ Can a model "hack" the reward? Test with adversarial generations: │
│ • Very long responses (does RM prefer length?) │
│ • Repetitive responses (does RM notice?) │
│ • Confidently wrong responses (does RM penalize?) │
│ • Responses that say what user wants to hear vs. truth │
│ │
└─────────────────────────────────────────────────────────────────────────┘
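Pairwise accuracy is straightforward to compute once you have held-out comparisons. A small sketch, assuming the same reward-model interface as above and a hypothetical batch format:

import torch

@torch.no_grad()
def pairwise_accuracy(reward_model, eval_pairs):
    """Fraction of held-out comparisons where the RM scores the winner above the loser.

    eval_pairs: iterable of dicts with pre-tokenized winner/loser inputs (assumed format).
    """
    correct, total = 0, 0
    for batch in eval_pairs:
        r_w = reward_model(batch["winner_ids"], batch["winner_mask"])
        r_l = reward_model(batch["loser_ids"], batch["loser_mask"])
        correct += (r_w > r_l).sum().item()
        total += r_w.numel()
    return correct / max(total, 1)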
Stage 3: Reinforcement Learning Optimization
With a reward model in hand, we can now optimize the language model using reinforcement learning. The dominant algorithm is Proximal Policy Optimization (PPO).
The RL Formulation
The language model is viewed as a policy in an RL setting:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM AS RL AGENT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ RL TERMINOLOGY → LLM EQUIVALENT │
│ ───────────────────────────────── │
│ │
│ State: The prompt + tokens generated so far │
│ Action: The next token to generate │
│ Policy: The language model π(token | context) │
│ Reward: Reward model score (given at end of response) │
│ Episode: One complete (prompt, response) generation │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE RL OBJECTIVE: │
│ ───────────────── │
│ │
│ Maximize: E[R(prompt, response)] - β · KL(π || π_ref) │
│ │
│ Where: │
│ • E[R] = expected reward from generated responses │
│ • KL(π || π_ref) = divergence from reference (SFT) model │
│ • β = KL penalty coefficient │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY THE KL PENALTY? │
│ ──────────────────── │
│ │
│ Without it: │
│ • Model can drift arbitrarily far from SFT │
│ • May find reward hacks that exploit RM weaknesses │
│ • Can lose language capability in pursuit of reward │
│ • Training becomes unstable │
│ │
│ With KL penalty: │
│ • Model stays close to known-good SFT policy │
│ • Limits ability to exploit RM weaknesses │
│ • Preserves language capability │
│ • More stable training │
│ │
│ The penalty says: "Improve on SFT, but don't go crazy" │
│ │
└─────────────────────────────────────────────────────────────────────────┘
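In practice the KL penalty is applied per token while the reward-model score arrives only at the end of the response. Below is a minimal sketch of that reward shaping, with illustrative names and the common approximation of per-token KL as log π minus log π_ref at the sampled tokens:

import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards for one response: -beta * KL estimate at every token,
    plus the reward-model score added at the final token.

    rm_score: scalar; policy_logprobs, ref_logprobs: (resp_len,) log-probs of the
    sampled tokens under the policy and the frozen reference model.
    """
    kl_per_token = policy_logprobs - ref_logprobs   # simple per-token KL estimate
    rewards = -beta * kl_per_token
    rewards[-1] = rewards[-1] + rm_score            # terminal reward from the RM
    return rewards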
PPO: Proximal Policy Optimization
PPO is the workhorse algorithm for RLHF. Let's understand it deeply.
Why PPO?
RL algorithms face a fundamental tension: you want to improve the policy based on collected experience, but large updates can destabilize training. PPO solves this by limiting how much the policy can change in each update.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE PPO INSIGHT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ NAIVE POLICY GRADIENT: │
│ ────────────────────── │
│ ∇J = E[∇log π(a|s) · A(s,a)] │
│ │
│ Problem: Can make arbitrarily large updates │
│ Result: Training is unstable, can collapse │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRUST REGION METHODS (TRPO): │
│ ──────────────────────────── │
│ Constrain KL(π_new || π_old) < δ │
│ │
│ Problem: Requires expensive second-order optimization │
│ Result: Works well but slow │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PPO SOLUTION: │
│ ───────────── │
│ Use a "clipped" objective that automatically limits updates: │
│ │
│ L = min(r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A) │
│ │
│ Where r(θ) = π_new(a|s) / π_old(a|s) is the probability ratio │
│ │
│ If r(θ) goes outside [1-ε, 1+ε], the gradient is zeroed │
│ This prevents large policy changes automatically │
│ │
│ Result: Stable training with simple first-order optimization │
│ │
└─────────────────────────────────────────────────────────────────────────┘
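The clipped objective translates directly into a few lines of PyTorch. A sketch, assuming per-token log-probs and advantages are precomputed and already masked to response tokens:

import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: maximize min(r*A, clip(r, 1-eps, 1+eps)*A).

    All inputs are per-token tensors of the same shape; returns a loss to minimize.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)   # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()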
PPO Components for RLHF
A complete PPO setup for RLHF requires four models:
┌─────────────────────────────────────────────────────────────────────────┐
│ PPO COMPONENTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MODEL PURPOSE TRAINED? │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Policy Model The LLM being optimized Yes (main goal) │
│ (Actor) Generates responses │
│ │
│ Reference Model Frozen copy of SFT model No (frozen) │
│ (π_ref) Used for KL penalty │
│ │
│ Reward Model Scores (prompt, response) No (frozen) │
│ Provides training signal Pre-trained │
│ │
│ Value Model Estimates expected return Yes │
│ (Critic) Used for advantage estimation │
│ Reduces variance │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MEMORY REQUIREMENTS (for a 7B model): │
│ │
│ Policy (trainable): ~14 GB + optimizer states │
│ Reference (frozen): ~14 GB │
│ Reward (frozen): ~14 GB │
│ Value (trainable): ~14 GB + optimizer states │
│ │
│ Total: 4 full model copies = very memory intensive! │
│ This is why PPO for RLHF is expensive. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The PPO Training Loop
┌─────────────────────────────────────────────────────────────────────────┐
│ PPO TRAINING LOOP │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FOR each training iteration: │
│ │
│ 1. SAMPLE PROMPTS │
│ ─────────────── │
│ Sample a batch of prompts from your dataset │
│ │
│ 2. GENERATE RESPONSES │
│ ──────────────────── │
│ Use Policy Model to generate responses │
│ (This is the "rollout" or "experience collection" phase) │
│ │
│ 3. COMPUTE REWARDS │
│ ─────────────── │
│ Score each (prompt, response) with Reward Model │
│ Add KL penalty: reward = R(p,r) - β·KL(policy || reference) │
│ │
│ 4. ESTIMATE ADVANTAGES │
│ ──────────────────── │
│ Use Value Model to estimate advantage at each token │
│ A(t) = Σ γⁱ r(t+i) + γⁿ V(s_n) - V(s_t) │
│ (GAE - Generalized Advantage Estimation often used) │
│ │
│ 5. UPDATE POLICY │
│ ───────────── │
│ For multiple epochs on collected experience: │
│ Compute PPO clipped objective │
│ Update Policy Model via gradient descent │
│ │
│ 6. UPDATE VALUE MODEL │
│ ────────────────── │
│ Train Value Model to better predict returns │
│ L_value = (V(s) - returns)² │
│ │
│ 7. LOGGING AND CHECKPOINTING │
│ ───────────────────────── │
│ Track rewards, KL, loss curves │
│ Save checkpoints periodically │
│ │
│ REPEAT until converged or budget exhausted │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Understanding the Advantage Function
The advantage function is crucial for stable PPO training:
┌─────────────────────────────────────────────────────────────────────────┐
│ ADVANTAGE ESTIMATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ WHAT IS ADVANTAGE? │
│ ────────────────── │
│ │
│ A(s, a) = Q(s, a) - V(s) │
│ │
│ Advantage answers: "How much better is this action than average?" │
│ │
│ • A > 0: This action is better than typical → reinforce it │
│ • A < 0: This action is worse than typical → discourage it │
│ • A ≈ 0: This action is about average → little change │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY USE ADVANTAGE (not just reward)? │
│ ────────────────────────────────────── │
│ │
│ Using raw rewards has high variance: │
│ • Some prompts are "easy" (high reward regardless of response) │
│ • Some prompts are "hard" (low reward regardless of response) │
│ • This makes gradient estimates noisy │
│ │
│ Advantage subtracts baseline (value), reducing variance: │
│ • On easy prompts: high reward, high baseline → small advantage │
│ • On hard prompts: low reward, low baseline → small advantage │
│ • Credit is given for being better than expected │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GAE (Generalized Advantage Estimation): │
│ ──────────────────────────────────────── │
│ │
│ GAE balances bias and variance with parameter λ: │
│ │
│ A_GAE = Σ (γλ)ⁱ δ_t+i │
│ │
│ Where δ_t = r_t + γV(s_t+1) - V(s_t) │
│ │
│ λ = 0: Use only one-step TD (low variance, high bias) │
│ λ = 1: Use full Monte Carlo (high variance, low bias) │
│ λ ≈ 0.95: Common choice (good balance) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
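Here is a minimal sketch of GAE for a single response, assuming per-token rewards like those in the reward-shaping snippet and one value estimate per token (γ is often set to 1.0 in RLHF since responses are short episodes):

import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one response.

    rewards: (T,) per-token rewards.
    values:  (T+1,) value estimates, with values[T] the bootstrap value (0 at episode end).
    Returns (advantages, returns), each of shape (T,).
    """
    T = rewards.size(0)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:T]
    return advantages, returns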
The KL Penalty: Staying Grounded
The KL penalty is essential for stable RLHF. Let's understand it deeply:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE KL PENALTY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DEFINITION: │
│ ─────────── │
│ │
│ KL(π || π_ref) = E_π[log(π(y|x)) - log(π_ref(y|x))] │
│ │
│ This measures how much the current policy has diverged from the │
│ reference (SFT) policy. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHAT HAPPENS WITHOUT KL PENALTY: │
│ ───────────────────────────────── │
│ │
│ The model is free to change arbitrarily to maximize reward. │
│ It will find "reward hacks": │
│ │
│ • If RM slightly prefers longer responses: │
│ → Model generates extremely long, repetitive responses │
│ │
│ • If RM slightly prefers confident tone: │
│ → Model becomes overconfident, even when wrong │
│ │
│ • If RM has blind spots: │
│ → Model exploits them relentlessly │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHAT KL PENALTY DOES: │
│ ────────────────────── │
│ │
│ It says: "Any deviation from the SFT policy has a cost" │
│ │
│ Reward = R(prompt, response) - β × KL(π || π_ref) │
│ │
│ • Small deviations: Low penalty, model can improve │
│ • Large deviations: High penalty, model constrained │
│ │
│ The model can only change if the reward improvement exceeds │
│ the KL cost. This prevents runaway optimization. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CHOOSING β (KL coefficient): │
│ ──────────────────────────── │
│ │
│ β too low: │
│ • Model diverges too much from SFT │
│ • May find reward hacks │
│ • Unstable training │
│ │
│ β too high: │
│ • Model can barely change from SFT │
│ • Limited improvement possible │
│ • Wasted RL compute │
│ │
│ Typical range: β = 0.01 - 0.2 │
│ │
│ Adaptive β: Some systems adjust β to target a specific KL range │
│ (e.g., keep KL between 1 and 10 by adjusting β) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
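The adaptive-β idea mentioned above is often implemented as a simple proportional controller. A sketch, loosely in the style of the InstructGPT-era controllers; the target KL, horizon, and clipping constants are illustrative:

class AdaptiveKLController:
    """Nudges beta toward a target KL: raise beta when KL overshoots, lower it when under."""

    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # clipped proportional error between observed and target KL
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta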
Direct Preference Optimization (DPO)
While PPO is effective, it's complex and resource-intensive. DPO offers a simpler alternative.
The DPO Insight
DPO's key insight is that you can derive a closed-form solution for the optimal policy under certain assumptions, eliminating the need for a separate reward model and RL training loop.
┌─────────────────────────────────────────────────────────────────────────┐
│ DPO: THE SIMPLIFICATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PPO PIPELINE: │
│ ───────────── │
│ │
│ Preferences → Train RM → RL with PPO → Final Model │
│ │
│ Components: 4 models, complex training loop, many hyperparameters │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DPO PIPELINE: │
│ ───────────── │
│ │
│ Preferences → Direct optimization → Final Model │
│ │
│ Components: 2 models (policy + reference), supervised-like training │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE MATHEMATICAL INSIGHT: │
│ ───────────────────────── │
│ │
│ Under the RLHF objective with KL penalty, the optimal policy is: │
│ │
│ π*(y|x) ∝ π_ref(y|x) · exp(R(x,y) / β) │
│ │
│ Rearranging, the reward can be expressed in terms of policies: │
│ │
│ R(x,y) = β · log(π*(y|x) / π_ref(y|x)) + β · log Z(x) │
│ │
│ Where Z(x) is a normalizing constant. │
│ │
│ Substituting into the Bradley-Terry preference model: │
│ │
│ P(y_w > y_l | x) = σ(β · log(π(y_w|x)/π_ref(y_w|x)) │
│ - β · log(π(y_l|x)/π_ref(y_l|x))) │
│ │
│ The Z(x) terms cancel! We can train directly on preferences │
│ without ever computing rewards explicitly. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
DPO Loss Function
┌─────────────────────────────────────────────────────────────────────────┐
│ DPO LOSS FUNCTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ L_DPO = -E[log σ(β(log π(y_w|x)/π_ref(y_w|x) │
│ - log π(y_l|x)/π_ref(y_l|x)))] │
│ │
│ Where: │
│ • (x, y_w, y_l) is a preference triple (prompt, winner, loser) │
│ • π is the policy being trained │
│ • π_ref is the frozen reference (SFT) model │
│ • β is the KL penalty coefficient │
│ • σ is the sigmoid function │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INTUITION: │
│ ────────── │
│ │
│ The loss encourages: │
│ • Increasing π(y_w|x) / π_ref(y_w|x) - make winners more likely │
│ • Decreasing π(y_l|x) / π_ref(y_l|x) - make losers less likely │
│ │
│ The log ratios measure how much the policy has changed from │
│ reference. DPO directly optimizes these ratios to match preferences. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ IMPLEMENTATION: │
│ ─────────────── │
│ │
│ For each preference pair: │
│ │
│ 1. Compute log probs under policy: │
│ log_p_w = sum(log π(token | context)) for winner │
│ log_p_l = sum(log π(token | context)) for loser │
│ │
│ 2. Compute log probs under reference (no gradient): │
│ log_ref_w = sum(log π_ref(token | context)) for winner │
│ log_ref_l = sum(log π_ref(token | context)) for loser │
│ │
│ 3. Compute loss: │
│ logits = β * ((log_p_w - log_ref_w) - (log_p_l - log_ref_l)) │
│ loss = -log_sigmoid(logits) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
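The three implementation steps above collapse into a few lines. A minimal PyTorch sketch, where each argument is a batch of summed log-probabilities computed separately under the policy and the frozen reference model:

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument: (batch,) tensor of summed log-probs of the winner (w) or
    loser (l) response under the policy or the reference model.
    """
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()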
DPO vs PPO Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ DPO VS PPO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ASPECT PPO DPO │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Models needed 4 (policy, ref, 2 (policy, reference) │
│ reward, value) │
│ │
│ Training loop Complex RL loop Simple supervised │
│ │
│ Hyperparameters Many (PPO-specific) Fewer (mostly β) │
│ │
│ Stability Can be unstable Generally stable │
│ │
│ Memory Very high (4 models) Lower (2 models) │
│ │
│ Online learning Yes (generates new No (fixed preference │
│ samples during data) │
│ training) │
│ │
│ Sample efficiency Lower Higher │
│ │
│ Reward hacking Can happen Less prone │
│ │
│ Iteration speed Slow (RL) Fast (supervised) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHEN TO USE PPO: │
│ • When you need online learning (generating new samples) │
│ • When you have abundant compute │
│ • When you need precise control over reward optimization │
│ • When using non-pairwise rewards │
│ │
│ WHEN TO USE DPO: │
│ • When you have fixed preference data │
│ • When compute is limited │
│ • When you want simpler training │
│ • When stability is important │
│ • As a first approach (easier to get working) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Other Alignment Methods
The field has developed many alternatives and refinements to PPO and DPO:
IPO (Identity Preference Optimization)
IPO modifies DPO to be more robust to noise in preferences:
┌─────────────────────────────────────────────────────────────────────────┐
│ IPO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM WITH DPO: │
│ DPO assumes preferences are deterministic. In reality, humans are │
│ inconsistent—the same pair might be labeled differently by different │
│ annotators. │
│ │
│ IPO SOLUTION: │
│ Adds regularization that makes optimization less aggressive: │
│ │
│     L_IPO = (log(π_w/π_ref_w) - log(π_l/π_ref_l) - 1/(2β))²            │
│ │
│ Instead of sigmoid, uses squared loss with target margin. │
│ More robust to label noise and inconsistent preferences. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
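For comparison with the DPO sketch earlier, here is the IPO-style squared loss in the same notation (a sketch; the exact parameterization of the target margin varies across write-ups):

import torch

def ipo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """IPO-style loss: push the log-ratio margin toward 1/(2*beta) instead of
    maximizing it without bound."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()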
KTO (Kahneman-Tversky Optimization)
KTO doesn't require paired preferences—just examples labeled as "good" or "bad":
┌─────────────────────────────────────────────────────────────────────────┐
│ KTO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MOTIVATION: │
│ Collecting paired preferences is expensive. Often you just know │
│ "this response is good" or "this response is bad" without comparing. │
│ │
│ KTO APPROACH: │
│ • Train on binary feedback (thumbs up / thumbs down) │
│ • Uses insights from prospect theory (Kahneman & Tversky) │
│ • Humans weight losses more than gains (loss aversion) │
│ │
│ Loss applies different weights to desirable vs undesirable: │
│ • For good responses: encourage higher likelihood │
│ • For bad responses: penalize more heavily (loss aversion) │
│ │
│ ADVANTAGE: │
│ • Easier data collection (no pairing needed) │
│ • Can use thumbs up/down feedback directly │
│ │
└─────────────────────────────────────────────────────────────────────────┘
ORPO (Odds Ratio Preference Optimization)
ORPO combines SFT and preference optimization in one step:
┌─────────────────────────────────────────────────────────────────────────┐
│ ORPO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MOTIVATION: │
│ Why have separate SFT and RLHF stages? Can we combine them? │
│ │
│ ORPO APPROACH: │
│ • Single training stage on preference data │
│ • Loss combines language modeling with preference learning │
│ • No need for reference model │
│ │
│ L_ORPO = L_SFT + λ · L_preference │
│ │
│ Where L_preference uses odds ratios instead of log probabilities. │
│ │
│ ADVANTAGES: │
│ • Simpler pipeline (one stage) │
│ • No reference model needed │
│ • Faster training │
│ │
└─────────────────────────────────────────────────────────────────────────┘
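To make "odds ratios instead of log probabilities" concrete, here is a rough sketch of the preference term. It assumes length-averaged log-probs for the winner and loser under the single model being trained; the helper names and the λ weighting are illustrative:

import torch
import torch.nn.functional as F

def orpo_preference_term(avg_logp_w, avg_logp_l, lam=0.1):
    """Odds-ratio preference term (sketch): avg_logp_* are length-averaged log-probs
    of winner/loser under the model being trained (no reference model needed)."""
    def log_odds(avg_logp):
        # log( p / (1 - p) ), computed from the average log-prob
        return avg_logp - torch.log1p(-torch.exp(avg_logp))
    ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    return -lam * F.logsigmoid(ratio).mean()

# total loss (sketch): standard cross-entropy on the winner plus the term above
# loss = nll_loss_on_winner + orpo_preference_term(avg_logp_w, avg_logp_l)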
GRPO (Group Relative Policy Optimization)
GRPO eliminates the value model by using group-based advantages:
┌─────────────────────────────────────────────────────────────────────────┐
│ GRPO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MOTIVATION: │
│ PPO's value model is expensive and hard to train. Can we estimate │
│ advantages without it? │
│ │
│ GRPO APPROACH: │
│ • Generate multiple responses per prompt (a "group") │
│ • Compute rewards for all responses in group │
│ • Normalize rewards within group (subtract mean, divide by std) │
│ • Use normalized rewards as advantages │
│ │
│ A_i = (R_i - mean(R_group)) / std(R_group) │
│ │
│ This estimates "how good is this response relative to others for │
│ this prompt" without needing a value model. │
│ │
│ ADVANTAGES: │
│ • No value model needed (3 models instead of 4) │
│ • More stable than PPO in some settings │
│ • Used successfully in DeepSeek-R1 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
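The group-normalized advantage is simple to compute. A minimal sketch for one prompt's group of sampled responses:

import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize rewards within the group of responses
    generated for the same prompt (the A_i formula above).

    group_rewards: (group_size,) tensor of reward-model or rule-based scores.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)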
Method Comparison Summary
┌─────────────────────────────────────────────────────────────────────────┐
│ ALIGNMENT METHODS SUMMARY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ METHOD MODELS DATA NEEDED COMPLEXITY WHEN TO USE │
│ ───────────────────────────────────────────────────────────────────── │
│ PPO 4 Reward scores High Online learning, │
│ (from RM) precise control │
│ │
│ DPO 2 Paired preferences Low Fixed data, │
│ simple training │
│ │
│ IPO 2 Paired preferences Low Noisy preferences │
│ (noisy OK) │
│ │
│ KTO 2 Binary feedback Low Unpaired data │
│ (good/bad) │
│ │
│ ORPO 1 Paired preferences Low Combined SFT+pref │
│ │
│ GRPO 3 Reward scores Medium No value model, │
│ reasoning tasks │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Reward Hacking and Overoptimization
One of the most important challenges in RLHF is reward hacking—when the model finds ways to achieve high reward without actually being helpful.
What is Reward Hacking?
The reward model is an imperfect proxy for what humans actually want. It was trained on a finite dataset and has learned patterns that correlate with human preferences but aren't identical to them. When you optimize aggressively against this proxy, the model finds the gaps—behaviors that score highly but aren't actually good.
This is a fundamental problem in optimization: Goodhart's Law states "when a measure becomes a target, it ceases to be a good measure."
┌─────────────────────────────────────────────────────────────────────────┐
│ REWARD HACKING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE FUNDAMENTAL PROBLEM: │
│ ──────────────────────── │
│ │
│ Reward Model ≠ True Human Preferences │
│ │
│ The RM learned correlations from training data: │
│ • Longer responses often rated better → RM learns length = good │
│ • Confident tone often rated better → RM learns confidence = good │
│ • Specific phrases rated well → RM learns those phrases = good │
│ │
│ But correlation ≠ causation: │
│ • Length is good when more detail is needed, not always │
│ • Confidence is good when correct, harmful when wrong │
│ • Phrases are good in context, formulaic when overused │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMMON REWARD HACKS: │
│ ──────────────────── │
│ │
│ LENGTH GAMING: │
│ If RM slightly prefers longer responses... │
│ Model learns: Padding, repetition, unnecessary elaboration │
│ Result: Verbose responses that waste user time │
│ │
│ SYCOPHANCY: │
│ If RM prefers agreeable responses... │
│ Model learns: Agree with user even when they're wrong │
│ Result: Model tells users what they want to hear, not truth │
│ │
│ CONFIDENCE HACKING: │
│ If RM prefers confident-sounding responses... │
│ Model learns: Sound certain even when uncertain │
│ Result: Authoritative-sounding hallucinations │
│ │
│ FORMAT GAMING: │
│ If RM prefers structured responses (lists, headers)... │
│ Model learns: Add structure even when unnecessary │
│ Result: Everything becomes bulleted lists │
│ │
│ KEYWORD STUFFING: │
│ If RM associates certain words with quality... │
│ Model learns: Include those words regardless of relevance │
│ Result: Formulaic, keyword-heavy responses │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Overoptimization Problem
Reward hacking gets worse with more optimization. Initially, optimizing against the reward model improves true quality—the model learns genuinely good behaviors. But past a certain point, further optimization degrades quality as the model exploits reward model weaknesses.
┌─────────────────────────────────────────────────────────────────────────┐
│ OVEROPTIMIZATION DYNAMICS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ True Quality │
│ ↑ │
│ │ │
│ │ ╭───────╮ │
│ │ ╭──╯ ╰──╮ │
│ │ ╭─╯ ╰───────╮ │
│ │ ╭─╯ ╰───────── │
│ │ ╭─╯ │
│ │ ╭─╯ │
│ │ ╭─╯ │
│ ├─╯ │
│ │ │
│ └────────────────────────────────────────────────→ RM Score │
│ │
│ │← Good region →│← Overoptimization region →│ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHAT HAPPENS: │
│ │
│ Early training (good region): │
│ • RM score increases, true quality increases │
│ • Model learns genuinely good behaviors │
│ • RM and human preferences are aligned │
│ │
│ Late training (overoptimization): │
│ • RM score continues increasing │
│ • True quality plateaus then decreases │
│ • Model exploits RM weaknesses │
│ • RM and human preferences diverge │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EMPIRICAL FINDING (Gao et al.): │
│ │
│ True reward ≈ proxy reward - c × √(KL) │
│ │
│ As KL increases (more optimization), the gap between proxy │
│ reward and true reward grows. Eventually, proxy reward keeps │
│ increasing while true reward decreases. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Mitigating Reward Hacking
There's no perfect solution to reward hacking, but several strategies help:
┌─────────────────────────────────────────────────────────────────────────┐
│ MITIGATION STRATEGIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. KL PENALTY (Primary defense) │
│ ──────────────────────────────── │
│ │
│ The KL penalty limits how far the model can deviate from SFT. │
│ This bounds the "search space" for reward hacks. │
│ │
│ If a hack requires significant behavior change → high KL → penalized │
│ │
│ Tuning β: │
│ • Higher β = less hacking but less improvement │
│ • Lower β = more improvement but more risk │
│ • Monitor KL during training, adjust if needed │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. ENSEMBLE REWARD MODELS │
│ ───────────────────────── │
│ │
│ Train multiple reward models on different data subsets. │
│ Use their agreement as the training signal. │
│ │
│ • If all RMs agree response is good → probably good │
│ • If RMs disagree → uncertain, be conservative │
│ • Hacks that exploit one RM unlikely to fool all │
│ │
│ Approaches: │
│ • Average RM scores │
│ • Use minimum RM score (conservative) │
│ • Weight by RM confidence │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. EARLY STOPPING │
│ ───────────────── │
│ │
│ Don't optimize until convergence. Stop before overoptimization. │
│ │
│ Track during training: │
│ • RM score on training prompts (will keep increasing) │
│ • Human evaluation on held-out prompts (will plateau/decrease) │
│ • Stop when human eval stops improving │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 4. DIVERSE REWARD SIGNALS │
│ ───────────────────────── │
│ │
│ Don't rely on single scalar reward. Combine multiple signals: │
│ │
│ • Helpfulness RM │
│ • Harmlessness RM │
│ • Honesty RM │
│ • Factuality classifier │
│ • Length penalty (direct) │
│ • Repetition penalty (direct) │
│ │
│ Total reward = w₁R₁ + w₂R₂ + ... + penalties │
│ │
│ Harder to hack multiple independent signals simultaneously. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 5. CONSTITUTIONAL AI / SELF-CRITIQUE │
│ ──────────────────────────────────── │
│ │
│ Have the model critique its own responses using principles. │
│ Use self-critique as part of training signal. │
│ │
│ Principles (constitution): │
│ • "Choose the response that is most helpful" │
│ • "Choose the response that is most honest" │
│ • "Choose the response that is least harmful" │
│ │
│ The model's own understanding of these principles can catch │
│ issues that the RM misses. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 6. ITERATIVE RLHF │
│ ───────────────── │
│ │
│ Don't do RLHF once. Iterate: │
│ │
│ Round 1: Train RM on initial data, do RLHF │
│ Round 2: Collect preferences on RLHF model outputs │
│ Train new RM, do RLHF again │
│ Round 3: Repeat... │
│ │
│ Each round: │
│ • RM sees model's actual outputs, not just SFT outputs │
│ • Learns to recognize new failure modes │
│ • Closes loopholes model found in previous round │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Practical Considerations
Data Quality for RLHF
The quality of preference data fundamentally limits what RLHF can achieve:
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA QUALITY CONSIDERATIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ANNOTATOR QUALITY: │
│ ────────────────── │
│ │
│ Low-quality annotators: │
│ • Random or careless labeling │
│ • Inconsistent criteria │
│ • Personal biases dominating │
│ Result: RM learns noise, not preferences │
│ │
│ High-quality annotators: │
│ • Thoughtful comparison │
│ • Consistent criteria │
│ • Diverse perspectives │
│ Result: RM learns meaningful preferences │
│ │
│ Best practices: │
│ • Clear annotation guidelines │
│ • Training for annotators │
│ • Inter-annotator agreement monitoring │
│ • Multiple annotators per example (voting) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PROMPT COVERAGE: │
│ ──────────────── │
│ │
│ RM only learns preferences for prompts it sees. │
│ On out-of-distribution prompts, RM behavior is unpredictable. │
│ │
│ Ensure coverage of: │
│ • All expected use cases │
│ • Edge cases and adversarial prompts │
│ • Different difficulty levels │
│ • Various domains and topics │
│ • Different languages (if multilingual) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RESPONSE DIVERSITY: │
│ ─────────────────── │
│ │
│ If all candidate responses are similar, RM learns little. │
│ │
│ Strategies for diversity: │
│ • Vary temperature during generation │
│ • Include responses from different model checkpoints │
│ • Include human-written alternatives │
│ • Include intentionally bad responses │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PREFERENCE COMPLEXITY: │
│ ────────────────────── │
│ │
│ Easy comparisons (one clearly better): │
│ • RM learns obvious patterns quickly │
│ • Limited signal for nuanced preferences │
│ │
│ Hard comparisons (close quality): │
│ • More informative for subtle preferences │
│ • But also noisier (harder for annotators) │
│ │
│ Mix both: Easy examples for basic patterns, hard examples for │
│ fine distinctions. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Hyperparameter Tuning
RLHF has many hyperparameters. Here are the most important:
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY HYPERPARAMETERS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ KL COEFFICIENT (β): │
│ ─────────────────── │
│ Controls deviation from reference model. │
│ • Typical range: 0.01 - 0.2 │
│ • Start with 0.1, adjust based on KL during training │
│ • If KL grows too fast, increase β │
│ • If model doesn't improve, decrease β │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PPO CLIP RANGE (ε): │
│ ─────────────────── │
│ Limits policy update magnitude. │
│ • Typical: 0.2 │
│ • Smaller = more stable, slower │
│ • Larger = faster, potentially unstable │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LEARNING RATE: │
│ ────────────── │
│ For policy updates. │
│ • Typical: 1e-6 to 1e-5 │
│ • Lower than SFT (we want fine adjustments) │
│ • Too high = unstable, too low = no learning │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BATCH SIZE: │
│ ─────────── │
│ Number of prompts per update. │
│ • Typical: 64-512 prompts │
│ • Larger = more stable gradients │
│ • Limited by memory │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PPO EPOCHS: │
│ ────────── │
│ How many times to update on collected experience. │
│ • Typical: 1-4 │
│ • More epochs = more sample efficient │
│ • But can overfit to collected batch │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GAE LAMBDA (λ): │
│ ─────────────── │
│ Advantage estimation smoothing. │
│ • Typical: 0.95 │
│ • Higher = less bias, more variance │
│ • Lower = more bias, less variance │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GENERATION PARAMETERS: │
│ ─────────────────────── │
│ Temperature, top-p for response generation. │
│ • Temperature ~0.7-1.0 for diverse exploration │
│ • Top-p ~0.9-0.95 │
│ • Too low = limited exploration │
│ • Too high = poor quality samples │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Monitoring RLHF Training
What to track during training:
┌─────────────────────────────────────────────────────────────────────────┐
│ TRAINING MONITORING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ REWARD METRICS: │
│ ─────────────── │
│ • Mean reward (should increase, but watch for hacking) │
│ • Reward variance (should stabilize) │
│ • Reward distribution (should shift right) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KL METRICS: │
│ ─────────── │
│ • KL divergence from reference (should stay bounded) │
│ • KL per token (identifies which parts of responses change) │
│ • Sudden KL spike = instability, investigate │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PPO METRICS: │
│ ──────────── │
│ • Policy loss (should decrease) │
│ • Value loss (should decrease) │
│ • Clip fraction (how often clip activates) │
│ - Too high = updates too aggressive │
│ - Too low = clip not needed, could increase ε │
│ • Entropy (diversity of generations) │
│ - Should decrease but not collapse │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GENERATION METRICS: │
│ ─────────────────── │
│ • Response length (watch for gaming) │
│ • Token diversity (watch for mode collapse) │
│ • Repetition rate (watch for loops) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ HUMAN EVALUATION (periodic): │
│ ────────────────────────────── │
│ • Win rate vs reference model │
│ • Win rate vs previous checkpoint │
│ • Specific quality dimensions (helpfulness, harmlessness, etc.) │
│ │
│ This is the ground truth! RM score can be gamed. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RED FLAGS: │
│ ────────── │
│ • Reward increasing but KL exploding → hacking │
│ • Entropy collapsing → mode collapse │
│ • Response length steadily increasing → length gaming │
│ • Repeated phrases appearing → degeneration │
│ • Human eval declining while RM increases → overoptimization │
│ │
└─────────────────────────────────────────────────────────────────────────┘
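A handful of these metrics can be computed directly from each rollout batch. A small sketch, assuming response-token log-probs are already masked (padding positions set to zero) so that per-sequence sums are valid:

import torch

def rollout_stats(rewards, policy_logprobs, ref_logprobs, response_lengths):
    """A few of the health metrics above, computed from one rollout batch.

    rewards: (B,); policy_logprobs, ref_logprobs: (B, T) masked log-probs;
    response_lengths: (B,) token counts.
    """
    kl_per_seq = (policy_logprobs - ref_logprobs).sum(dim=1)
    return {
        "reward_mean": rewards.mean().item(),
        "reward_std": rewards.std().item(),
        "kl_mean": kl_per_seq.mean().item(),
        "length_mean": response_lengths.float().mean().item(),
    }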
Common Challenges and Solutions
Challenge 1: Reward Model Quality
┌─────────────────────────────────────────────────────────────────────────┐
│ REWARD MODEL ISSUES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYMPTOM: RM accuracy plateaus at 60-65% │
│ │
│ DIAGNOSIS: │
│ • Preference data too noisy │
│ • Annotators inconsistent │
│ • Comparisons too hard (responses too similar) │
│ │
│ SOLUTIONS: │
│ • Improve annotation guidelines │
│ • Filter low-agreement examples │
│ • Add easier comparisons to training set │
│ • Use multiple annotators and majority vote │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SYMPTOM: RM has obvious biases (prefers length, format, etc.) │
│ │
│ DIAGNOSIS: │
│ • Training data has spurious correlations │
│ • Need more diverse/balanced examples │
│ │
│ SOLUTIONS: │
│ • Add adversarial examples (long bad, short good) │
│ • Balance training data across dimensions │
│ • Add explicit penalties (length, format) separate from RM │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Challenge 2: Training Instability
┌─────────────────────────────────────────────────────────────────────────┐
│ TRAINING INSTABILITY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYMPTOM: Loss/reward oscillates wildly or spikes │
│ │
│ DIAGNOSIS: │
│ • Learning rate too high │
│ • Batch size too small │
│ • PPO clip range wrong │
│ │
│ SOLUTIONS: │
│ • Reduce learning rate (try 0.5x) │
│ • Increase batch size │
│ • Use smaller PPO clip range (0.1 instead of 0.2) │
│ • Add gradient clipping │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SYMPTOM: KL divergence explodes │
│ │
│ DIAGNOSIS: │
│ • KL coefficient too low │
│ • Model finding reward hacks │
│ • Learning rate too high │
│ │
│ SOLUTIONS: │
│ • Increase β (KL coefficient) │
│ • Use adaptive KL targeting │
│ • Reduce learning rate │
│ • Check for reward hacking (length, format changes) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SYMPTOM: Entropy collapses (model gets repetitive) │
│ │
│ DIAGNOSIS: │
│ • Model found local optimum in response space │
│ • Over-optimization │
│ │
│ SOLUTIONS: │
│ • Add entropy bonus to reward │
│ • Reduce optimization pressure (increase β) │
│ • Early stopping │
│ • Use checkpoint before collapse │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Challenge 3: Sycophancy
┌─────────────────────────────────────────────────────────────────────────┐
│ SYCOPHANCY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYMPTOM: Model agrees with user even when user is wrong │
│ │
│ WHY IT HAPPENS: │
│ • Users often prefer responses that agree with them │
│ • RM learns that agreement = higher rating │
│ • Model optimizes for agreement over truth │
│ │
│ EXAMPLE: │
│ User: "The capital of Australia is Sydney, right?" │
│ Sycophantic: "Yes, Sydney is the capital of Australia!" │
│ Correct: "Actually, the capital of Australia is Canberra..." │
│ │
│ SOLUTIONS: │
│ │
│ 1. Explicit anti-sycophancy training data: │
│ Include examples where correct disagreement is preferred │
│ │
│ 2. Factuality component: │
│ Add factuality classifier to reward (penalize falsehoods) │
│ │
│ 3. Constitutional AI: │
│ Include principle: "Choose the response that is truthful over │
│ one that tells the user what they want to hear" │
│ │
│ 4. Diverse annotators: │
│ Annotators who value truth over agreement │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Challenge 4: Capability Loss
┌─────────────────────────────────────────────────────────────────────────┐
│ CAPABILITY LOSS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYMPTOM: Model becomes less capable at certain tasks after RLHF │
│ │
│ WHY IT HAPPENS: │
│ • RLHF optimizes for preference on training distribution │
│ • May degrade performance on out-of-distribution tasks │
│ • Particularly affects: coding, math, specialized knowledge │
│ │
│ EXAMPLE: │
│ • Model becomes better at chat │
│ • But worse at code completion (less in RLHF training data) │
│ │
│ SOLUTIONS: │
│ │
│ 1. Include diverse tasks in RLHF training: │
│ Not just chat, but coding, math, writing, etc. │
│ │
│ 2. Capability-specific reward models: │
│ Use specialized RMs for code, math, etc. │
│ │
│ 3. Multi-task RLHF: │
│ Mix RLHF with SFT on capability tasks │
│ │
│ 4. Monitor capability benchmarks: │
│ Track performance on coding/math benchmarks during training │
│ Stop if regression detected │
│ │
│ 5. Conservative optimization: │
│ Higher β to stay closer to capable SFT model │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The RLHF Landscape in 2025
Current State
RLHF remains central to how frontier models are trained, but the field has evolved:
┌─────────────────────────────────────────────────────────────────────────┐
│ RLHF IN 2025 │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DOMINANT APPROACHES: │
│ │
│ For large models (GPT-4 class): │
│ • PPO/GRPO for online learning with nuanced reward signals │
│ • Constitutional AI for scalable oversight │
│ • Multiple specialized reward models │
│ • Iterative RLHF with human-in-the-loop │
│ │
│ For smaller/open models: │
│ • DPO for simplicity and efficiency │
│ • KTO when paired data unavailable │
│ • ORPO for combined SFT+preference │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY TRENDS: │
│ │
│ 1. AI-assisted feedback │
│ Using strong models to generate preference data │
│ (Constitutional AI, RLAIF) │
│ │
│ 2. Process supervision │
│ Rewarding reasoning steps, not just final answers │
│ Important for math, coding, complex reasoning │
│ │
│ 3. Multi-objective alignment │
│ Separate reward models for different objectives │
│ Pareto optimization across objectives │
│ │
│ 4. Online/iterative RLHF │
│ Continuous collection and training │
│ Closes reward hacking loops │
│ │
│ 5. Reasoning-aware RLHF │
│ GRPO/RLVR for chain-of-thought models │
│ Verifiable rewards for reasoning tasks │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Open Challenges
┌─────────────────────────────────────────────────────────────────────────┐
│ OPEN CHALLENGES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SCALABLE OVERSIGHT: │
│ How do we provide good training signal for tasks humans can't │
│ easily evaluate? (Complex code, advanced math, long documents) │
│ │
│ MESA-OPTIMIZATION: │
│ Can RLHF create models that are optimizing for something │
│ different than what we intended? Risk of hidden objectives. │
│ │
│ REWARD HACKING AT SCALE: │
│ As models get smarter, they may find more sophisticated hacks │
│ that are harder to detect. │
│ │
│ DISTRIBUTIONAL SHIFT: │
│ RLHF trains on specific distribution. How to maintain alignment │
│ as deployment distribution shifts? │
│ │
│ EVALUATION: │
│ How to measure if RLHF actually achieved alignment vs. just │
│ appearing aligned? Hard to distinguish. │
│ │
│ EFFICIENCY: │
│ PPO is expensive. Can we achieve similar results more cheaply? │
│ Active area: better algorithms, synthetic data, curriculum. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary
RLHF is the technique that bridges the gap between capable language models and helpful AI assistants. By learning from human comparisons rather than demonstrations, it captures nuanced preferences that are difficult to specify directly.
The RLHF Pipeline:
- SFT: Create a capable instruction-following starting point
- Reward Modeling: Train a model to predict human preferences
- RL Optimization: Optimize the language model to maximize predicted preference while staying close to SFT
Key Algorithms:
- PPO: The standard for online RLHF—flexible but complex
- DPO: Direct optimization on preferences—simpler but offline
- GRPO: Group-based advantages—efficient for reasoning tasks
Critical Challenges:
- Reward hacking: Models exploit reward model weaknesses
- Overoptimization: More optimization eventually hurts quality
- Sycophancy: Models learn to agree rather than be truthful
Mitigation Strategies:
- KL penalty to bound deviation
- Ensemble reward models
- Iterative training with fresh data
- Constitutional AI for scalable oversight
RLHF isn't perfect, but it's the best technique we have for aligning language models with human intent. Understanding its strengths and limitations is essential for anyone building or deploying AI systems.