RLHF Complete Guide: Aligning LLMs with Human Preferences
A comprehensive deep dive into Reinforcement Learning from Human Feedback—from reward modeling to PPO to DPO. Understanding how AI assistants learn to be helpful, harmless, and honest.
What is RLHF and Why Does It Matter?
Reinforcement Learning from Human Feedback (RLHF) is the technique that transforms capable but unpredictable language models into helpful, aligned AI assistants. It's the secret sauce that made ChatGPT feel different from GPT-3, and it remains central to how the best AI systems are trained.
2025: The post-PPO era of alignment. "The era of treating PPO as the only tool for RLHF is over. The movement that DPO started—towards simpler, more stable, and more direct methods—is reaching maturity."
The modern alignment toolkit:
- DPO (Direct Preference Optimization): Treats alignment as classification; no reward model needed. Far less computational overhead than PPO and easier to tune.
- GRPO (Group Relative Policy Optimization): Eliminates the critic model by generating a group of answers per prompt and using their relative ranking. Used by DeepSeek-R1 for math and coding reasoning.
- REINFORCE++: A streamlined REINFORCE variant; results from Logic-RL and PRIME suggest it is more stable than GRPO and faster than PPO.
DeepSeek-R1 confirmed the trend: it combines rule-based math/code rewards with preference rewards for open-ended tasks at a fraction of traditional RLHF budgets, and token-length regularization ("TLDR") dynamically shrinks chains-of-thought without hurting accuracy.
The Alignment Problem
Pre-trained language models are impressive but problematic. They've learned from the internet—which contains helpful information, but also misinformation, harmful content, and countless examples of unhelpful responses. A model trained purely to predict text will happily:
- Generate plausible-sounding misinformation
- Continue harmful or toxic content
- Refuse to answer when it shouldn't
- Answer when it should refuse
- Be unnecessarily verbose or terse
- Ignore what the user actually wants
Supervised Fine-Tuning (SFT) helps by showing the model examples of good responses. But SFT has a fundamental limitation: you can only train on what you can demonstrate. It's easy to write a good response, but how do you train a model to know which of two responses is better? How do you encode subtle preferences like "be confident but not overconfident" or "be helpful but know your limits"?
This is where RLHF comes in. Instead of learning from demonstrations, the model learns from comparisons—judgments about which response is better. These comparisons encode nuanced human preferences that are difficult to demonstrate directly.
The RLHF Insight
The key insight of RLHF is that it's easier to judge quality than to produce it. Consider:
- Writing a perfect essay is hard; ranking two essays by quality is easier
- Composing a helpful response is hard; choosing the more helpful of two responses is easier
- Defining "good" in words is hard; recognizing "better" when you see it is easier
RLHF leverages this asymmetry. Humans provide comparative judgments, a reward model learns to predict those judgments, and then the language model is optimized to produce responses the reward model rates highly.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE RLHF PARADIGM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL SUPERVISED LEARNING: │
│ ──────────────────────────────── │
│ Human: "Here's the correct answer" │
│ Model: Learns to reproduce that answer │
│ │
│ Problem: Can only learn from explicit demonstrations │
│ Problem: One "right answer" doesn't capture preference nuances │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RLHF APPROACH: │
│ ────────────── │
│ Human: "Response A is better than Response B" │
│ Reward Model: Learns to predict which response humans prefer │
│ Policy Model: Learns to generate responses the reward model likes │
│ │
│ Advantage: Captures nuanced preferences through comparison │
│ Advantage: Can improve beyond demonstrated examples │
│ Advantage: Aligns with what humans actually want │
│ │
└─────────────────────────────────────────────────────────────────────────┘
What RLHF Achieves
RLHF enables training for qualities that are difficult to specify directly:
Helpfulness: The model learns to actually address user needs, not just generate plausible text.
Harmlessness: The model learns to refuse harmful requests while remaining helpful for legitimate ones.
Honesty: The model learns to express uncertainty, acknowledge limitations, and avoid confident-sounding hallucinations.
Tone and Style: The model learns subtle stylistic preferences—conversational but professional, confident but not arrogant.
Following Instructions: The model learns to actually do what users ask, including respecting formatting requests and constraints.
Knowing When to Stop: The model learns appropriate response length—comprehensive when needed, concise when appropriate.
The Complete RLHF Pipeline
RLHF is not a single technique but a multi-stage pipeline. Each stage builds on the previous one:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE RLHF PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: SUPERVISED FINE-TUNING (SFT) │
│ ───────────────────────────────────── │
│ │
│ ┌─────────────┐ ┌─────────────────────────────────────────────┐ │
│ │ Base Model │ ──→ │ Train on (prompt, good_response) pairs │ │
│ │ (Pre-trained)│ │ Standard supervised learning │ │
│ └─────────────┘ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ SFT Model │ │
│ └──────┬──────┘ │
│ │ │
│ ─────────────────────────────────────────┼─────────────────────────── │
│ │ │
│ STAGE 2: REWARD MODEL TRAINING │ │
│ ────────────────────────────── │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Generate multiple responses per prompt using SFT model │ │
│ │ Have humans rank/compare responses │ │
│ │ Train reward model to predict human preferences │ │
│ └──────────────────────┬──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Reward Model │ │
│ │ R(prompt, resp)│ │
│ └────────┬────────┘ │
│ │ │
│ ───────────────────────┼───────────────────────────────────────────── │
│ │ │
│ STAGE 3: RL OPTIMIZATION │
│ ──────────────────────── │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Use RL (PPO) to optimize SFT model to maximize reward │ │
│ │ KL penalty prevents diverging too far from SFT model │ │
│ │ Iterate: generate → score → update → repeat │ │
│ └──────────────────────┬──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ RLHF Model │ │
│ │ (Final) │ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Let's examine each stage in detail.
Stage 1: Supervised Fine-Tuning (SFT)
Before RLHF can begin, you need a model that can generate reasonable responses. This is where SFT comes in.
Why SFT First?
You might wonder: why not apply RLHF directly to the base model? Several reasons:
Initialization matters: RL optimization is sensitive to where you start. A base model produces completions, not responses—RLHF would struggle to find the right direction from such a starting point.
Efficiency: SFT is more sample-efficient than RL for learning basic instruction-following. It's faster to learn "respond helpfully" from demonstrations than from trial and error.
Stability: Starting RL from a reasonable policy (SFT model) produces more stable training than starting from a random policy (base model).
Exploration: The SFT model provides a good prior for exploration during RL. Without it, the RL agent might explore irrelevant parts of response space.
SFT Training for RLHF
SFT for RLHF has some specific considerations:
Response diversity: The SFT model will be used to generate candidate responses for comparison. If it only produces one type of response, the reward model won't learn to distinguish quality. Train on diverse response styles.
Avoiding over-optimization: The SFT model becomes the "reference" for KL penalty in RLHF. If SFT is over-optimized on narrow data, the RLHF model can't deviate much without penalty. Train broadly.
Quality over quantity: The SFT model sets the baseline. Better SFT = better starting point for RLHF = better final model. Invest in high-quality SFT data.
┌─────────────────────────────────────────────────────────────────────────┐
│ SFT FOR RLHF: KEY CONSIDERATIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING DATA CHARACTERISTICS: │
│ │
│ ✓ Diverse topics and instruction types │
│ ✓ Various response styles (concise, detailed, formal, casual) │
│ ✓ Mix of easy and challenging queries │
│ ✓ Responses that model good behavior across dimensions │
│ │
│ ✗ Avoid: Narrow, repetitive response patterns │
│ ✗ Avoid: Only perfect responses (include "good enough" examples) │
│ ✗ Avoid: Over-optimizing on specific metrics │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY DIVERSITY MATTERS: │
│ │
│ If SFT model always responds in one style: │
│ • Reward model only sees that style │
│ • Can't learn to prefer one style over another │
│ • RLHF has limited room to improve │
│ │
│ If SFT model has diverse responses: │
│ • Reward model sees variety │
│ • Learns nuanced preferences │
│ • RLHF can meaningfully optimize │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Stage 2: Reward Modeling
The reward model is the heart of RLHF. It learns to predict human preferences, providing the signal that guides model optimization.
What is a Reward Model?
A reward model is a function that takes a (prompt, response) pair and outputs a scalar score indicating quality:
R(prompt, response) → scalar score
Higher scores indicate responses that humans would prefer. The reward model learns this mapping from human comparison data.
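As a concrete sketch, a reward model is usually just a causal transformer trunk with a scalar head on top. The PyTorch snippet below is a minimal illustration under the assumption that the backbone returns last_hidden_state; the names RewardModel and value_head are ours, not a specific library's API.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Minimal sketch: a transformer backbone plus a scalar value head."""

    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.backbone = base_model                       # any causal LM trunk (assumed)
        self.value_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        # hidden: (batch, seq_len, hidden_size); assumes backbone exposes last_hidden_state
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # score the last non-padding token of each sequence
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # (batch,) scalar rewards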
Collecting Comparison Data
The standard approach for collecting training data for the reward model:
┌─────────────────────────────────────────────────────────────────────────┐
│ COMPARISON DATA COLLECTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: Sample prompts from your target distribution │
│ ───────────────────────────────────────────────────── │
│ • Real user queries (if available) │
│ • Synthetic prompts covering target use cases │
│ • Red-team prompts for safety training │
│ │
│ STEP 2: Generate multiple responses per prompt │
│ ───────────────────────────────────────────────── │
│ • Use SFT model to generate 2-8 responses per prompt │
│ • Vary temperature/sampling to get diversity │
│ • Optionally include responses from different model versions │
│ │
│ STEP 3: Have humans compare responses │
│ ─────────────────────────────────────── │
│ │
│ Prompt: "Explain quantum entanglement simply" │
│ │
│ Response A: "Quantum entanglement is when two particles..." │
│ Response B: "It's like having two magic coins that always..." │
│ │
│ Human judgment: B > A (better analogy for "simply") │
│ │
│ COMPARISON FORMATS: │
│ ────────────────── │
│ • Binary: A > B or B > A │
│ • With ties: A > B, B > A, or A ≈ B │
│ • Ranking: Order all K responses by preference │
│ • Rating: Score each response 1-5 (then derive comparisons) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Training the Reward Model
The reward model is typically initialized from the SFT model or a similar pretrained model. It's trained using the Bradley-Terry model of preferences:
┌─────────────────────────────────────────────────────────────────────────┐
│ REWARD MODEL TRAINING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ARCHITECTURE: │
│ ───────────── │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Reward Model │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ prompt │ │ Transformer │ │ Linear │ │ │
│ │ │ + response │ ──→ │ Encoder │ ──→ │ Head │ ──→ R │ │
│ │ │ tokens │ │ (from SFT) │ │ (scalar out)│ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ The final hidden state (or mean pooled) is projected to scalar │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LOSS FUNCTION (Bradley-Terry): │
│ ────────────────────────────── │
│ │
│ For a comparison where response_w (winner) > response_l (loser): │
│ │
│ Loss = -log(σ(R(prompt, response_w) - R(prompt, response_l))) │
│ │
│ Where σ is the sigmoid function. │
│ │
│ Intuition: Maximize the probability that the winner scores higher │
│ than the loser. The sigmoid converts score difference to probability. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ HANDLING RANKINGS: │
│ ────────────────── │
│ │
│ If humans rank K responses: r₁ > r₂ > r₃ > ... > rₖ │
│ │
│ Convert to pairwise comparisons: │
│ • r₁ > r₂, r₁ > r₃, ..., r₁ > rₖ │
│ • r₂ > r₃, r₂ > r₄, ..., r₂ > rₖ │
│ • ... and so on │
│ │
│ This gives K(K-1)/2 comparisons per ranking. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
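In code, the Bradley-Terry loss above is essentially a one-liner. A minimal PyTorch sketch, assuming a reward-model interface like the earlier snippet and pre-tokenized winner/loser batches:

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_winner, reward_loser):
    """Pairwise loss: -log sigmoid(R_w - R_l), averaged over the batch.

    reward_winner, reward_loser: (batch,) tensors of reward-model scores.
    """
    return -F.logsigmoid(reward_winner - reward_loser).mean()

# usage sketch (reward_model and the tokenized pairs are assumed to exist):
# r_w = reward_model(winner_ids, winner_mask)
# r_l = reward_model(loser_ids, loser_mask)
# loss = bradley_terry_loss(r_w, r_l)
# loss.backward()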
Reward Model Considerations
Size matters: Larger reward models generally produce better signals, but they also slow down RL training (every generation must be scored). Common practice is using a reward model similar in size to the policy model.
Overoptimization: If the policy model is optimized too aggressively against the reward model, it will find "reward hacks"—responses that score highly but aren't actually good. This is a major challenge in RLHF.
Calibration: Reward model scores are relative, not absolute. A score of 2.5 only means "better than things that score 2.0," not "objectively good." Be careful interpreting absolute scores.
Distribution shift: The reward model is trained on SFT model outputs. During RL, the policy model's outputs shift. The reward model may behave unpredictably on out-of-distribution outputs.
Reward Model Evaluation
How do you know if your reward model is good?
┌─────────────────────────────────────────────────────────────────────────┐
│ REWARD MODEL EVALUATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ACCURACY METRICS: │
│ ───────────────── │
│ │
│ Pairwise accuracy: How often does the RM rank correctly? │
│ • On held-out comparisons from same distribution │
│ • On comparisons from different annotators │
│ • On adversarial/edge cases │
│ │
│ Typical accuracy: 65-75% (human agreement is often ~70-80%) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CORRELATION WITH HUMANS: │
│ ───────────────────────── │
│ │
│ • Kendall's tau between RM ranking and human ranking │
│ • Spearman correlation for ordinal scores │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ QUALITATIVE ANALYSIS: │
│ ────────────────────── │
│ │
│ • Does RM prefer helpful responses? │
│ • Does RM penalize harmful content? │
│ • Does RM handle edge cases reasonably? │
│ • Are there obvious failure modes? │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PROXY REWARD VERIFICATION: │
│ ────────────────────────── │
│ │
│ Can a model "hack" the reward? Test with adversarial generations: │
│ • Very long responses (does RM prefer length?) │
│ • Repetitive responses (does RM notice?) │
│ • Confidently wrong responses (does RM penalize?) │
│ • Responses that say what user wants to hear vs. truth │
│ │
└─────────────────────────────────────────────────────────────────────────┘
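Pairwise accuracy is straightforward to compute once you have held-out comparisons. A small sketch, assuming the same reward-model interface as above and a hypothetical batch format:

import torch

@torch.no_grad()
def pairwise_accuracy(reward_model, eval_pairs):
    """Fraction of held-out comparisons where the RM scores the winner above the loser.

    eval_pairs: iterable of dicts with pre-tokenized winner/loser inputs (assumed format).
    """
    correct, total = 0, 0
    for batch in eval_pairs:
        r_w = reward_model(batch["winner_ids"], batch["winner_mask"])
        r_l = reward_model(batch["loser_ids"], batch["loser_mask"])
        correct += (r_w > r_l).sum().item()
        total += r_w.numel()
    return correct / max(total, 1)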
Stage 3: Reinforcement Learning Optimization
With a reward model in hand, we can now optimize the language model using reinforcement learning. The dominant algorithm is Proximal Policy Optimization (PPO).
The RL Formulation
The language model is viewed as a policy in an RL setting:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM AS RL AGENT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ RL TERMINOLOGY → LLM EQUIVALENT │
│ ───────────────────────────────── │
│ │
│ State: The prompt + tokens generated so far │
│ Action: The next token to generate │
│ Policy: The language model π(token | context) │
│ Reward: Reward model score (given at end of response) │
│ Episode: One complete (prompt, response) generation │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE RL OBJECTIVE: │
│ ───────────────── │
│ │
│ Maximize: E[R(prompt, response)] - β · KL(π || π_ref) │
│ │
│ Where: │
│ • E[R] = expected reward from generated responses │
│ • KL(π || π_ref) = divergence from reference (SFT) model │
│ • β = KL penalty coefficient │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY THE KL PENALTY? │
│ ──────────────────── │
│ │
│ Without it: │
│ • Model can drift arbitrarily far from SFT │
│ • May find reward hacks that exploit RM weaknesses │
│ • Can lose language capability in pursuit of reward │
│ • Training becomes unstable │
│ │
│ With KL penalty: │
│ • Model stays close to known-good SFT policy │
│ • Limits ability to exploit RM weaknesses │
│ • Preserves language capability │
│ • More stable training │
│ │
│ The penalty says: "Improve on SFT, but don't go crazy" │
│ │
└─────────────────────────────────────────────────────────────────────────┘
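In practice the KL penalty is applied per token while the reward-model score arrives only at the end of the response. Below is a minimal sketch of that reward shaping, with illustrative names and the common approximation of per-token KL as log π minus log π_ref at the sampled tokens:

import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards for one response: -beta * KL estimate at every token,
    plus the reward-model score added at the final token.

    rm_score: scalar; policy_logprobs, ref_logprobs: (resp_len,) log-probs of the
    sampled tokens under the policy and the frozen reference model.
    """
    kl_per_token = policy_logprobs - ref_logprobs   # simple per-token KL estimate
    rewards = -beta * kl_per_token
    rewards[-1] = rewards[-1] + rm_score            # terminal reward from the RM
    return rewards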
PPO: Proximal Policy Optimization
PPO is the workhorse algorithm for RLHF. Let's understand it deeply.
Why PPO?
RL algorithms face a fundamental tension: you want to improve the policy based on collected experience, but large updates can destabilize training. PPO solves this by limiting how much the policy can change in each update.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE PPO INSIGHT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ NAIVE POLICY GRADIENT: │
│ ────────────────────── │
│ ∇J = E[∇log π(a|s) · A(s,a)] │
│ │
│ Problem: Can make arbitrarily large updates │
│ Result: Training is unstable, can collapse │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRUST REGION METHODS (TRPO): │
│ ──────────────────────────── │
│ Constrain KL(π_new || π_old) < δ │
│ │
│ Problem: Requires expensive second-order optimization │
│ Result: Works well but slow │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PPO SOLUTION: │
│ ───────────── │
│ Use a "clipped" objective that automatically limits updates: │
│ │
│ L = min(r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A) │
│ │
│ Where r(θ) = π_new(a|s) / π_old(a|s) is the probability ratio │
│ │
│ If r(θ) goes outside [1-ε, 1+ε], the gradient is zeroed │
│ This prevents large policy changes automatically │
│ │
│ Result: Stable training with simple first-order optimization │
│ │
└─────────────────────────────────────────────────────────────────────────┘
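The clipped objective translates directly into a few lines of PyTorch. A sketch, assuming per-token log-probs and advantages are precomputed and already masked to response tokens:

import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: maximize min(r*A, clip(r, 1-eps, 1+eps)*A).

    All inputs are per-token tensors of the same shape; returns a loss to minimize.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)   # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()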
PPO Components for RLHF
A complete PPO setup for RLHF requires four models:
┌─────────────────────────────────────────────────────────────────────────┐
│ PPO COMPONENTS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MODEL PURPOSE TRAINED? │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Policy Model The LLM being optimized Yes (main goal) │
│ (Actor) Generates responses │
│ │
│ Reference Model Frozen copy of SFT model No (frozen) │
│ (π_ref) Used for KL penalty │
│ │
│ Reward Model Scores (prompt, response) No (frozen) │
│ Provides training signal Pre-trained │
│ │
│ Value Model Estimates expected return Yes │
│ (Critic) Used for advantage estimation │
│ Reduces variance │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MEMORY REQUIREMENTS (for a 7B model): │
│ │
│ Policy (trainable): ~14 GB + optimizer states │
│ Reference (frozen): ~14 GB │
│ Reward (frozen): ~14 GB │
│ Value (trainable): ~14 GB + optimizer states │
│ │
│ Total: 4 full model copies = very memory intensive! │
│ This is why PPO for RLHF is expensive. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The PPO Training Loop
┌─────────────────────────────────────────────────────────────────────────┐
│ PPO TRAINING LOOP │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FOR each training iteration: │
│ │
│ 1. SAMPLE PROMPTS │
│ ─────────────── │
│ Sample a batch of prompts from your dataset │
│ │
│ 2. GENERATE RESPONSES │
│ ──────────────────── │
│ Use Policy Model to generate responses │
│ (This is the "rollout" or "experience collection" phase) │
│ │
│ 3. COMPUTE REWARDS │
│ ─────────────── │
│ Score each (prompt, response) with Reward Model │
│ Add KL penalty: reward = R(p,r) - β·KL(policy || reference) │
│ │
│ 4. ESTIMATE ADVANTAGES │
│ ──────────────────── │
│ Use Value Model to estimate advantage at each token │
│ A(t) = Σ γⁱ r(t+i) + γⁿ V(s_n) - V(s_t) │
│ (GAE - Generalized Advantage Estimation often used) │
│ │
│ 5. UPDATE POLICY │
│ ───────────── │
│ For multiple epochs on collected experience: │
│ Compute PPO clipped objective │
│ Update Policy Model via gradient descent │
│ │
│ 6. UPDATE VALUE MODEL │
│ ────────────────── │
│ Train Value Model to better predict returns │
│ L_value = (V(s) - returns)² │
│ │
│ 7. LOGGING AND CHECKPOINTING │
│ ───────────────────────── │
│ Track rewards, KL, loss curves │
│ Save checkpoints periodically │
│ │
│ REPEAT until converged or budget exhausted │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Understanding the Advantage Function
The advantage function is crucial for stable PPO training:
┌─────────────────────────────────────────────────────────────────────────┐
│ ADVANTAGE ESTIMATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ WHAT IS ADVANTAGE? │
│ ────────────────── │
│ │
│ A(s, a) = Q(s, a) - V(s) │
│ │
│ Advantage answers: "How much better is this action than average?" │
│ │
│ • A > 0: This action is better than typical → reinforce it │
│ • A < 0: This action is worse than typical → discourage it │
│ • A ≈ 0: This action is about average → little change │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY USE ADVANTAGE (not just reward)? │
│ ────────────────────────────────────── │
│ │
│ Using raw rewards has high variance: │
│ • Some prompts are "easy" (high reward regardless of response) │
│ • Some prompts are "hard" (low reward regardless of response) │
│ • This makes gradient estimates noisy │
│ │
│ Advantage subtracts baseline (value), reducing variance: │
│ • On easy prompts: high reward, high baseline → small advantage │
│ • On hard prompts: low reward, low baseline → small advantage │
│ • Credit is given for being better than expected │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GAE (Generalized Advantage Estimation): │
│ ──────────────────────────────────────── │
│ │
│ GAE balances bias and variance with parameter λ: │
│ │
│ A_GAE = Σ (γλ)ⁱ δ_t+i │
│ │
│ Where δ_t = r_t + γV(s_t+1) - V(s_t) │
│ │
│ λ = 0: Use only one-step TD (low variance, high bias) │
│ λ = 1: Use full Monte Carlo (high variance, low bias) │
│ λ ≈ 0.95: Common choice (good balance) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
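Here is a minimal sketch of GAE for a single response, assuming per-token rewards like those in the reward-shaping snippet and one value estimate per token (γ is often set to 1.0 in RLHF since responses are short episodes):

import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one response.

    rewards: (T,) per-token rewards.
    values:  (T+1,) value estimates, with values[T] the bootstrap value (0 at episode end).
    Returns (advantages, returns), each of shape (T,).
    """
    T = rewards.size(0)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:T]
    return advantages, returns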
The KL Penalty: Staying Grounded
The KL penalty is essential for stable RLHF. Let's understand it deeply:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE KL PENALTY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DEFINITION: │
│ ─────────── │
│ │
│ KL(π || π_ref) = E_π[log(π(y|x)) - log(π_ref(y|x))] │
│ │
│ This measures how much the current policy has diverged from the │
│ reference (SFT) policy. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHAT HAPPENS WITHOUT KL PENALTY: │
│ ───────────────────────────────── │
│ │
│ The model is free to change arbitrarily to maximize reward. │
│ It will find "reward hacks": │
│ │
│ • If RM slightly prefers longer responses: │
│ → Model generates extremely long, repetitive responses │
│ │
│ • If RM slightly prefers confident tone: │
│ → Model becomes overconfident, even when wrong │
│ │
│ • If RM has blind spots: │
│ → Model exploits them relentlessly │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHAT KL PENALTY DOES: │
│ ────────────────────── │
│ │
│ It says: "Any deviation from the SFT policy has a cost" │
│ │
│ Reward = R(prompt, response) - β × KL(π || π_ref) │
│ │
│ • Small deviations: Low penalty, model can improve │
│ • Large deviations: High penalty, model constrained │
│ │
│ The model can only change if the reward improvement exceeds │
│ the KL cost. This prevents runaway optimization. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CHOOSING β (KL coefficient): │
│ ──────────────────────────── │
│ │
│ β too low: │
│ • Model diverges too much from SFT │
│ • May find reward hacks │
│ • Unstable training │
│ │
│ β too high: │
│ • Model can barely change from SFT │
│ • Limited improvement possible │
│ • Wasted RL compute │
│ │
│ Typical range: β = 0.01 - 0.2 │
│ │
│ Adaptive β: Some systems adjust β to target a specific KL range │
│ (e.g., keep KL between 1 and 10 by adjusting β) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
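The adaptive-β idea mentioned above is often implemented as a simple proportional controller. A sketch, loosely in the style of the InstructGPT-era controllers; the target KL, horizon, and clipping constants are illustrative:

class AdaptiveKLController:
    """Nudges beta toward a target KL: raise beta when KL overshoots, lower it when under."""

    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # clipped proportional error between observed and target KL
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta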
Direct Preference Optimization (DPO)
While PPO is effective, it's complex and resource-intensive. DPO offers a simpler alternative.
The DPO Insight
DPO's key insight is that you can derive a closed-form solution for the optimal policy under certain assumptions, eliminating the need for a separate reward model and RL training loop.
┌─────────────────────────────────────────────────────────────────────────┐
│ DPO: THE SIMPLIFICATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PPO PIPELINE: │
│ ───────────── │
│ │
│ Preferences → Train RM → RL with PPO → Final Model │
│ │
│ Components: 4 models, complex training loop, many hyperparameters │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DPO PIPELINE: │
│ ───────────── │
│ │
│ Preferences → Direct optimization → Final Model │
│ │
│ Components: 2 models (policy + reference), supervised-like training │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ THE MATHEMATICAL INSIGHT: │
│ ───────────────────────── │
│ │
│ Under the RLHF objective with KL penalty, the optimal policy is: │
│ │
│ π*(y|x) ∝ π_ref(y|x) · exp(R(x,y) / β) │
│ │
│ Rearranging, the reward can be expressed in terms of policies: │
│ │
│ R(x,y) = β · log(π*(y|x) / π_ref(y|x)) + β · log Z(x) │
│ │
│ Where Z(x) is a normalizing constant. │
│ │
│ Substituting into the Bradley-Terry preference model: │
│ │
│ P(y_w > y_l | x) = σ(β · log(π(y_w|x)/π_ref(y_w|x)) │
│ - β · log(π(y_l|x)/π_ref(y_l|x))) │
│ │
│ The Z(x) terms cancel! We can train directly on preferences │
│ without ever computing rewards explicitly. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
DPO Loss Function
┌─────────────────────────────────────────────────────────────────────────┐
│ DPO LOSS FUNCTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ L_DPO = -E[log σ(β(log π(y_w|x)/π_ref(y_w|x) │
│ - log π(y_l|x)/π_ref(y_l|x)))] │
│ │
│ Where: │
│ • (x, y_w, y_l) is a preference triple (prompt, winner, loser) │
│ • π is the policy being trained │
│ • π_ref is the frozen reference (SFT) model │
│ • β is the KL penalty coefficient │
│ • σ is the sigmoid function │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INTUITION: │
│ ────────── │
│ │
│ The loss encourages: │
│ • Increasing π(y_w|x) / π_ref(y_w|x) - make winners more likely │
│ • Decreasing π(y_l|x) / π_ref(y_l|x) - make losers less likely │
│ │
│ The log ratios measure how much the policy has changed from │
│ reference. DPO directly optimizes these ratios to match preferences. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ IMPLEMENTATION: │
│ ─────────────── │
│ │
│ For each preference pair: │
│ │
│ 1. Compute log probs under policy: │
│ log_p_w = sum(log π(token | context)) for winner │
│ log_p_l = sum(log π(token | context)) for loser │
│ │
│ 2. Compute log probs under reference (no gradient): │
│ log_ref_w = sum(log π_ref(token | context)) for winner │
│ log_ref_l = sum(log π_ref(token | context)) for loser │
│ │
│ 3. Compute loss: │
│ logits = β * ((log_p_w - log_ref_w) - (log_p_l - log_ref_l)) │
│ loss = -log_sigmoid(logits) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
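The three implementation steps above collapse into a few lines. A minimal PyTorch sketch, where each argument is a batch of summed log-probabilities computed separately under the policy and the frozen reference model:

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument: (batch,) tensor of summed log-probs of the winner (w) or
    loser (l) response under the policy or the reference model.
    """
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()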
DPO vs PPO Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ DPO VS PPO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ASPECT PPO DPO │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Models needed 4 (policy, ref, 2 (policy, reference) │
│ reward, value) │
│ │
│ Training loop Complex RL loop Simple supervised │
│ │
│ Hyperparameters Many (PPO-specific) Fewer (mostly β) │
│ │
│ Stability Can be unstable Generally stable │
│ │
│ Memory Very high (4 models) Lower (2 models) │
│ │
│ Online learning Yes (generates new No (fixed preference │
│ samples during data) │
│ training) │
│ │
│ Sample efficiency Lower Higher │
│ │
│ Reward hacking Can happen Less prone │
│ │
│ Iteration speed Slow (RL) Fast (supervised) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHEN TO USE PPO: │
│ • When you need online learning (generating new samples) │
│ • When you have abundant compute │
│ • When you need precise control over reward optimization │
│ • When using non-pairwise rewards │
│ │
│ WHEN TO USE DPO: │
│ • When you have fixed preference data │
│ • When compute is limited │
│ • When you want simpler training │
│ • When stability is important │
│ • As a first approach (easier to get working) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Other Alignment Methods
The field has developed many alternatives and refinements to PPO and DPO:
IPO (Identity Preference Optimization)
IPO modifies DPO to be more robust to noise in preferences:
┌─────────────────────────────────────────────────────────────────────────┐
│ IPO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM WITH DPO: │
│ DPO assumes preferences are deterministic. In reality, humans are │
│ inconsistent—the same pair might be labeled differently by different │
│ annotators. │
│ │
│ IPO SOLUTION: │
│ Adds regularization that makes optimization less aggressive: │
│ │
│     L_IPO = (log(π_w/π_ref_w) - log(π_l/π_ref_l) - 1/(2β))²            │
│ │
│ Instead of sigmoid, uses squared loss with target margin. │
│ More robust to label noise and inconsistent preferences. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
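For comparison with the DPO sketch earlier, here is the IPO-style squared loss in the same notation (a sketch; the exact parameterization of the target margin varies across write-ups):

import torch

def ipo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """IPO-style loss: push the log-ratio margin toward 1/(2*beta) instead of
    maximizing it without bound."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()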
KTO (Kahneman-Tversky Optimization)
KTO doesn't require paired preferences—just examples labeled as "good" or "bad":
┌─────────────────────────────────────────────────────────────────────────┐
│ KTO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MOTIVATION: │
│ Collecting paired preferences is expensive. Often you just know │
│ "this response is good" or "this response is bad" without comparing. │
│ │
│ KTO APPROACH: │
│ • Train on binary feedback (thumbs up / thumbs down) │
│ • Uses insights from prospect theory (Kahneman & Tversky) │
│ • Humans weight losses more than gains (loss aversion) │
│ │
│ Loss applies different weights to desirable vs undesirable: │
│ • For good responses: encourage higher likelihood │
│ • For bad responses: penalize more heavily (loss aversion) │
│ │
│ ADVANTAGE: │
│ • Easier data collection (no pairing needed) │
│ • Can use thumbs up/down feedback directly │
│ │
└─────────────────────────────────────────────────────────────────────────┘
ORPO (Odds Ratio Preference Optimization)
ORPO combines SFT and preference optimization in one step:
┌─────────────────────────────────────────────────────────────────────────┐
│ ORPO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MOTIVATION: │
│ Why have separate SFT and RLHF stages? Can we combine them? │
│ │
│ ORPO APPROACH: │
│ • Single training stage on preference data │
│ • Loss combines language modeling with preference learning │
│ • No need for reference model │
│ │
│ L_ORPO = L_SFT + λ · L_preference │
│ │
│ Where L_preference uses odds ratios instead of log probabilities. │
│ │
│ ADVANTAGES: │
│ • Simpler pipeline (one stage) │
│ • No reference model needed │
│ • Faster training │
│ │
└─────────────────────────────────────────────────────────────────────────┘
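To make "odds ratios instead of log probabilities" concrete, here is a rough sketch of the preference term. It assumes length-averaged log-probs for the winner and loser under the single model being trained; the helper names and the λ weighting are illustrative:

import torch
import torch.nn.functional as F

def orpo_preference_term(avg_logp_w, avg_logp_l, lam=0.1):
    """Odds-ratio preference term (sketch): avg_logp_* are length-averaged log-probs
    of winner/loser under the model being trained (no reference model needed)."""
    def log_odds(avg_logp):
        # log( p / (1 - p) ), computed from the average log-prob
        return avg_logp - torch.log1p(-torch.exp(avg_logp))
    ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    return -lam * F.logsigmoid(ratio).mean()

# total loss (sketch): standard cross-entropy on the winner plus the term above
# loss = nll_loss_on_winner + orpo_preference_term(avg_logp_w, avg_logp_l)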
GRPO (Group Relative Policy Optimization)
GRPO eliminates the value model by using group-based advantages:
┌─────────────────────────────────────────────────────────────────────────┐
│ GRPO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MOTIVATION: │
│ PPO's value model is expensive and hard to train. Can we estimate │
│ advantages without it? │
│ │
│ GRPO APPROACH: │
│ • Generate multiple responses per prompt (a "group") │
│ • Compute rewards for all responses in group │
│ • Normalize rewards within group (subtract mean, divide by std) │
│ • Use normalized rewards as advantages │
│ │
│ A_i = (R_i - mean(R_group)) / std(R_group) │
│ │
│ This estimates "how good is this response relative to others for │
│ this prompt" without needing a value model. │
│ │
│ ADVANTAGES: │
│ • No value model needed (3 models instead of 4) │
│ • More stable than PPO in some settings │
│ • Used successfully in DeepSeek-R1 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
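The group-normalized advantage is simple to compute. A minimal sketch for one prompt's group of sampled responses:

import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize rewards within the group of responses
    generated for the same prompt (the A_i formula above).

    group_rewards: (group_size,) tensor of reward-model or rule-based scores.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)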
Method Comparison Summary
┌─────────────────────────────────────────────────────────────────────────┐
│ ALIGNMENT METHODS SUMMARY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ METHOD MODELS DATA NEEDED COMPLEXITY WHEN TO USE │
│ ───────────────────────────────────────────────────────────────────── │
│ PPO 4 Reward scores High Online learning, │
│ (from RM) precise control │
│ │
│ DPO 2 Paired preferences Low Fixed data, │
│ simple training │
│ │
│ IPO 2 Paired preferences Low Noisy preferences │
│ (noisy OK) │
│ │
│ KTO 2 Binary feedback Low Unpaired data │
│ (good/bad) │
│ │
│ ORPO 1 Paired preferences Low Combined SFT+pref │
│ │
│ GRPO 3 Reward scores Medium No value model, │
│ reasoning tasks │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Reward Hacking and Overoptimization
One of the most important challenges in RLHF is reward hacking—when the model finds ways to achieve high reward without actually being helpful.
What is Reward Hacking?
The reward model is an imperfect proxy for what humans actually want. It was trained on a finite dataset and has learned patterns that correlate with human preferences but aren't identical to them. When you optimize aggressively against this proxy, the model finds the gaps—behaviors that score highly but aren't actually good.
This is a fundamental problem in optimization: Goodhart's Law states "when a measure becomes a target, it ceases to be a good measure."
┌─────────────────────────────────────────────────────────────────────────┐
│ REWARD HACKING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE FUNDAMENTAL PROBLEM: │
│ ──────────────────────── │
│ │
│ Reward Model ≠ True Human Preferences │
│ │
│ The RM learned correlations from training data: │
│ • Longer responses often rated better → RM learns length = good │
│ • Confident tone often rated better → RM learns confidence = good │
│ • Specific phrases rated well → RM learns those phrases = good │
│ │
│ But correlation ≠ causation: │
│ • Length is good when more detail is needed, not always │
│ • Confidence is good when correct, harmful when wrong │
│ • Phrases are good in context, formulaic when overused │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMMON REWARD HACKS: │
│ ──────────────────── │
│ │
│ LENGTH GAMING: │
│ If RM slightly prefers longer responses... │
│ Model learns: Padding, repetition, unnecessary elaboration │
│ Result: Verbose responses that waste user time │
│ │
│ SYCOPHANCY: │
│ If RM prefers agreeable responses... │
│ Model learns: Agree with user even when they're wrong │
│ Result: Model tells users what they want to hear, not truth │
│ │
│ CONFIDENCE HACKING: │
│ If RM prefers confident-sounding responses... │
│ Model learns: Sound certain even when uncertain │
│ Result: Authoritative-sounding hallucinations │
│ │
│ FORMAT GAMING: │
│ If RM prefers structured responses (lists, headers)... │
│ Model learns: Add structure even when unnecessary │
│ Result: Everything becomes bulleted lists │
│ │
│ KEYWORD STUFFING: │
│ If RM associates certain words with quality... │
│ Model learns: Include those words regardless of relevance │
│ Result: Formulaic, keyword-heavy responses │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Overoptimization Problem
Reward hacking gets worse with more optimization. Initially, optimizing against the reward model improves true quality—the model learns genuinely good behaviors. But past a certain point, further optimization degrades quality as the model exploits reward model weaknesses.
┌─────────────────────────────────────────────────────────────────────────┐
│ OVEROPTIMIZATION DYNAMICS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ True Quality │
│ ↑ │
│ │ │
│ │ ╭───────╮ │
│ │ ╭──╯ ╰──╮ │
│ │ ╭─╯ ╰───────╮ │
│ │ ╭─╯ ╰───────── │
│ │ ╭─╯ │
│ │ ╭─╯ │
│ │ ╭─╯ │
│ ├─╯ │
│ │ │
│ └────────────────────────────────────────────────→ RM Score │
│ │
│ │← Good region →│← Overoptimization region →│ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHAT HAPPENS: │
│ │
│ Early training (good region): │
│ • RM score increases, true quality increases │
│ • Model learns genuinely good behaviors │
│ • RM and human preferences are aligned │
│ │
│ Late training (overoptimization): │
│ • RM score continues increasing │
│ • True quality plateaus then decreases │
│ • Model exploits RM weaknesses │
│ • RM and human preferences diverge │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EMPIRICAL FINDING (Gao et al.): │
│ │
│ True reward ≈ proxy reward - c × √(KL) │
│ │
│ As KL increases (more optimization), the gap between proxy │
│ reward and true reward grows. Eventually, proxy reward keeps │
│ increasing while true reward decreases. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Mitigating Reward Hacking
There's no perfect solution to reward hacking, but several strategies help:
┌─────────────────────────────────────────────────────────────────────────┐
│ MITIGATION STRATEGIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. KL PENALTY (Primary defense) │
│ ──────────────────────────────── │
│ │
│ The KL penalty limits how far the model can deviate from SFT. │
│ This bounds the "search space" for reward hacks. │
│ │
│ If a hack requires significant behavior change → high KL → penalized │
│ │
│ Tuning β: │
│ • Higher β = less hacking but less improvement │
│ • Lower β = more improvement but more risk │
│ • Monitor KL during training, adjust if needed │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. ENSEMBLE REWARD MODELS │
│ ───────────────────────── │
│ │
│ Train multiple reward models on different data subsets. │
│ Use their agreement as the training signal. │
│ │
│ • If all RMs agree response is good → probably good │
│ • If RMs disagree → uncertain, be conservative │
│ • Hacks that exploit one RM unlikely to fool all │
│ │
│ Approaches: │
│ • Average RM scores │
│ • Use minimum RM score (conservative) │
│ • Weight by RM confidence │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. EARLY STOPPING │
│ ───────────────── │
│ │
│ Don't optimize until convergence. Stop before overoptimization. │
│ │
│ Track during training: │
│ • RM score on training prompts (will keep increasing) │
│ • Human evaluation on held-out prompts (will plateau/decrease) │
│ • Stop when human eval stops improving │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 4. DIVERSE REWARD SIGNALS │
│ ───────────────────────── │
│ │
│ Don't rely on single scalar reward. Combine multiple signals: │
│ │
│ • Helpfulness RM │
│ • Harmlessness RM │
│ • Honesty RM │
│ • Factuality classifier │
│ • Length penalty (direct) │
│ • Repetition penalty (direct) │
│ │
│ Total reward = w₁R₁ + w₂R₂ + ... + penalties │
│ │
│ Harder to hack multiple independent signals simultaneously. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 5. CONSTITUTIONAL AI / SELF-CRITIQUE │
│ ──────────────────────────────────── │
│ │
│ Have the model critique its own responses using principles. │
│ Use self-critique as part of training signal. │
│ │
│ Principles (constitution): │
│ • "Choose the response that is most helpful" │
│ • "Choose the response that is most honest" │
│ • "Choose the response that is least harmful" │
│ │
│ The model's own understanding of these principles can catch │
│ issues that the RM misses. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 6. ITERATIVE RLHF │
│ ───────────────── │
│ │
│ Don't do RLHF once. Iterate: │
│ │
│ Round 1: Train RM on initial data, do RLHF │
│ Round 2: Collect preferences on RLHF model outputs │
│ Train new RM, do RLHF again │
│ Round 3: Repeat... │
│ │
│ Each round: │
│ • RM sees model's actual outputs, not just SFT outputs │
│ • Learns to recognize new failure modes │
│ • Closes loopholes model found in previous round │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Practical Considerations
Data Quality for RLHF
The quality of preference data fundamentally limits what RLHF can achieve:
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA QUALITY CONSIDERATIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ANNOTATOR QUALITY: │
│ ────────────────── │
│ │
│ Low-quality annotators: │
│ • Random or careless labeling │
│ • Inconsistent criteria │
│ • Personal biases dominating │
│ Result: RM learns noise, not preferences │
│ │
│ High-quality annotators: │
│ • Thoughtful comparison │
│ • Consistent criteria │
│ • Diverse perspectives │
│ Result: RM learns meaningful preferences │
│ │
│ Best practices: │
│ • Clear annotation guidelines │
│ • Training for annotators │
│ • Inter-annotator agreement monitoring │
│ • Multiple annotators per example (voting) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PROMPT COVERAGE: │
│ ──────────────── │
│ │
│ RM only learns preferences for prompts it sees. │
│ On out-of-distribution prompts, RM behavior is unpredictable. │
│ │
│ Ensure coverage of: │
│ • All expected use cases │
│ • Edge cases and adversarial prompts │
│ • Different difficulty levels │
│ • Various domains and topics │
│ • Different languages (if multilingual) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RESPONSE DIVERSITY: │
│ ─────────────────── │
│ │
│ If all candidate responses are similar, RM learns little. │
│ │
│ Strategies for diversity: │
│ • Vary temperature during generation │
│ • Include responses from different model checkpoints │
│ • Include human-written alternatives │
│ • Include intentionally bad responses │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PREFERENCE COMPLEXITY: │
│ ────────────────────── │
│ │
│ Easy comparisons (one clearly better): │
│ • RM learns obvious patterns quickly │
│ • Limited signal for nuanced preferences │
│ │
│ Hard comparisons (close quality): │
│ • More informative for subtle preferences │
│ • But also noisier (harder for annotators) │
│ │
│ Mix both: Easy examples for basic patterns, hard examples for │
│ fine distinctions. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Hyperparameter Tuning
RLHF has many hyperparameters. Here are the most important:
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY HYPERPARAMETERS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ KL COEFFICIENT (β): │
│ ─────────────────── │
│ Controls deviation from reference model. │
│ • Typical range: 0.01 - 0.2 │
│ • Start with 0.1, adjust based on KL during training │
│ • If KL grows too fast, increase β │
│ • If model doesn't improve, decrease β │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PPO CLIP RANGE (ε): │
│ ─────────────────── │
│ Limits policy update magnitude. │
│ • Typical: 0.2 │
│ • Smaller = more stable, slower │
│ • Larger = faster, potentially unstable │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LEARNING RATE: │
│ ────────────── │
│ For policy updates. │
│ • Typical: 1e-6 to 1e-5 │
│ • Lower than SFT (we want fine adjustments) │
│ • Too high = unstable, too low = no learning │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BATCH SIZE: │
│ ─────────── │
│ Number of prompts per update. │
│ • Typical: 64-512 prompts │
│ • Larger = more stable gradients │
│ • Limited by memory │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PPO EPOCHS: │
│ ────────── │
│ How many times to update on collected experience. │
│ • Typical: 1-4 │
│ • More epochs = more sample efficient │
│ • But can overfit to collected batch │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GAE LAMBDA (λ): │
│ ─────────────── │
│ Advantage estimation smoothing. │
│ • Typical: 0.95 │
│ • Higher = less bias, more variance │
│ • Lower = more bias, less variance │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GENERATION PARAMETERS: │
│ ─────────────────────── │
│ Temperature, top-p for response generation. │
│ • Temperature ~0.7-1.0 for diverse exploration │
│ • Top-p ~0.9-0.95 │
│ • Too low = limited exploration │
│ • Too high = poor quality samples │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Monitoring RLHF Training
What to track during training:
┌─────────────────────────────────────────────────────────────────────────┐
│ TRAINING MONITORING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ REWARD METRICS: │
│ ─────────────── │
│ • Mean reward (should increase, but watch for hacking) │
│ • Reward variance (should stabilize) │
│ • Reward distribution (should shift right) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KL METRICS: │
│ ─────────── │
│ • KL divergence from reference (should stay bounded) │
│ • KL per token (identifies which parts of responses change) │
│ • Sudden KL spike = instability, investigate │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PPO METRICS: │
│ ──────────── │
│ • Policy loss (should decrease) │
│ • Value loss (should decrease) │
│ • Clip fraction (how often clip activates) │
│ - Too high = updates too aggressive │
│ - Too low = clip not needed, could increase ε │
│ • Entropy (diversity of generations) │
│ - Should decrease but not collapse │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ GENERATION METRICS: │
│ ─────────────────── │
│ • Response length (watch for gaming) │
│ • Token diversity (watch for mode collapse) │
│ • Repetition rate (watch for loops) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ HUMAN EVALUATION (periodic): │
│ ────────────────────────────── │
│ • Win rate vs reference model │
│ • Win rate vs previous checkpoint │
│ • Specific quality dimensions (helpfulness, harmlessness, etc.) │
│ │
│ This is the ground truth! RM score can be gamed. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RED FLAGS: │
│ ────────── │
│ • Reward increasing but KL exploding → hacking │
│ • Entropy collapsing → mode collapse │
│ • Response length steadily increasing → length gaming │
│ • Repeated phrases appearing → degeneration │
│ • Human eval declining while RM increases → overoptimization │
│ │
└─────────────────────────────────────────────────────────────────────────┘
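A handful of these metrics can be computed directly from each rollout batch. A small sketch, assuming response-token log-probs are already masked (padding positions set to zero) so that per-sequence sums are valid:

import torch

def rollout_stats(rewards, policy_logprobs, ref_logprobs, response_lengths):
    """A few of the health metrics above, computed from one rollout batch.

    rewards: (B,); policy_logprobs, ref_logprobs: (B, T) masked log-probs;
    response_lengths: (B,) token counts.
    """
    kl_per_seq = (policy_logprobs - ref_logprobs).sum(dim=1)
    return {
        "reward_mean": rewards.mean().item(),
        "reward_std": rewards.std().item(),
        "kl_mean": kl_per_seq.mean().item(),
        "length_mean": response_lengths.float().mean().item(),
    }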
Common Challenges and Solutions
Challenge 1: Reward Model Quality
┌─────────────────────────────────────────────────────────────────────────┐
│ REWARD MODEL ISSUES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYMPTOM: RM accuracy plateaus at 60-65% │
│ │
│ DIAGNOSIS: │
│ • Preference data too noisy │
│ • Annotators inconsistent │
│ • Comparisons too hard (responses too similar) │
│ │
│ SOLUTIONS: │
│ • Improve annotation guidelines │
│ • Filter low-agreement examples │
│ • Add easier comparisons to training set │
│ • Use multiple annotators and majority vote │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SYMPTOM: RM has obvious biases (prefers length, format, etc.) │
│ │
│ DIAGNOSIS: │
│ • Training data has spurious correlations │
│ • Need more diverse/balanced examples │
│ │
│ SOLUTIONS: │
│ • Add adversarial examples (long bad, short good) │
│ • Balance training data across dimensions │
│ • Add explicit penalties (length, format) separate from RM │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Challenge 2: Training Instability
┌─────────────────────────────────────────────────────────────────────────┐
│ TRAINING INSTABILITY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYMPTOM: Loss/reward oscillates wildly or spikes │
│ │
│ DIAGNOSIS: │
│ • Learning rate too high │
│ • Batch size too small │
│ • PPO clip range wrong │
│ │
│ SOLUTIONS: │
│ • Reduce learning rate (try 0.5x) │
│ • Increase batch size │
│ • Use smaller PPO clip range (0.1 instead of 0.2) │
│ • Add gradient clipping │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SYMPTOM: KL divergence explodes │
│ │
│ DIAGNOSIS: │
│ • KL coefficient too low │
│ • Model finding reward hacks │
│ • Learning rate too high │
│ │
│ SOLUTIONS: │
│ • Increase β (KL coefficient) │
│ • Use adaptive KL targeting │
│ • Reduce learning rate │
│ • Check for reward hacking (length, format changes) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SYMPTOM: Entropy collapses (model gets repetitive) │
│ │
│ DIAGNOSIS: │
│ • Model found local optimum in response space │
│ • Over-optimization │
│ │
│ SOLUTIONS: │
│ • Add entropy bonus to reward │
│ • Reduce optimization pressure (increase β) │
│ • Early stopping │
│ • Use checkpoint before collapse │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Challenge 3: Sycophancy
┌─────────────────────────────────────────────────────────────────────────┐
│ SYCOPHANCY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYMPTOM: Model agrees with user even when user is wrong │
│ │
│ WHY IT HAPPENS: │
│ • Users often prefer responses that agree with them │
│ • RM learns that agreement = higher rating │
│ • Model optimizes for agreement over truth │
│ │
│ EXAMPLE: │
│ User: "The capital of Australia is Sydney, right?" │
│ Sycophantic: "Yes, Sydney is the capital of Australia!" │
│ Correct: "Actually, the capital of Australia is Canberra..." │
│ │
│ SOLUTIONS: │
│ │
│ 1. Explicit anti-sycophancy training data: │
│ Include examples where correct disagreement is preferred │
│ │
│ 2. Factuality component: │
│ Add factuality classifier to reward (penalize falsehoods) │
│ │
│ 3. Constitutional AI: │
│ Include principle: "Choose the response that is truthful over │
│ one that tells the user what they want to hear" │
│ │
│ 4. Diverse annotators: │
│ Annotators who value truth over agreement │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Challenge 4: Capability Loss
┌─────────────────────────────────────────────────────────────────────────┐
│ CAPABILITY LOSS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SYMPTOM: Model becomes less capable at certain tasks after RLHF │
│ │
│ WHY IT HAPPENS: │
│ • RLHF optimizes for preference on training distribution │
│ • May degrade performance on out-of-distribution tasks │
│ • Particularly affects: coding, math, specialized knowledge │
│ │
│ EXAMPLE: │
│ • Model becomes better at chat │
│ • But worse at code completion (less in RLHF training data) │
│ │
│ SOLUTIONS: │
│ │
│ 1. Include diverse tasks in RLHF training: │
│ Not just chat, but coding, math, writing, etc. │
│ │
│ 2. Capability-specific reward models: │
│ Use specialized RMs for code, math, etc. │
│ │
│ 3. Multi-task RLHF: │
│ Mix RLHF with SFT on capability tasks │
│ │
│ 4. Monitor capability benchmarks: │
│ Track performance on coding/math benchmarks during training │
│ Stop if regression detected │
│ │
│ 5. Conservative optimization: │
│ Higher β to stay closer to capable SFT model │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The RLHF Landscape in 2025
Current State
RLHF remains central to how frontier models are trained, but the field has evolved:
┌─────────────────────────────────────────────────────────────────────────┐
│ RLHF IN 2025 │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DOMINANT APPROACHES: │
│ │
│ For large models (GPT-4 class): │
│ • PPO/GRPO for online learning with nuanced reward signals │
│ • Constitutional AI for scalable oversight │
│ • Multiple specialized reward models │
│ • Iterative RLHF with human-in-the-loop │
│ │
│ For smaller/open models: │
│ • DPO for simplicity and efficiency │
│ • KTO when paired data unavailable │
│ • ORPO for combined SFT+preference │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY TRENDS: │
│ │
│ 1. AI-assisted feedback │
│ Using strong models to generate preference data │
│ (Constitutional AI, RLAIF) │
│ │
│ 2. Process supervision │
│ Rewarding reasoning steps, not just final answers │
│ Important for math, coding, complex reasoning │
│ │
│ 3. Multi-objective alignment │
│ Separate reward models for different objectives │
│ Pareto optimization across objectives │
│ │
│ 4. Online/iterative RLHF │
│ Continuous collection and training │
│ Closes reward hacking loops │
│ │
│ 5. Reasoning-aware RLHF │
│ GRPO/RLVR for chain-of-thought models │
│ Verifiable rewards for reasoning tasks │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Open Challenges
┌─────────────────────────────────────────────────────────────────────────┐
│ OPEN CHALLENGES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SCALABLE OVERSIGHT: │
│ How do we provide good training signal for tasks humans can't │
│ easily evaluate? (Complex code, advanced math, long documents) │
│ │
│ MESA-OPTIMIZATION: │
│ Can RLHF create models that are optimizing for something │
│ different than what we intended? Risk of hidden objectives. │
│ │
│ REWARD HACKING AT SCALE: │
│ As models get smarter, they may find more sophisticated hacks │
│ that are harder to detect. │
│ │
│ DISTRIBUTIONAL SHIFT: │
│ RLHF trains on specific distribution. How to maintain alignment │
│ as deployment distribution shifts? │
│ │
│ EVALUATION: │
│ How to measure if RLHF actually achieved alignment vs. just │
│ appearing aligned? Hard to distinguish. │
│ │
│ EFFICIENCY: │
│ PPO is expensive. Can we achieve similar results more cheaply? │
│ Active area: better algorithms, synthetic data, curriculum. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary
RLHF is the technique that bridges the gap between capable language models and helpful AI assistants. By learning from human comparisons rather than demonstrations, it captures nuanced preferences that are difficult to specify directly.
The RLHF Pipeline:
- SFT: Create a capable instruction-following starting point
- Reward Modeling: Train a model to predict human preferences
- RL Optimization: Optimize the language model to maximize predicted preference while staying close to SFT
Key Algorithms:
- PPO: The standard for online RLHF—flexible but complex
- DPO: Direct optimization on preferences—simpler but offline
- GRPO: Group-based advantages—efficient for reasoning tasks
Critical Challenges:
- Reward hacking: Models exploit reward model weaknesses
- Overoptimization: More optimization eventually hurts quality
- Sycophancy: Models learn to agree rather than be truthful
Mitigation Strategies:
- KL penalty to bound deviation
- Ensemble reward models
- Iterative training with fresh data
- Constitutional AI for scalable oversight
RLHF isn't perfect, but it's the best technique we have for aligning language models with human intent. Understanding its strengths and limitations is essential for anyone building or deploying AI systems.