HuggingFace TRL: A Deep Dive into the Transformer Reinforcement Learning Library
A comprehensive exploration of HuggingFace TRL's architecture—examining its trainer ecosystem from SFT to GRPO, data collators, reward functions, vLLM integration, and the internals that power modern LLM fine-tuning workflows.
Introduction
HuggingFace's TRL (Transformer Reinforcement Learning) has emerged as the definitive library for fine-tuning large language models. Originally focused on reinforcement learning from human feedback, TRL has evolved into a comprehensive toolkit covering the entire post-training pipeline: supervised fine-tuning, reward modeling, preference optimization, and online RL methods.
This post explores TRL's internal architecture by examining its actual implementation. We'll understand how the trainer ecosystem is organized, how data flows through the pipeline, how different optimization algorithms are implemented, and how TRL integrates with modern inference engines like vLLM for efficient generation during training.
Architecture Overview
TRL builds on top of HuggingFace Transformers and Accelerate, extending the Trainer class with specialized trainers for different training objectives. The library is organized around several key components: trainers that implement different algorithms, configs that parameterize those algorithms, data collators that prepare batches, reward functions that provide learning signals, and utilities that handle model management.
The trainer hierarchy starts with BaseTrainer, which provides common functionality like logging, checkpointing, and distributed training setup. Each specialized trainer inherits from this base and implements its specific training loop. The design philosophy prioritizes ease of use—most trainers can be instantiated with just a model ID and dataset, with sensible defaults for everything else.
The SFT Trainer
Supervised Fine-Tuning (SFT) is typically the first step in post-training. The SFTTrainer handles instruction tuning where the model learns to follow prompts by training on prompt-completion pairs.
Data Format Flexibility
SFTTrainer accepts data in multiple formats. The standard format uses a single text column containing the full training example. The conversational format uses structured messages with roles (user, assistant) that get formatted using chat templates. The prompt-completion format separates prompts and completions, enabling loss computation only on completions.
The trainer automatically detects the format and applies appropriate preprocessing. For conversational data, it applies the model's chat template (or a custom one) to convert messages to the expected format. This automatic handling eliminates boilerplate that would otherwise be needed in every training script.
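To make these formats concrete, the rows below show what a single example might look like in each. The column names (text, messages, prompt/completion) follow TRL's documented conventions; the content itself is invented.

```python
# Illustrative rows for the three dataset formats SFTTrainer accepts.

# 1. Standard (language modeling) format: one "text" column.
standard_row = {"text": "The capital of France is Paris."}

# 2. Conversational format: a "messages" list with role/content dicts,
#    rendered through the tokenizer's chat template during preprocessing.
conversational_row = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# 3. Prompt-completion format: separate columns, enabling loss only on the completion.
prompt_completion_row = {
    "prompt": "What is the capital of France?",
    "completion": " The capital of France is Paris.",
}
```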
Completion-Only Loss
A critical feature for instruction tuning is computing loss only on the completion tokens, not the prompt. When training a model to answer questions, you don't want the model to learn to predict the question—only the answer.
The DataCollatorForLanguageModeling handles this through completion masks. When a dataset includes completion_mask fields (indicating which tokens are part of the completion), the collator sets labels to -100 (the ignore index for cross-entropy loss) for non-completion tokens. This ensures gradients only flow from completion token predictions.
For conversational data with alternating user and assistant messages, assistant_masks serve a similar purpose—the model learns to predict assistant responses while ignoring user messages in the loss computation.
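The masking step itself is simple. The sketch below is illustrative rather than the collator's actual code: given token IDs and a completion mask, labels for prompt tokens are replaced by -100 so cross-entropy ignores them.

```python
import torch

def mask_prompt_tokens(input_ids: torch.Tensor, completion_mask: torch.Tensor) -> torch.Tensor:
    """Return labels where only completion tokens contribute to the loss.

    input_ids:        (batch, seq_len) token IDs
    completion_mask:  (batch, seq_len) 1 for completion tokens, 0 for prompt tokens
    """
    labels = input_ids.clone()
    labels[completion_mask == 0] = -100  # ignore index for torch.nn.CrossEntropyLoss
    return labels

input_ids = torch.tensor([[101, 2054, 2003, 102, 3000, 1012]])
completion_mask = torch.tensor([[0, 0, 0, 0, 1, 1]])  # last two tokens are the completion
print(mask_prompt_tokens(input_ids, completion_mask))
# tensor([[-100, -100, -100, -100, 3000, 1012]])
```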
Padding-Free Training
For efficiency with variable-length sequences, SFTTrainer supports padding-free training. Rather than padding each sequence to a common length (wasting compute on padding tokens), sequences are concatenated and position IDs are used to track sequence boundaries.
The collator generates position IDs that reset at the start of each sequence: if sequence 1 has 5 tokens and sequence 2 has 3 tokens, the position IDs would be [0, 1, 2, 3, 4, 0, 1, 2]. Flash Attention and other efficient attention implementations use these position IDs to compute attention correctly without explicit padding.
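A tiny sketch of how such position IDs can be built from per-sequence lengths:

```python
import torch

def positions_from_lengths(seq_lengths: list[int]) -> torch.Tensor:
    """Build position IDs that restart at 0 for each packed sequence."""
    return torch.cat([torch.arange(n) for n in seq_lengths])

print(positions_from_lengths([5, 3]))
# tensor([0, 1, 2, 3, 4, 0, 1, 2])
```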
Dataset Packing
For maximum efficiency, SFTTrainer can pack multiple short sequences into single training examples. The pack_dataset function concatenates sequences up to a maximum length, reducing the number of forward passes needed.
Packing requires tracking sequence boundaries so attention doesn't cross between packed sequences. The implementation maintains seq_lengths metadata that the data collator uses to generate correct position IDs for padding-free training.
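The sketch below shows the idea with a simple greedy first-fit strategy; TRL's pack_dataset implements more refined strategies, but the essential output is the same: packed token rows plus per-row sequence lengths.

```python
def greedy_pack(examples: list[list[int]], max_length: int):
    """Greedily concatenate tokenized examples into packed rows.

    Returns (packed_rows, seq_lengths_per_row). The seq_lengths metadata is what
    a collator needs to build per-sequence position IDs and keep attention from
    crossing sequence boundaries.
    """
    packed, lengths = [], []
    current, current_lengths = [], []
    for tokens in examples:
        if current and len(current) + len(tokens) > max_length:
            packed.append(current)
            lengths.append(current_lengths)
            current, current_lengths = [], []
        current.extend(tokens)
        current_lengths.append(len(tokens))
    if current:
        packed.append(current)
        lengths.append(current_lengths)
    return packed, lengths

rows, seq_lengths = greedy_pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_length=6)
# rows == [[1, 2, 3, 4, 5], [6, 7, 8, 9]], seq_lengths == [[3, 2], [4]]
```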
Multimodal Support
The DataCollatorForVisionLanguageModeling extends SFT to vision-language models. It handles the complexity of on-the-fly image processing—converting images to pixel values during batching rather than preprocessing the entire dataset upfront.
This approach is necessary because image preprocessing is disk-intensive and the processed representations are large. The collator uses the model's processor to tokenize text and process images together, producing batches with input_ids, attention_mask, pixel_values, and labels.
The DPO Trainer
Direct Preference Optimization (DPO) learns from human preferences without explicitly training a reward model. Given pairs of preferred (chosen) and rejected responses, DPO directly optimizes the policy to prefer the chosen response.
The DPO Loss
DPO's core insight is that the optimal policy for a given reward function can be expressed in closed form, allowing direct optimization without RL. The loss compares log probabilities under the policy and a reference model.
The implementation computes per-token log probabilities for both chosen and rejected responses, sums them to get sequence-level log probabilities, and computes the DPO loss using these along with reference model log probabilities. The reference log probabilities can be precomputed and cached or computed on-the-fly.
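A minimal sketch of the standard sigmoid DPO loss, assuming the sequence-level log probabilities have already been summed as described above; beta is the usual temperature on the implicit reward.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Sequence-level DPO loss from summed log-probabilities.

    Each argument is a (batch,) tensor; beta scales the implicit reward,
    i.e. how strongly the policy is pushed away from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return losses.mean(), chosen_rewards.detach(), rejected_rewards.detach()
```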
Reference Model Management
DPO requires a frozen reference model to provide a baseline for the KL divergence penalty. TRL provides several options for this. The simplest is loading a separate copy of the model weights as the reference. More memory-efficient is sharing the base model weights and only keeping separate LoRA adapters for policy and reference.
The SyncRefModelCallback handles keeping the reference model synchronized when using techniques like soft updates where the reference slowly tracks the policy.
Data Collation for Preferences
The DataCollatorForPreference handles the unique structure of preference data. Each example contains prompt, chosen_input_ids, and rejected_input_ids. The collator pads these appropriately, handling the asymmetry where prompts are left-padded (for generation) while completions are right-padded.
The collator can also include precomputed reference log probabilities if available, avoiding the need to compute them during training.
Loss Variants
TRL implements multiple DPO loss variants through the loss_type parameter on DPOConfig. Beyond the standard sigmoid loss, options include IPO (Identity Preference Optimization), which uses a squared-error formulation, along with hinge, robust, and other variants that trade off optimization stability against alignment strength. A separate f_divergence_type option (the FDivergenceType enum) swaps the reverse-KL term for other f-divergences.
The GRPO Trainer
Group Relative Policy Optimization (GRPO), introduced with DeepSeekMath, has become one of the most widely used online RL algorithms for LLMs. Unlike DPO, which works with static preference data, GRPO generates completions online and learns from reward signals.
Online Generation
GRPO's training loop alternates between generation and optimization. Given prompts, the model generates multiple completions per prompt (the "group"). A reward function scores each completion, and the policy is updated to increase the probability of higher-reward completions relative to lower-reward ones within the same group.
The generation step is critical for efficiency. TRL supports three backends: native HuggingFace generation, vLLM for high-throughput generation, and external generation services. The vLLM integration is particularly important for large models where generation would otherwise dominate training time.
vLLM Integration
When use_vllm=True, the GRPOTrainer routes generation through vLLM, either by connecting to a separately launched vLLM server or by running a colocated vLLM engine inside the training process, depending on the configured mode. The integration handles the complexity of keeping model weights synchronized—after each training step, updated weights must be transmitted to the vLLM workers.
The VLLMClient class manages communication with the vLLM server. It handles batched generation requests, retrieves log probabilities needed for the policy gradient, and manages server lifecycle. For distributed training, coordination ensures weight updates are properly broadcast.
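The sketch below shows how vLLM-backed generation might be enabled in the config; the parameter and CLI names follow recent TRL releases and may differ across versions.

```python
from trl import GRPOConfig

# Sketch of enabling vLLM-backed generation for GRPO; names follow recent TRL
# releases and are not guaranteed to match every version.
config = GRPOConfig(
    output_dir="grpo-output",
    use_vllm=True,        # route generation through vLLM instead of model.generate
    vllm_mode="server",   # talk to a separately launched vLLM server
    # The server itself is started beforehand, e.g.:
    #   trl vllm-serve --model Qwen/Qwen2.5-1.5B-Instruct
)
```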
Reward Functions
GRPO supports flexible reward specification. Rewards can come from a trained reward model (specified by model ID or PreTrainedModel instance), from a custom callable function, or from multiple sources that are summed together.
Custom reward functions receive prompts, completions, and any additional dataset columns. They return per-completion rewards that can be floats or None (for samples where the reward doesn't apply, useful for multi-task training).
The reward function interface also receives the trainer's state, enabling rewards that depend on training progress (like curriculum learning where reward difficulty increases over time).
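A toy custom reward function, assuming a standard (non-conversational) dataset where completions are plain strings; the function name and the commented trainer wiring are illustrative.

```python
def length_penalty_reward(completions, **kwargs):
    """Toy reward: prefer completions close to 100 characters.

    Assumes each completion is a plain string. Extra dataset columns and
    trainer metadata arrive via **kwargs; returning None for an element
    skips that sample's reward.
    """
    return [-abs(100 - len(completion)) / 100.0 for completion in completions]

# Illustrative wiring (callables can be mixed with reward-model IDs in a list):
# trainer = GRPOTrainer(
#     model="Qwen/Qwen2.5-0.5B-Instruct",
#     reward_funcs=[length_penalty_reward],
#     train_dataset=dataset,
# )
```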
The GRPO Loss
The core GRPO objective uses the group structure to normalize rewards. For each prompt, the mean reward across the group becomes the baseline. Completions above the baseline get positive advantage signals while those below get negative signals.
The implementation computes per-token log probabilities under both policy and reference models, calculates advantages from normalized rewards, and applies the policy gradient with a KL penalty to prevent the policy from drifting too far from the reference.
The loss includes several components: the policy gradient loss weighted by advantages, a KL divergence term between policy and reference, and optional entropy regularization to encourage exploration.
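The snippet below sketches the group-normalized advantages and a simplified on-policy form of the loss, without the PPO-style ratio clipping that applies when generations are reused across multiple updates. The KL term uses the low-variance k3-style estimator, and the kl_coef value is illustrative.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize rewards within each group of completions for the same prompt.

    rewards: (num_prompts, group_size) raw rewards.
    Returns advantages with zero mean and roughly unit scale per group.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_token_loss(policy_logps, ref_logps, advantages, completion_mask, kl_coef=0.04):
    """Simplified per-token loss: policy gradient weighted by the group
    advantage plus a KL penalty toward the reference model.

    policy_logps, ref_logps, completion_mask: (batch, seq_len) tensors;
    advantages: (batch,) one advantage per completion.
    """
    # k3-style estimator of KL(policy || ref): non-negative, low variance.
    kl = torch.exp(ref_logps - policy_logps) - (ref_logps - policy_logps) - 1
    per_token = -(policy_logps * advantages.unsqueeze(1)) + kl_coef * kl
    return (per_token * completion_mask).sum() / completion_mask.sum().clamp(min=1)
```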
Asynchronous Reward Computation
For expensive reward functions (like calling external APIs), GRPO supports asynchronous reward computation. The training loop can continue with the next generation batch while rewards for the previous batch are still being computed, hiding latency.
The implementation uses an event loop running in a daemon thread to manage async reward functions. Rewards are gathered using asyncio and integrated into the training loop when available.
The RLOO Trainer
REINFORCE Leave-One-Out (RLOO) is an alternative to GRPO that uses a different baseline computation. Instead of averaging rewards across the group, RLOO uses a leave-one-out estimator where each completion's baseline is the average of all other completions in the group.
Because the completion being updated does not contribute to its own baseline, the baseline is independent of the sample it corrects, which keeps the policy-gradient estimate unbiased. The implementation is structurally similar to GRPO—online generation, reward computation, policy gradient—but with the modified baseline calculation.
RLOO requires the group size to be at least 2 (so there's always at least one other completion to form the baseline). The optimal group size balances variance reduction (larger groups) against compute cost (more generations per prompt).
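A sketch of the leave-one-out advantage computation, assuming rewards are laid out as (num_prompts, group_size):

```python
import torch

def leave_one_out_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """RLOO advantages: each completion's baseline is the mean reward of the
    *other* completions generated for the same prompt.

    rewards: (num_prompts, group_size) with group_size >= 2.
    """
    k = rewards.size(1)
    group_sum = rewards.sum(dim=1, keepdim=True)
    baseline = (group_sum - rewards) / (k - 1)  # mean of the other k-1 rewards
    return rewards - baseline

rewards = torch.tensor([[1.0, 0.0, 0.5]])
print(leave_one_out_advantages(rewards))
# tensor([[ 0.7500, -0.7500,  0.0000]])
```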
The PPO Trainer
Proximal Policy Optimization (PPO) is the classical RLHF algorithm and uses a learned value function alongside the policy. While more complex than DPO or GRPO, PPO constrains how far each update can move the policy through its clipped objective, which makes training more stable.
Value Head
Unlike DPO and GRPO, which learn no separate value function, PPO requires a value function that predicts expected future rewards. TRL implements this through AutoModelForCausalLMWithValueHead, which adds a value head on top of the language model.
The value head is typically a linear layer on the model's hidden states. It outputs a scalar value estimate for each token position, representing the expected cumulative reward from that position onward.
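A sketch in the spirit of that design, dropout followed by a linear projection to one scalar per position; it is illustrative rather than TRL's exact module.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Minimal value head: project LM hidden states to one value per token."""

    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> values: (batch, seq_len)
        return self.summary(self.dropout(hidden_states)).squeeze(-1)
```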
The PPO Training Step
Each PPO update involves multiple steps. First, generate completions and compute rewards. Second, compute advantages using Generalized Advantage Estimation (GAE), which uses the value function to reduce variance while trading off some bias. Third, perform multiple epochs of gradient updates on the collected data, using the PPO clipped objective to prevent too-large policy changes.
The clipped objective ensures updates are conservative. If the new policy assigns much higher probability to an action than the old policy, the objective clips the benefit to prevent destabilizing large updates.
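A compact sketch of the clipped surrogate; the 0.2 clip range is a conventional default rather than a TRL-specific value.

```python
import torch

def ppo_clipped_loss(new_logps, old_logps, advantages, clip_range: float = 0.2):
    """PPO clipped surrogate loss; all arguments are tensors of the same shape."""
    ratio = torch.exp(new_logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Take the pessimistic (minimum) objective, negated to form a loss.
    return -torch.min(unclipped, clipped).mean()
```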
KL Penalty
PPO for RLHF includes a KL divergence penalty to keep the policy close to the initial supervised fine-tuned model. This prevents reward hacking where the model finds degenerate high-reward behaviors that diverge from natural language.
The KL penalty can be fixed or adaptive. Adaptive KL adjusts the penalty coefficient to maintain a target KL divergence, increasing the penalty if the policy diverges too fast and decreasing it if updates are too conservative.
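The sketch below follows the adaptive controller popularized by Ziegler et al. (2019), which early TRL PPO code mirrored; the class name and horizon value here are illustrative.

```python
class AdaptiveKLController:
    """Adaptive KL coefficient: nudge the coefficient up when measured KL
    exceeds the target and down when it falls short, with a clipped
    proportional update."""

    def __init__(self, init_kl_coef: float, target_kl: float, horizon: int = 10_000):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl: float, n_steps: int) -> float:
        proportional_error = max(min(current_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.kl_coef *= 1.0 + proportional_error * n_steps / self.horizon
        return self.kl_coef
```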
The Reward Trainer
Training reward models from human preference data is a prerequisite for PPO, and reward models are also useful for scoring, filtering, or constructing the preference pairs that offline methods like DPO train on. The RewardTrainer handles this specialized classification task.
Preference Data Format
Reward training data consists of (prompt, chosen, rejected) triples. The trainer formats these as a classification problem: given a response, predict whether it's the preferred one. The model is trained to assign higher scores to chosen responses than rejected ones.
Bradley-Terry Loss
The standard reward modeling loss is based on the Bradley-Terry model of pairwise preferences. Given reward scores r_chosen and r_rejected, the loss is -log(sigmoid(r_chosen - r_rejected)). This encourages the model to assign higher rewards to chosen responses with a margin.
The implementation uses AutoModelForSequenceClassification with num_labels=1 to produce scalar rewards. The DataCollatorForPreference interleaves chosen and rejected examples in batches, and the training step computes the pairwise loss.
Margin-Aware Training
Some preference datasets include confidence margins—stronger preferences should have larger reward gaps. The reward trainer supports margin-aware losses where the target gap scales with the margin annotation.
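A sketch of the pairwise loss with an optional margin subtracted from the reward gap; tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor,
                       margin: torch.Tensor | None = None) -> torch.Tensor:
    """Pairwise reward-modeling loss: -log sigmoid(r_chosen - r_rejected).

    If a margin annotation is provided, it is subtracted from the gap, so
    strongly preferred pairs must be separated by a larger reward difference.
    """
    gap = chosen_rewards - rejected_rewards
    if margin is not None:
        gap = gap - margin
    return -F.logsigmoid(gap).mean()

chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.1, 0.5])
print(bradley_terry_loss(chosen, rejected))                                 # plain pairwise loss
print(bradley_terry_loss(chosen, rejected, margin=torch.tensor([0.5, 0.5])))  # margin-aware
```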
The KTO Trainer
Kahneman-Tversky Optimization (KTO) is a preference optimization method that doesn't require paired data. Unlike DPO which needs chosen/rejected pairs for the same prompt, KTO works with unpaired data labeled as "good" or "bad".
This relaxed data requirement is valuable because obtaining explicit preference comparisons is expensive, while binary quality labels are easier to collect. KTO uses insights from prospect theory to weight positive and negative examples appropriately.
Callbacks and Monitoring
TRL provides callbacks for common training workflows. The LogCompletionsCallback logs sample generations during training to track qualitative progress. The WinRateCallback evaluates model generations against a baseline using a judge model, providing a more interpretable metric than loss.
Integration with experiment tracking platforms (Weights & Biases, MLflow, Comet) allows monitoring training curves, generation samples, and evaluation metrics throughout training.
Distributed Training Support
TRL inherits distributed training capabilities from Accelerate and extends them for RL-specific needs. FSDP (Fully Sharded Data Parallel) and DeepSpeed are supported for training large models across multiple GPUs.
For online RL methods, distributed training introduces additional complexity around generation. Each rank generates completions for its local batch, rewards are computed (potentially involving communication for model-based rewards), and gradients are synchronized during the optimization step.
The prepare_deepspeed and prepare_fsdp utilities handle model wrapping for different distributed backends, ensuring proper handling of the policy model, reference model, and reward model.
PEFT Integration
TRL deeply integrates with PEFT (Parameter-Efficient Fine-Tuning) for LoRA and other adapter methods. All trainers accept a peft_config parameter to wrap models with PEFT.
For DPO with LoRA, a particularly efficient setup is sharing base model weights between policy and reference, with separate LoRA adapters. The policy LoRA is trained while the reference LoRA is frozen. This dramatically reduces memory compared to loading two full model copies.
The get_peft_config utility helps construct PEFT configurations, and helper functions handle the complexity of merging adapters, switching between adapters, and saving/loading PEFT checkpoints.
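A sketch of the shared-base-weights setup, assuming recent TRL and PEFT releases; the model ID and dataset are placeholders. Passing peft_config and leaving ref_model as None lets the trainer treat the base weights with adapters disabled as the frozen reference.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # example model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with PEFT, the base weights with adapters disabled act as the reference
    args=DPOConfig(output_dir="dpo-lora-output"),
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```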
Practical Usage Patterns
Starting Simple
For most use cases, the recommended workflow starts with SFT on instruction data. This establishes strong instruction-following before preference optimization. A minimal SFTTrainer setup requires only a model ID and dataset.
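A sketch along the lines of the TRL quickstart; the model ID and dataset are illustrative.

```python
from datasets import load_dataset
from trl import SFTTrainer

# Minimal SFT run: a model ID string and a dataset are enough; TRL loads the
# model and tokenizer and applies sensible defaults for everything else.
dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
```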
Adding Preference Optimization
After SFT, preference optimization improves response quality. DPO is the simplest choice for offline preference data. For online improvement, GRPO or RLOO generate completions and learn from reward signals.
Scaling Considerations
For large models, key optimizations include using LoRA to reduce trainable parameters, enabling gradient checkpointing to reduce memory, using vLLM for efficient generation in online methods, and leveraging FSDP or DeepSpeed for multi-GPU training.
Custom Rewards
For domain-specific applications, custom reward functions provide flexibility. Math problems might use execution-based correctness checking. Code generation might use test case passing rates. Factual QA might use retrieval-based verification.