Open R1: Hugging Face's Open Reproduction of DeepSeek-R1
Detailed analysis of Hugging Face's Open R1 project—the fully open reproduction of DeepSeek-R1. Learn how it implements GRPO, reward functions, distillation, and the complete training pipeline for reasoning models.
Introduction: Why Open R1 Matters
When DeepSeek released R1 in January 2025, it demonstrated that reasoning models could achieve state-of-the-art performance on mathematical, coding, and scientific benchmarks. But while the model weights and technical report were public, the training recipes, datasets, and infrastructure remained proprietary. Hugging Face's Open R1 project aims to change that—building a fully open reproduction that the community can learn from, modify, and build upon.
Open R1 isn't just another fine-tuning script. It's a complete implementation of the training pipeline that produced reasoning capabilities in DeepSeek-R1, including:
- SFT distillation from R1's reasoning traces
- GRPO (Group Relative Policy Optimization) for reinforcement learning
- Rule-based and verifiable reward functions for math and code
- Distributed training infrastructure with vLLM integration
- Evaluation pipelines that reproduce DeepSeek's reported benchmarks
This post provides a deep technical analysis of Open R1's architecture and implementation. If you're unfamiliar with the theoretical foundations, I recommend reading Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR first, as Open R1 is essentially the practical implementation of those concepts.
The Three-Step Plan: From Distillation to Pure RL
Open R1 follows DeepSeek's technical report as a guide, breaking the reproduction into three phases:
Step 1: SFT Distillation (Completed)

The first milestone was replicating the R1-Distill models by fine-tuning on reasoning traces generated by DeepSeek-R1. This produced the Mixture-of-Thoughts dataset—350,000 verified reasoning traces spanning mathematics, coding, and science. The resulting OpenR1-Distill-7B matches DeepSeek's distilled model on key benchmarks.

Step 2: Pure RL Pipeline (In Progress)

The more ambitious goal is replicating R1-Zero—the model that developed reasoning capabilities purely through reinforcement learning, without any reasoning traces in the training data. This requires GRPO training at scale with carefully designed reward functions.

Step 3: Multi-Stage Training (Planned)

The final phase will demonstrate the complete pipeline: taking a base model through SFT warmup, then RL fine-tuning, matching DeepSeek's multi-stage approach.
This phased approach is pragmatic. Distillation validates that the training infrastructure works and produces competitive models. Pure RL is harder—it requires the model to discover reasoning patterns through exploration, with only correctness signals as guidance.
Architecture Overview: What's in the Repository
Open R1 follows a deliberately simple design philosophy. Rather than building a complex framework, it provides clean training scripts that leverage existing libraries—particularly TRL (Transformers Reinforcement Learning) for the trainer implementations.
The core components are:
Training Scripts
- sft.py - Supervised fine-tuning using TRL's SFTTrainer
- grpo.py - GRPO training using TRL's GRPOTrainer
- generate.py - Synthetic data generation using Distilabel
Supporting Infrastructure
- rewards.py - Comprehensive reward function library (14+ implementations)
- configs.py - Configuration dataclasses extending TRL's configs
- utils/ - Dataset loading, model utilities, callbacks, evaluation
Training Recipes
- YAML configurations for different models and tasks
- Accelerate configs for DDP, FSDP, ZeRO-2, and ZeRO-3
- Slurm scripts for cluster training
The design emphasizes reproducibility. Every training run can be launched with a single command pointing to a YAML config file. Hyperparameters are documented, and the recipes folder provides working configurations for various model sizes.
Understanding the SFT Pipeline
Supervised fine-tuning in Open R1 is straightforward but includes important details that affect downstream GRPO training.
The Training Flow
The SFT script follows a standard pattern: load dataset, load tokenizer, load model, initialize trainer, train, save. What makes it notable is the attention to chat template configuration.
DeepSeek's R1 models use a specific chat template that affects how reasoning traces are formatted. The model expects outputs structured as:
<think>
[reasoning steps]
</think>
<answer>
[final answer]
</answer>
Getting this right matters for two reasons. First, the format reward functions in GRPO training check for these exact tags. Second, the original DeepSeek chat template has quirks—it hides the thinking block content and prefills the assistant response with <think>, which interferes with format rewards. Open R1 provides a corrected chat template that includes the full reasoning block in completions.
Dataset: Mixture-of-Thoughts
The Mixture-of-Thoughts dataset represents Step 1's completion. It contains 350,000 reasoning traces generated by DeepSeek-R1 across three domains:
- Mathematics - Problems from NuminaMath and similar sources
- Coding - Programming problems with test cases
- Science - Scientific reasoning problems (GPQA-style)
Each example includes the problem, the complete reasoning trace (including the thinking block), and the final answer. The traces are verified—meaning the final answers were checked against ground truth, and incorrect traces were filtered out.
This verification step is crucial. Distillation from unverified traces would teach the model to generate plausible-sounding but incorrect reasoning. By filtering to only correct traces, the model learns patterns that actually lead to right answers.
Model Selection and Scaling
Open R1 supports various base models, but the primary recipes target the Qwen 2.5 family:
- Qwen2.5-1.5B - For development and debugging
- Qwen2.5-7B - The main distillation target (matching DeepSeek-R1-Distill-Qwen-7B)
- Qwen2.5-32B - For scaling experiments
The choice of Qwen as the base architecture matches DeepSeek's distillation targets. Importantly, Open R1 uses open-r1/Qwen2.5-Math-7B-RoPE-300k as the base for the 7B recipe—a Qwen 2.5 Math model with extended RoPE context length (300k tokens). This extended context is necessary because reasoning traces can be very long—mathematical proofs often exceed 10,000 tokens.
GRPO Training: From Theory to Implementation
GRPO is where Open R1 implements the core innovation behind R1-style training. If you've read my GRPO deep dive, you know the algorithm eliminates the value function by using group statistics for advantage estimation. Open R1 makes this practical.
How the GRPOTrainer Works
The training script is surprisingly compact—about 180 lines. It:
- Loads the dataset and tokenizer
- Loads the model with appropriate optimizations (Flash Attention, gradient checkpointing)
- Initializes reward functions from the registry
- Formats prompts into conversations with optional system prompts
- Initializes the GRPOTrainer from TRL
- Runs the training loop
The heavy lifting happens in TRL's GRPOTrainer, which handles:
- Generating multiple completions per prompt (the "group" in GRPO)
- Computing rewards for each completion
- Estimating advantages using group statistics
- Running the PPO-style policy update with clipping
Open R1's contribution is the reward function library and the infrastructure to run this at scale.
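The group-relative advantage step at the heart of this loop can be sketched in a few lines (a simplified illustration, not TRL's actual implementation):

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO advantage sketch: the baseline is the group mean reward,
    normalized by the group standard deviation. No value function needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four completions scored by the reward functions:
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
# Correct completions get positive advantages, incorrect ones negative
```

Completions that beat their group's average are reinforced; those below it are suppressed. This is why within-group reward variance matters—if all completions score identically, every advantage is zero and no learning occurs.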
The vLLM Integration
GRPO requires generating many completions per training step. For a batch of 16 prompts with 16 generations each, that's 256 completions per step. Standard HuggingFace generate() would be painfully slow.
Open R1 uses vLLM as the generation backend. TRL's vLLM integration runs a vLLM server either colocated with training (for single-node experiments) or on separate nodes (for multi-node scaling). The training process sends prompts to vLLM via HTTP, which handles batched, continuous generation with PagedAttention.
For multi-node GRPO, the architecture is:
- N nodes running the training loop (gradient computation)
- 1 node running the vLLM server (generation)
This separation allows scaling generation independently from training. The vLLM server can use tensor parallelism to handle larger models, while training nodes use data parallelism for throughput.
Configuration Deep Dive
Looking at a representative GRPO config reveals the key hyperparameters:
Generation Parameters
- num_generations: 16 - Generate 16 completions per prompt
- temperature: 0.7 - Moderate temperature for diversity
- max_completion_length: 2048 - Maximum tokens per completion
Training Parameters
- learning_rate: 1.0e-6 - Very low learning rate (typical for RL fine-tuning)
- per_device_train_batch_size: 16 - Prompts per device per step
- gradient_accumulation_steps: 4 - Effective batch of 64 prompts
Reward Configuration
- reward_funcs: [accuracy, format, tag_count] - Which rewards to use
- reward_weights: [1.0, 1.0, 1.0] - How to weight them
The generation count (16) is a critical tradeoff. More generations give better advantage estimates but cost more compute. DeepSeek used larger groups (reportedly 64+), but 16 works well for smaller-scale experiments.
The learning rate (1e-6) is deliberately tiny. RL fine-tuning is notoriously unstable—large updates can collapse the model's capabilities. The low learning rate, combined with PPO's clipping, keeps updates conservative.
The Reward Function Library: Open R1's Core Innovation
The rewards.py module is perhaps Open R1's most valuable contribution. It provides 14 production-ready reward functions covering mathematics, code execution, and formatting. Understanding these rewards is essential for building reasoning models.
Mathematical Accuracy Rewards
The accuracy_reward function implements mathematical verification using the math_verify library. It doesn't just check string equality—it parses LaTeX expressions, normalizes mathematical notation, and verifies symbolic equivalence.
The verification flow:
- Parse the ground truth answer using LaTeX extraction
- Parse the model's response, looking for \boxed{} answers first
- Normalize both answers (handle equivalent notations)
- Compare symbolically using math_verify
This matters because "1/2" and "0.5" and "0.500" and "\frac{1}{2}" are all the same answer. String matching would mark these wrong; symbolic verification correctly identifies them as equivalent.
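As a toy illustration of why normalization matters, here is a deliberately tiny, hypothetical normalizer (math_verify handles vastly more notation than this; the helper name and regex are illustrative assumptions):

```python
import re
from fractions import Fraction

def normalize(ans: str):
    """Hypothetical mini-normalizer: map \\frac{a}{b}, a/b, and decimal
    strings to one exact representation. math_verify is far more general."""
    m = re.fullmatch(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", ans.strip())
    if m:
        return Fraction(int(m.group(1)), int(m.group(2)))
    try:
        return Fraction(ans.strip())
    except ValueError:
        return None  # unparseable: skip rather than guess

# "1/2", "0.5", "0.500", and "\frac{1}{2}" all map to the same value
assert len({normalize(a) for a in ["1/2", "0.5", "0.500", r"\frac{1}{2}"]}) == 1
```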
The function also handles edge cases gracefully. If the ground truth can't be parsed (malformed LaTeX in the dataset), it returns None to skip that example rather than assigning an arbitrary reward. This prevents training on noisy signals.
Format Rewards: Teaching Structure
DeepSeek R1's distinctive feature is its structured output—thinking inside <think> tags, answers inside <answer> tags. Three reward functions encourage this:
format_reward - Binary reward for perfect formatting. The response must exactly match the pattern:
<think>
[content]
</think>
<answer>
[content]
</answer>
This is strict—extra whitespace or missing newlines cause failure.
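A strict check like this is a single anchored regex. The exact newline placement below is an assumption based on the pattern shown above, not Open R1's literal source:

```python
import re

# Strict format check: <think> block followed by <answer> block,
# each on its own lines (newline placement assumed from the description)
FORMAT_RE = re.compile(r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """Binary reward: 1.0 for a perfectly formatted completion, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0

assert format_reward("<think>\nsteps\n</think>\n<answer>\n42\n</answer>") == 1.0
assert format_reward("<think>steps</think><answer>42</answer>") == 0.0  # missing newlines
```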
tag_count_reward - Partial credit for structural elements. It awards 0.25 points each for:
- Having exactly one <think>\n opening tag
- Having exactly one \n</think>\n closing tag
- Having exactly one \n<answer>\n opening tag
- Having exactly one \n</answer> closing tag
This softer signal helps early in training when the model is learning the format.
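The partial-credit scheme reduces to counting substrings (a sketch of the scoring described above; the exact whitespace conventions are an assumption):

```python
def tag_count_reward(text: str) -> float:
    """Award 0.25 per correctly placed structural tag, up to 1.0."""
    score = 0.0
    if text.count("<think>\n") == 1:
        score += 0.25
    if text.count("\n</think>\n") == 1:
        score += 0.25
    if text.count("\n<answer>\n") == 1:
        score += 0.25
    if text.count("\n</answer>") == 1:
        score += 0.25
    return score

assert tag_count_reward("<think>\nsteps\n</think>\n<answer>\n42\n</answer>") == 1.0
assert tag_count_reward("<think>\nsteps\n</think>") == 0.25  # partial credit
```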
reasoning_steps_reward - Encourages step-by-step reasoning by detecting patterns like "Step 1:", numbered lists, bullet points, and transition words ("First,", "Therefore,"). It caps at 1.0 after detecting 3+ reasoning indicators.
Length-Based Rewards: Preventing Overthinking
Two rewards address a subtle problem: models under RL pressure sometimes generate unnecessarily long responses, padding with repetitive reasoning or redundant verification steps.
len_reward (from Kimi 1.5) computes:
- For correct answers: reward = 0.5 - (length - min_length) / (max_length - min_length)
- For incorrect answers: reward = min(0, above_formula)
Correct answers are rewarded for brevity (shorter = higher reward). Incorrect answers are never rewarded for length, only penalized if very long.
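The formulas above translate directly into code (a sketch of the Kimi 1.5 scheme as described, with illustrative parameter values):

```python
def len_reward(is_correct: bool, length: int, min_len: int, max_len: int) -> float:
    """Kimi 1.5-style length reward: brevity bonus for correct answers,
    length-only penalty (never a bonus) for incorrect ones."""
    lam = 0.5 - (length - min_len) / (max_len - min_len)
    return lam if is_correct else min(0.0, lam)

# A short correct answer earns the maximum brevity bonus
assert len_reward(True, 100, 100, 1100) == 0.5
# A short incorrect answer earns nothing (clamped at zero)
assert len_reward(False, 100, 100, 1100) == 0.0
# A maximally long correct answer is penalized
assert len_reward(True, 1100, 100, 1100) == -0.5
```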
cosine_scaled_reward applies a cosine schedule to length, providing smoother gradients than the linear formula. Shorter correct solutions get rewards closer to max_value_correct, while longer ones approach min_value_correct.
soft_overlong_punishment (from the DAPO paper) implements a different approach: it only penalizes, never rewards. The logic is:
- If completion length ≤ (max_length - cache_size): reward = 0 (no penalty)
- If (max_length - cache_size) < length ≤ max_length: reward scales linearly from 0 to -1
- If length > max_length: reward = -1 (full penalty)
This prevents the model from gaming length rewards by generating artificially short responses. It only intervenes when completions approach the maximum allowed length.
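The piecewise logic can be sketched as follows (parameter values are illustrative, not DAPO's defaults):

```python
def soft_overlong_punishment(length: int, max_len: int, cache: int) -> float:
    """DAPO-style overlong penalty: zero inside the length budget,
    linear ramp down to -1 inside the cache window, -1 beyond max_len."""
    if length <= max_len - cache:
        return 0.0
    if length <= max_len:
        return (max_len - cache - length) / cache  # ramps from 0 to -1
    return -1.0

assert soft_overlong_punishment(1000, 2048, 512) == 0.0   # inside budget
assert soft_overlong_punishment(1792, 2048, 512) == -0.5  # mid-ramp
assert soft_overlong_punishment(3000, 2048, 512) == -1.0  # over the limit
```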
Repetition Penalty: Avoiding Degenerate Loops
The repetition_penalty_reward implements n-gram repetition detection, crucial for preventing a known failure mode where models get stuck repeating phrases.
For English, it splits text into words and counts unique n-grams (default n=3). The penalty scales with repetition:
A response with no repetition (all unique trigrams) gets reward 0. A highly repetitive response approaches max_penalty (typically -1.0).
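A minimal English-only sketch of the n-gram scheme (the scaling function here is a simple linear assumption; the real implementation may differ):

```python
def repetition_penalty(text: str, n: int = 3, max_penalty: float = -1.0) -> float:
    """Penalize by the fraction of duplicated n-grams (English word-split only)."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    dup_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
    return max_penalty * dup_fraction

assert repetition_penalty("all words here are fully unique") == 0.0
assert repetition_penalty("so so so so so so so so") < -0.5  # heavily repetitive
```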
The function also supports Chinese via jieba segmentation, acknowledging that word boundaries work differently across languages.
Code Execution Rewards
For coding tasks, Open R1 provides execution-based rewards that actually run the generated code against test cases.
code_reward extracts code from the response, sends it to an execution sandbox (E2B or Morph), runs it against test cases, and returns the pass rate. A response passing 7 of 10 tests gets reward 0.7.
binary_code_reward thresholds the above at 0.99—only (near) perfect solutions get reward 1.0, everything else gets 0.0. This binary signal is cleaner for training but provides less gradient information.
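Stripped of the sandbox plumbing, the two scoring rules reduce to this (a sketch over precomputed test results, not the real functions, which execute code remotely):

```python
def code_reward(test_results: list[bool]) -> float:
    """Graded signal: fraction of test cases passed."""
    return sum(test_results) / len(test_results)

def binary_code_reward(test_results: list[bool], threshold: float = 0.99) -> float:
    """Binary signal: 1.0 only for (near) perfect solutions."""
    return 1.0 if code_reward(test_results) > threshold else 0.0

results = [True] * 7 + [False] * 3
assert code_reward(results) == 0.7     # 7 of 10 tests pass
assert binary_code_reward(results) == 0.0
assert binary_code_reward([True] * 10) == 1.0
```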
ioi_code_reward and cf_code_reward handle competitive programming formats from IOI and CodeForces respectively. These involve more complex grading (subtask scores, partial credit) and require specific execution infrastructure (Piston or Morph sandboxes).
The Reward Registry Pattern
Rewards are accessed through a registry pattern in get_reward_funcs(). The config file specifies reward names and weights:
reward_funcs:
- accuracy
- format
- tag_count
reward_weights:
- 1.0
- 1.0
- 1.0
The registry maps names to functions, applying any necessary configuration (like cosine parameters or repetition n-gram size). This clean interface makes it easy to experiment with different reward combinations without modifying training code.
Dataset Mixtures: Combining Multiple Sources
Real training rarely uses a single dataset. Open R1 supports weighted dataset mixtures through its configuration system.
The mixture config specifies multiple datasets with sampling weights:
dataset_mixture:
datasets:
- id: open-r1/OpenR1-Math-220k
weight: 0.6
columns: [problem, solution]
- id: open-r1/codeforces-cots
weight: 0.4
columns: [problem, solution]
seed: 42
test_split_size: 0.1
The loading logic shuffles each dataset, subsamples according to weight, concatenates, and shuffles again. This ensures training sees examples from all sources in proportion to their weights, with randomization preventing ordering effects.
Weights don't need to sum to 1.0—they're proportional. Setting weight 0.6 and 0.4 means 60% of training examples come from math, 40% from code.
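The mixture logic described above can be sketched like this, under the assumption that each weight acts as a per-dataset sampling fraction (with equal-sized sources this yields the 60/40 split from the example):

```python
import random

def mix_datasets(datasets: dict[str, list], weights: dict[str, float],
                 seed: int = 42) -> list:
    """Shuffle each source, subsample by weight, concatenate, shuffle again."""
    rng = random.Random(seed)
    mixed = []
    for name, rows in datasets.items():
        rows = rows[:]          # don't mutate the caller's list
        rng.shuffle(rows)
        mixed.extend(rows[:int(len(rows) * weights[name])])
    rng.shuffle(mixed)          # final shuffle prevents ordering effects
    return mixed

math = [("math", i) for i in range(100)]
code = [("code", i) for i in range(100)]
out = mix_datasets({"math": math, "code": code}, {"math": 0.6, "code": 0.4})
assert sum(1 for src, _ in out if src == "math") == 60
```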
Data Generation: Creating Reasoning Traces at Scale
Step 1 required generating 350,000 reasoning traces from DeepSeek-R1. The generate.py module handles this using Distilabel, a library for synthetic data generation.
The Generation Pipeline
The pipeline:
- Loads a source dataset (e.g., NuminaMath problems)
- Initializes a vLLM backend pointing to the generation model
- Creates a TextGeneration step with appropriate templates
- Runs generation in batches, optionally distributed via Ray
- Pushes results to the Hugging Face Hub
For large-scale generation (millions of examples), Ray distributes work across multiple nodes, each running a vLLM instance. This parallelism is necessary because generating long reasoning traces is slow—even with vLLM, a single 8xH100 node might only generate a few thousand traces per hour.
Data Decontamination
Before training, datasets must be decontaminated—removing examples that overlap with evaluation benchmarks. Open R1 provides decontaminate.py which:
- Builds 8-gram lookup tables from training data
- Checks each evaluation benchmark problem for n-gram overlap
- Removes contaminated training examples
- Pushes the cleaned dataset
This is critical for valid evaluation. Without decontamination, models might achieve high benchmark scores by memorizing answers rather than learning to reason.
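The core of n-gram decontamination is a set intersection (a minimal sketch; the real script handles normalization and scale far more carefully):

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_example: str, benchmark_ngrams: set, n: int = 8) -> bool:
    """Flag a training example if any of its 8-grams appears in an eval set."""
    return bool(ngrams(train_example, n) & benchmark_ngrams)

benchmark = ("find the sum of all positive integers less than "
             "one hundred that are divisible by seven")
assert is_contaminated(benchmark, ngrams(benchmark))          # exact overlap
assert not is_contaminated("an unrelated graph problem", ngrams(benchmark))
```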
Pass Rate Filtering: Quality Control for Training Data
Before GRPO training, Open R1 supports filtering datasets by generating completions and computing pass rates on verifiable tasks. This ensures the training data contains a mix of solvable and challenging problems—not problems that are too easy (always solved) or too hard (never solved).
The filtering pipeline:
- Generate multiple completions for each problem using the current model
- Verify each completion against ground truth or test cases
- Compute pass rate (fraction of correct completions)
- Filter to keep problems within a desired pass rate range (e.g., 10%-60%)
Problems with very high pass rates (>90%) provide little learning signal—the model already solves them consistently. Problems with very low pass rates (<10%) may be too hard for productive learning. The sweet spot is problems the model sometimes solves, creating meaningful reward variance within groups.
This filtering is particularly important for curriculum learning approaches, where you progressively train on harder problems as the model improves.
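Given precomputed pass rates, the filtering step itself is a one-liner (the 10%-60% boundaries below are the illustrative values from the example above):

```python
def filter_by_pass_rate(problems: list, pass_rates: list[float],
                        low: float = 0.1, high: float = 0.6) -> list:
    """Keep problems the model sometimes, but not always, solves."""
    return [p for p, r in zip(problems, pass_rates) if low <= r <= high]

kept = filter_by_pass_rate(["easy", "medium", "hard"], [0.95, 0.4, 0.02])
assert kept == ["medium"]  # too-easy and too-hard problems are dropped
```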
Callbacks: Automated Checkpoint Management
Open R1 includes a callback system for automating checkpoint management and evaluation:
PushToHubRevisionCallback handles model checkpoint publishing:
- After each save, pushes the checkpoint to a uniquely-named Hub branch (e.g., main-step-000001000)
- Excludes optimizer states (saving bandwidth, since they're not needed for inference)
- Optionally triggers benchmark evaluation via Slurm job submission
This enables continuous evaluation during training. Each checkpoint gets its own Hub branch, and benchmark jobs can run in parallel with training. You can track model progress across training by comparing benchmark scores at different steps.
The callback system is extensible—you can register custom callbacks for additional automation (logging, alerts, custom metrics).
Evaluation: Reproducing Benchmark Results
Open R1 integrates with LightEval for standardized evaluation. The primary benchmarks are:
| Benchmark | Description | Samples per Query |
|---|---|---|
| AIME 2024 | American math competition | 64 |
| AIME 2025 | American math competition (latest) | 64 |
| MATH-500 | Mathematical problem solving | 4 |
| GPQA Diamond | Graduate-level science QA | 8 |
| LiveCodeBench v5 | Code generation | 16 |
The high sample counts (especially 64 for AIME) reflect reasoning model evaluation methodology. Because inference involves sampling, results are stochastic. Averaging pass@1 over many samples reduces variance, giving a reliable estimate of single-attempt accuracy; pass@k, by contrast, estimates the probability that at least one of k samples is correct.
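The standard unbiased pass@k estimator from the Codex paper makes this concrete: generate n samples, count c correct ones, and compute the probability that a random draw of k samples contains at least one success.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n generations with c correct, passes."""
    if n - c < k:
        return 1.0  # not enough failures to fill an all-wrong draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1, the estimator reduces to the fraction of correct samples:
assert pass_at_k(64, 32, 1) == 0.5
# pass@8 is much higher than pass@1 for the same model:
assert pass_at_k(64, 32, 8) > pass_at_k(64, 32, 1)
```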
Open R1 also provides IOI24—a benchmark of very hard problems from international olympiads. This tests the ceiling of reasoning capabilities, where even R1-level models struggle.
Reproducing DeepSeek's Numbers
Open R1's evaluation reproduces DeepSeek's reported results within 1-3 standard deviations:
| Model | AIME 2024 (Open R1) | AIME 2024 (DeepSeek) |
|---|---|---|
| R1-Distill-Qwen-7B | 50.8 | 55.5 |
| R1-Distill-Qwen-32B | 69.7 | 72.6 |
The small differences likely reflect sampling variance and potentially different evaluation protocols (exact prompts, temperature settings). The key point is that results are in the same ballpark, validating that Open R1's implementation is correct.
Distributed Training: Scaling to Large Models
Open R1 provides multiple parallelism strategies through Accelerate configs.
DeepSpeed ZeRO Stages
ZeRO-2 shards optimizer states and gradients across GPUs. Good for models that fit in memory but need gradient accumulation. Lower communication overhead than ZeRO-3.
ZeRO-3 additionally shards model parameters, enabling training of models larger than single-GPU memory. Higher communication overhead but necessary for 32B+ models.
The choice depends on model size and available hardware:
- 1.5B models: ZeRO-2 or even DDP on 8xH100
- 7B models: ZeRO-2 with gradient checkpointing
- 32B+ models: ZeRO-3 required
Multi-Node GRPO Architecture
For GRPO specifically, multi-node training uses an N+1 architecture:
- 1 node runs the vLLM generation server
- N nodes run training
The Slurm script handles this coordination:
sbatch --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct \
--task grpo --config demo --accelerator zero2 --dp 8 --tp 1
This launches 2 nodes total: 1 for vLLM (with tensor parallelism if needed), 1 for training (with data parallelism).
Code Interpreter Integration: Training with Execution Feedback
A distinctive feature of Open R1 is integrated code execution during training. Models can generate code, have it executed, and receive reward based on test results—all within the training loop.
Sandbox Providers
Three execution providers are supported:
E2B - Fast cloud-based Python execution with good rate limits. Best for Python-focused training. Requires an API key.
Morph - Cloud-based sandboxes with broader language support (Python, JavaScript, C++, Rust). Better for competitive programming problems that require compiled languages.
Local - Direct subprocess execution on the training machine. No API costs, but requires careful security consideration since it executes untrusted code locally.
For production training, cloud providers (E2B or Morph) are recommended—they provide isolated environments that safely execute arbitrary generated programs without risking the training infrastructure.
Dataset Requirements for Code Training
Datasets for code reward training must include a verification_info column with test cases:
{
"language": "python",
"test_cases": [
{
"input": "5\n1 2 3 4 5\n",
"output": "15\n",
"type": "stdin_stdout"
}
]
}
The type field specifies how inputs/outputs are provided (stdin/stdout is most common for competitive programming). Multiple test cases per problem enable partial credit scoring.
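A local runner for this format can be sketched as follows. This mirrors the "local" provider idea only; it is a hypothetical helper, not the E2B/Morph API, and carries the same caveat about executing untrusted code:

```python
import subprocess
import sys

def run_stdin_stdout_case(program: str, case: dict, timeout: float = 10.0) -> bool:
    """Run a Python program against one stdin/stdout test case and
    check its output. Illustrative only: no sandboxing is applied."""
    proc = subprocess.run(
        [sys.executable, "-c", program],
        input=case["input"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0 and proc.stdout == case["output"]

case = {"input": "5\n1 2 3 4 5\n", "output": "15\n", "type": "stdin_stdout"}
program = "n = int(input()); print(sum(map(int, input().split())))"
assert run_stdin_stdout_case(program, case)
```

The pass rate over all of a problem's test cases then becomes the code reward described earlier.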
Execution Architecture
To handle rate limits and parallelism, Open R1 supports router services. A CPU node runs a router that manages execution requests:
Training Node → Router → E2B/Morph API
→ Router → E2B/Morph API
→ Router → E2B/Morph API
The router batches requests, manages API quotas, and distributes load. All training jobs can share a single router, ensuring coordinated resource usage.
Competitive Programming: IOI and CodeForces
For harder problems, Open R1 includes specialized reward functions:
ioi_code_reward handles International Olympiad in Informatics format:
- C++ code extraction with automatic header inclusion
- Subtask-based scoring (partial credit)
- Piston or Morph backends for execution
cf_code_reward handles CodeForces problems:
- Multiple scoring modes (pass/fail, partial, weighted sum)
- Test batching for efficiency (stop early on failure)
- Language detection from problem metadata
These enable training on the hardest competitive programming problems, where even R1-level models struggle.
Connection to Theoretical Foundations
Open R1 is the practical manifestation of concepts from the GRPO and reasoning training literature. Let me explicitly connect the implementation to theory.
GRPO: Theory to Practice
In my GRPO post, I explained how GRPO eliminates the critic by using group statistics:
PPO: advantage = reward - V(state), where V is a learned value function (the critic).

GRPO: advantage_i = (reward_i - mean(group rewards)) / std(group rewards), computed directly from the group of completions.
Open R1's num_generations: 16 parameter is the group size. TRL's GRPOTrainer generates 16 completions per prompt, computes rewards for each, and uses the mean as the baseline. No value function needed.
The reward_weights correspond to combining multiple reward signals. When config specifies reward_funcs: [accuracy, format, tag_count] with reward_weights: [1.0, 1.0, 1.0], the total reward is:
total = 1.0 * accuracy + 1.0 * format + 1.0 * tag_count
This weighted sum becomes the reward for advantage computation.
RLVR: Verifiable Rewards in Action
The RLVR (Reinforcement Learning with Verifiable Rewards) concept emphasizes training on domains where correctness can be automatically verified. Open R1's reward functions are precisely this:
- accuracy_reward - Mathematical verification via symbolic comparison
- code_reward - Code verification via test execution
- ioi_code_reward / cf_code_reward - Competitive programming verification
No learned reward model, no human preferences—just automated verification. This is what enabled R1-Zero to develop reasoning without explicit reasoning supervision.
Rule-Based Rewards: DeepSeek's Approach
DeepSeek's technical report emphasized simple rule-based rewards. Open R1 implements these directly:
- format_reward - Regex matching for tag structure
- tag_count_reward - Counting structural elements
- reasoning_steps_reward - Pattern matching for reasoning indicators
These provide training signal without any learned components. The model learns to produce well-structured reasoning because those patterns correlate with higher rewards.
Practical Usage: Getting Started
If you want to run Open R1, here's the minimal path:
Installation
# Create environment
uv venv openr1 --python 3.11 && source openr1/bin/activate
# Install vLLM and Flash Attention
uv pip install vllm==0.8.5.post1
uv pip install setuptools && uv pip install flash-attn --no-build-isolation
# Install Open R1
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
SFT Training
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
src/open_r1/sft.py \
--config recipes/OpenR1-Distill-7B/sft/config_distill.yaml
GRPO Training
For single-node with colocated vLLM:
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
src/open_r1/grpo.py \
--config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \
--vllm_mode colocate
Evaluation
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
lighteval vllm "model_name=$MODEL,dtype=bfloat16,max_model_length=32768" \
"lighteval|aime24|0|0" --use-chat-template
Related Articles
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.
GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.
Reasoning Models: A Brief Framework
Understanding o1, o3, DeepSeek R1, and the shift from pre-training scaling to inference-time and training-time scaling—the defining trend of 2025.
HuggingFace TRL: A Deep Dive into the Transformer Reinforcement Learning Library
In-depth exploration of HuggingFace TRL's architecture—examining its trainer ecosystem from SFT to GRPO, data collators, reward functions, vLLM integration, and the internals that power modern LLM fine-tuning workflows.
vLLM in Production: The Complete Guide to High-Performance LLM Serving
Hands-on guide to deploying vLLM in production—covering architecture internals, configuration tuning, Kubernetes deployment, monitoring, and troubleshooting.