Open R1: Hugging Face's Open Reproduction of DeepSeek-R1
Detailed analysis of Hugging Face's Open R1 project—the fully open reproduction of DeepSeek-R1. Learn how it implements GRPO, reward functions, distillation, and the complete training pipeline for reasoning models.
Introduction: Why Open R1 Matters
When DeepSeek released R1 in January 2025, it demonstrated that reasoning models could achieve state-of-the-art performance on mathematical, coding, and scientific benchmarks. But while the model weights and technical report were public, the training recipes, datasets, and infrastructure remained proprietary. Hugging Face's Open R1 project aims to change that—building a fully open reproduction that the community can learn from, modify, and build upon.
Open R1 isn't just another fine-tuning script. It's a complete implementation of the training pipeline that produced reasoning capabilities in DeepSeek-R1, including:
- SFT distillation from R1's reasoning traces
- GRPO (Group Relative Policy Optimization) for reinforcement learning
- Rule-based and verifiable reward functions for math and code
- Distributed training infrastructure with vLLM integration
- Evaluation pipelines that reproduce DeepSeek's reported benchmarks
This post provides a deep technical analysis of Open R1's architecture and implementation. If you're unfamiliar with the theoretical foundations, I recommend reading Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR first, as Open R1 is essentially the practical implementation of those concepts.
The Three-Step Plan: From Distillation to Pure RL
Open R1 follows DeepSeek's technical report as a guide, breaking the reproduction into three phases:
Step 1: SFT Distillation (Completed)

The first milestone was replicating the R1-Distill models by fine-tuning on reasoning traces generated by DeepSeek-R1. This produced the Mixture-of-Thoughts dataset—350,000 verified reasoning traces spanning mathematics, coding, and science. The resulting OpenR1-Distill-7B matches DeepSeek's distilled model on key benchmarks.

Step 2: Pure RL Pipeline (In Progress)

The more ambitious goal is replicating R1-Zero—the model that developed reasoning capabilities purely through reinforcement learning, without any reasoning traces in the training data. This requires GRPO training at scale with carefully designed reward functions.

Step 3: Multi-Stage Training (Planned)

The final phase will demonstrate the complete pipeline: taking a base model through SFT warmup, then RL fine-tuning, matching DeepSeek's multi-stage approach.
This phased approach is pragmatic. Distillation validates that the training infrastructure works and produces competitive models. Pure RL is harder—it requires the model to discover reasoning patterns through exploration, with only correctness signals as guidance.
Architecture Overview: What's in the Repository
Open R1 follows a deliberately simple design philosophy. Rather than building a complex framework, it provides clean training scripts that leverage existing libraries—particularly TRL (Transformers Reinforcement Learning) for the trainer implementations.
The core components are:
Training Scripts
- sft.py - Supervised fine-tuning using TRL's SFTTrainer
- grpo.py - GRPO training using TRL's GRPOTrainer
- generate.py - Synthetic data generation using Distilabel
Supporting Infrastructure
- rewards.py - Comprehensive reward function library (14+ implementations)
- configs.py - Configuration dataclasses extending TRL's configs
- utils/ - Dataset loading, model utilities, callbacks, evaluation
Training Recipes
- YAML configurations for different models and tasks
- Accelerate configs for DDP, FSDP, ZeRO-2, and ZeRO-3
- Slurm scripts for cluster training
The design emphasizes reproducibility. Every training run can be launched with a single command pointing to a YAML config file. Hyperparameters are documented, and the recipes folder provides working configurations for various model sizes.
Understanding the SFT Pipeline
Supervised fine-tuning in Open R1 is straightforward but includes important details that affect downstream GRPO training.
The Training Flow
The SFT script follows a standard pattern: load dataset, load tokenizer, load model, initialize trainer, train, save. What makes it notable is the attention to chat template configuration.
DeepSeek's R1 models use a specific chat template that affects how reasoning traces are formatted. The model expects outputs structured as:
<think>
[reasoning steps]
</think>
<answer>
[final answer]
</answer>
Getting this right matters for two reasons. First, the format reward functions in GRPO training check for these exact tags. Second, the original DeepSeek chat template has quirks—it hides the thinking block content and prefills the assistant response with <think>, which interferes with format rewards. Open R1 provides a corrected chat template that includes the full reasoning block in completions.
Dataset: Mixture-of-Thoughts
The Mixture-of-Thoughts dataset represents Step 1's completion. It contains 350,000 reasoning traces generated by DeepSeek-R1 across three domains:
- Mathematics - Problems from NuminaMath and similar sources
- Coding - Programming problems with test cases
- Science - Scientific reasoning problems (GPQA-style)
Each example includes the problem, the complete reasoning trace (including the thinking block), and the final answer. The traces are verified—meaning the final answers were checked against ground truth, and incorrect traces were filtered out.
This verification step is crucial. Distillation from unverified traces would teach the model to generate plausible-sounding but incorrect reasoning. By filtering to only correct traces, the model learns patterns that actually lead to right answers.
Model Selection and Scaling
Open R1 supports various base models, but the primary recipes target the Qwen 2.5 family:
- Qwen2.5-1.5B - For development and debugging
- Qwen2.5-7B - The main distillation target (matching DeepSeek-R1-Distill-Qwen-7B)
- Qwen2.5-32B - For scaling experiments
The choice of Qwen as the base architecture matches DeepSeek's distillation targets. Importantly, Open R1 uses open-r1/Qwen2.5-Math-7B-RoPE-300k as the base for the 7B recipe—a Qwen 2.5 Math model with extended RoPE context length (300k tokens). This extended context is necessary because reasoning traces can be very long—mathematical proofs often exceed 10,000 tokens.
GRPO Training: From Theory to Implementation
GRPO is where Open R1 implements the core innovation behind R1-style training. If you've read my GRPO deep dive, you know the algorithm eliminates the value function by using group statistics for advantage estimation. Open R1 makes this practical.
How the GRPOTrainer Works
The training script is surprisingly compact—about 180 lines. It:
- Loads the dataset and tokenizer
- Loads the model with appropriate optimizations (Flash Attention, gradient checkpointing)
- Initializes reward functions from the registry
- Formats prompts into conversations with optional system prompts
- Initializes the GRPOTrainer from TRL
- Runs the training loop
The heavy lifting happens in TRL's GRPOTrainer, which handles:
- Generating multiple completions per prompt (the "group" in GRPO)
- Computing rewards for each completion
- Estimating advantages using group statistics
- Running the PPO-style policy update with clipping
Open R1's contribution is the reward function library and the infrastructure to run this at scale.
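The group-relative advantage step at the heart of this loop can be sketched in a few lines (a simplified illustration, not TRL's actual implementation):

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO advantage sketch: the baseline is the group mean reward,
    normalized by the group standard deviation. No value function needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four completions scored by the reward functions:
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
# Correct completions get positive advantages, incorrect ones negative
```

Completions that beat their group's average are reinforced; those below it are suppressed. This is why within-group reward variance matters—if all completions score identically, every advantage is zero and no learning occurs.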
The vLLM Integration
GRPO requires generating many completions per training step. For a batch of 16 prompts with 16 generations each, that's 256 completions per step. Standard HuggingFace generate() would be painfully slow.
Open R1 uses vLLM as the generation backend. TRL's vLLM integration runs a vLLM server either colocated with training (for single-node experiments) or on separate nodes (for multi-node scaling). The training process sends prompts to vLLM via HTTP, which handles batched, continuous generation with PagedAttention.
For multi-node GRPO, the architecture is:
- N nodes running the training loop (gradient computation)
- 1 node running the vLLM server (generation)
This separation allows scaling generation independently from training. The vLLM server can use tensor parallelism to handle larger models, while training nodes use data parallelism for throughput.
Configuration Deep Dive
Looking at a representative GRPO config reveals the key hyperparameters:
Generation Parameters
- num_generations: 16 - Generate 16 completions per prompt
- temperature: 0.7 - Moderate temperature for diversity
- max_completion_length: 2048 - Maximum tokens per completion
Training Parameters
- learning_rate: 1.0e-6 - Very low learning rate (typical for RL fine-tuning)
- per_device_train_batch_size: 16 - Prompts per device per step
- gradient_accumulation_steps: 4 - Effective batch of 64 prompts
Reward Configuration
- reward_funcs: [accuracy, format, tag_count] - Which rewards to use
- reward_weights: [1.0, 1.0, 1.0] - How to weight them
The generation count (16) is a critical tradeoff. More generations give better advantage estimates but cost more compute. DeepSeek used larger groups (reportedly 64+), but 16 works well for smaller-scale experiments.
The learning rate (1e-6) is deliberately tiny. RL fine-tuning is notoriously unstable—large updates can collapse the model's capabilities. The low learning rate, combined with PPO's clipping, keeps updates conservative.
The Reward Function Library: Open R1's Core Innovation
The rewards.py module is perhaps Open R1's most valuable contribution. It provides 14 production-ready reward functions covering mathematics, code execution, and formatting. Understanding these rewards is essential for building reasoning models.
Mathematical Accuracy Rewards
The accuracy_reward function implements mathematical verification using the math_verify library. It doesn't just check string equality—it parses LaTeX expressions, normalizes mathematical notation, and verifies symbolic equivalence.
The verification flow:
- Parse the ground truth answer using LaTeX extraction
- Parse the model's response, looking for \boxed{} answers first
- Normalize both answers (handle equivalent notations)
- Compare symbolically using math_verify
This matters because "1/2" and "0.5" and "0.500" and "\frac{1}{2}" are all the same answer. String matching would mark these wrong; symbolic verification correctly identifies them as equivalent.
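As a toy illustration of why normalization matters, here is a deliberately tiny, hypothetical normalizer (math_verify handles vastly more notation than this; the helper name and regex are illustrative assumptions):

```python
import re
from fractions import Fraction

def normalize(ans: str):
    """Hypothetical mini-normalizer: map \\frac{a}{b}, a/b, and decimal
    strings to one exact representation. math_verify is far more general."""
    m = re.fullmatch(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", ans.strip())
    if m:
        return Fraction(int(m.group(1)), int(m.group(2)))
    try:
        return Fraction(ans.strip())
    except ValueError:
        return None  # unparseable: skip rather than guess

# "1/2", "0.5", "0.500", and "\frac{1}{2}" all map to the same value
assert len({normalize(a) for a in ["1/2", "0.5", "0.500", r"\frac{1}{2}"]}) == 1
```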
The function also handles edge cases gracefully. If the ground truth can't be parsed (malformed LaTeX in the dataset), it returns None to skip that example rather than assigning an arbitrary reward. This prevents training on noisy signals.
Format Rewards: Teaching Structure
DeepSeek R1's distinctive feature is its structured output—thinking inside <think> tags, answers inside <answer> tags. Three reward functions encourage this:
format_reward - Binary reward for perfect formatting. The response must exactly match the pattern:
<think>
[content]
</think>
<answer>
[content]
</answer>
This is strict—extra whitespace or missing newlines cause failure.
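A strict check like this is a single anchored regex. The exact newline placement below is an assumption based on the pattern shown above, not Open R1's literal source:

```python
import re

# Strict format check: <think> block followed by <answer> block,
# each on its own lines (newline placement assumed from the description)
FORMAT_RE = re.compile(r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """Binary reward: 1.0 for a perfectly formatted completion, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0

assert format_reward("<think>\nsteps\n</think>\n<answer>\n42\n</answer>") == 1.0
assert format_reward("<think>steps</think><answer>42</answer>") == 0.0  # missing newlines
```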
tag_count_reward - Partial credit for structural elements. It awards 0.25 points each for:
- Having exactly one <think>\n opening tag
- Having exactly one \n</think>\n closing tag
- Having exactly one \n<answer>\n opening tag
- Having exactly one \n</answer> closing tag
This softer signal helps early in training when the model is learning the format.
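The partial-credit scheme reduces to counting substrings (a sketch of the scoring described above; the exact whitespace conventions are an assumption):

```python
def tag_count_reward(text: str) -> float:
    """Award 0.25 per correctly placed structural tag, up to 1.0."""
    score = 0.0
    if text.count("<think>\n") == 1:
        score += 0.25
    if text.count("\n</think>\n") == 1:
        score += 0.25
    if text.count("\n<answer>\n") == 1:
        score += 0.25
    if text.count("\n</answer>") == 1:
        score += 0.25
    return score

assert tag_count_reward("<think>\nsteps\n</think>\n<answer>\n42\n</answer>") == 1.0
assert tag_count_reward("<think>\nsteps\n</think>") == 0.25  # partial credit
```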
reasoning_steps_reward - Encourages step-by-step reasoning by detecting patterns like "Step 1:", numbered lists, bullet points, and transition words ("First,", "Therefore,"). It caps at 1.0 after detecting 3+ reasoning indicators.
Length-Based Rewards: Preventing Overthinking
Two rewards address a subtle problem: models under RL pressure sometimes generate unnecessarily long responses, padding with repetitive reasoning or redundant verification steps.
len_reward (from Kimi 1.5) computes:
- For correct answers: reward = 0.5 - (length - min_length) / (max_length - min_length)
- For incorrect answers: reward = min(0, above_formula)
Correct answers are rewarded for brevity (shorter = higher reward). Incorrect answers are never rewarded for length, only penalized if very long.
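The formulas above translate directly into code (a sketch of the Kimi 1.5 scheme as described, with illustrative parameter values):

```python
def len_reward(is_correct: bool, length: int, min_len: int, max_len: int) -> float:
    """Kimi 1.5-style length reward: brevity bonus for correct answers,
    length-only penalty (never a bonus) for incorrect ones."""
    lam = 0.5 - (length - min_len) / (max_len - min_len)
    return lam if is_correct else min(0.0, lam)

# A short correct answer earns the maximum brevity bonus
assert len_reward(True, 100, 100, 1100) == 0.5
# A short incorrect answer earns nothing (clamped at zero)
assert len_reward(False, 100, 100, 1100) == 0.0
# A maximally long correct answer is penalized
assert len_reward(True, 1100, 100, 1100) == -0.5
```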
cosine_scaled_reward applies a cosine schedule to length, providing smoother gradients than the linear formula. Shorter correct solutions get rewards closer to max_value_correct, while longer ones approach min_value_correct.
soft_overlong_punishment (from the DAPO paper) implements a different approach: it only penalizes, never rewards. The logic is:
- If completion length ≤ (max_length - cache_size): reward = 0 (no penalty)
- If (max_length - cache_size) < length ≤ max_length: reward scales linearly from 0 to -1
- If length > max_length: reward = -1 (full penalty)
This prevents the model from gaming length rewards by generating artificially short responses. It only intervenes when completions approach the maximum allowed length.
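The piecewise logic can be sketched as follows (parameter values are illustrative, not DAPO's defaults):

```python
def soft_overlong_punishment(length: int, max_len: int, cache: int) -> float:
    """DAPO-style overlong penalty: zero inside the length budget,
    linear ramp down to -1 inside the cache window, -1 beyond max_len."""
    if length <= max_len - cache:
        return 0.0
    if length <= max_len:
        return (max_len - cache - length) / cache  # ramps from 0 to -1
    return -1.0

assert soft_overlong_punishment(1000, 2048, 512) == 0.0   # inside budget
assert soft_overlong_punishment(1792, 2048, 512) == -0.5  # mid-ramp
assert soft_overlong_punishment(3000, 2048, 512) == -1.0  # over the limit
```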
Repetition Penalty: Avoiding Degenerate Loops
The repetition_penalty_reward implements n-gram repetition detection, crucial for preventing a known failure mode where models get stuck repeating phrases.
For English, it splits text into words and counts unique n-grams (default n=3). The penalty scales with repetition:
A response with no repetition (all unique trigrams) gets reward 0. A highly repetitive response approaches max_penalty (typically -1.0).
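A minimal English-only sketch of the n-gram scheme (the scaling function here is a simple linear assumption; the real implementation may differ):

```python
def repetition_penalty(text: str, n: int = 3, max_penalty: float = -1.0) -> float:
    """Penalize by the fraction of duplicated n-grams (English word-split only)."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    dup_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
    return max_penalty * dup_fraction

assert repetition_penalty("all words here are fully unique") == 0.0
assert repetition_penalty("so so so so so so so so") < -0.5  # heavily repetitive
```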
The function also supports Chinese via jieba segmentation, acknowledging that word boundaries work differently across languages.
Code Execution Rewards
For coding tasks, Open R1 provides execution-based rewards that actually run the generated code against test cases.
code_reward extracts code from the response, sends it to an execution sandbox (E2B or Morph), runs it against test cases, and returns the pass rate. A response passing 7 of 10 tests gets reward 0.7.
binary_code_reward thresholds the above at 0.99—only (near) perfect solutions get reward 1.0, everything else gets 0.0. This binary signal is cleaner for training but provides less gradient information.
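Stripped of the sandbox plumbing, the two scoring rules reduce to this (a sketch over precomputed test results, not the real functions, which execute code remotely):

```python
def code_reward(test_results: list[bool]) -> float:
    """Graded signal: fraction of test cases passed."""
    return sum(test_results) / len(test_results)

def binary_code_reward(test_results: list[bool], threshold: float = 0.99) -> float:
    """Binary signal: 1.0 only for (near) perfect solutions."""
    return 1.0 if code_reward(test_results) > threshold else 0.0

results = [True] * 7 + [False] * 3
assert code_reward(results) == 0.7     # 7 of 10 tests pass
assert binary_code_reward(results) == 0.0
assert binary_code_reward([True] * 10) == 1.0
```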
ioi_code_reward and cf_code_reward handle competitive programming formats from IOI and CodeForces respectively. These involve more complex grading (subtask scores, partial credit) and require specific execution infrastructure (Piston or Morph sandboxes).
The Reward Registry Pattern
Rewards are accessed through a registry pattern in get_reward_funcs(). The config file specifies reward names and weights:
reward_funcs:
- accuracy
- format
- tag_count
reward_weights:
- 1.0
- 1.0
- 1.0
The registry maps names to functions, applying any necessary configuration (like cosine parameters or repetition n-gram size). This clean interface makes it easy to experiment with different reward combinations without modifying training code.
Dataset Mixtures: Combining Multiple Sources
Real training rarely uses a single dataset. Open R1 supports weighted dataset mixtures through its configuration system.
The mixture config specifies multiple datasets with sampling weights:
dataset_mixture:
datasets:
- id: open-r1/OpenR1-Math-220k
weight: 0.6
columns: [problem, solution]
- id: open-r1/codeforces-cots
weight: 0.4
columns: [problem, solution]
seed: 42
test_split_size: 0.1
The loading logic shuffles each dataset, subsamples according to weight, concatenates, and shuffles again. This ensures training sees examples from all sources in proportion to their weights, with randomization preventing ordering effects.
Weights don't need to sum to 1.0—they're proportional. Setting weight 0.6 and 0.4 means 60% of training examples come from math, 40% from code.
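The mixture logic described above can be sketched like this, under the assumption that each weight acts as a per-dataset sampling fraction (with equal-sized sources this yields the 60/40 split from the example):

```python
import random

def mix_datasets(datasets: dict[str, list], weights: dict[str, float],
                 seed: int = 42) -> list:
    """Shuffle each source, subsample by weight, concatenate, shuffle again."""
    rng = random.Random(seed)
    mixed = []
    for name, rows in datasets.items():
        rows = rows[:]          # don't mutate the caller's list
        rng.shuffle(rows)
        mixed.extend(rows[:int(len(rows) * weights[name])])
    rng.shuffle(mixed)          # final shuffle prevents ordering effects
    return mixed

math = [("math", i) for i in range(100)]
code = [("code", i) for i in range(100)]
out = mix_datasets({"math": math, "code": code}, {"math": 0.6, "code": 0.4})
assert sum(1 for src, _ in out if src == "math") == 60
```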
Data Generation: Creating Reasoning Traces at Scale
Step 1 required generating 350,000 reasoning traces from DeepSeek-R1. The generate.py module handles this using Distilabel, a library for synthetic data generation.
The Generation Pipeline
The pipeline:
- Loads a source dataset (e.g., NuminaMath problems)
- Initializes a vLLM backend pointing to the generation model
- Creates a TextGeneration step with appropriate templates
- Runs generation in batches, optionally distributed via Ray
- Pushes results to the Hugging Face Hub
For large-scale generation (millions of examples), Ray distributes work across multiple nodes, each running a vLLM instance. This parallelism is necessary because generating long reasoning traces is slow—even with vLLM, a single 8xH100 node might only generate a few thousand traces per hour.
Data Decontamination
Before training, datasets must be decontaminated—removing examples that overlap with evaluation benchmarks. Open R1 provides decontaminate.py which:
- Builds 8-gram lookup tables from training data
- Checks each evaluation benchmark problem for n-gram overlap
- Removes contaminated training examples
- Pushes the cleaned dataset
This is critical for valid evaluation. Without decontamination, models might achieve high benchmark scores by memorizing answers rather than learning to reason.
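The core of n-gram decontamination is a set intersection (a minimal sketch; the real script handles normalization and scale far more carefully):

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_example: str, benchmark_ngrams: set, n: int = 8) -> bool:
    """Flag a training example if any of its 8-grams appears in an eval set."""
    return bool(ngrams(train_example, n) & benchmark_ngrams)

benchmark = ("find the sum of all positive integers less than "
             "one hundred that are divisible by seven")
assert is_contaminated(benchmark, ngrams(benchmark))          # exact overlap
assert not is_contaminated("an unrelated graph problem", ngrams(benchmark))
```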
Pass Rate Filtering: Quality Control for Training Data
Before GRPO training, Open R1 supports filtering datasets by generating completions and computing pass rates on verifiable tasks. This ensures the training data contains a mix of solvable and challenging problems—not problems that are too easy (always solved) or too hard (never solved).
The filtering pipeline:
- Generate multiple completions for each problem using the current model
- Verify each completion against ground truth or test cases
- Compute pass rate (fraction of correct completions)
- Filter to keep problems within a desired pass rate range (e.g., 10%-60%)
Problems with very high pass rates (>90%) provide little learning signal—the model already solves them consistently. Problems with very low pass rates (<10%) may be too hard for productive learning. The sweet spot is problems the model sometimes solves, creating meaningful reward variance within groups.
This filtering is particularly important for curriculum learning approaches, where you progressively train on harder problems as the model improves.
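Given precomputed pass rates, the filtering step itself is a one-liner (the 10%-60% boundaries below are the illustrative values from the example above):

```python
def filter_by_pass_rate(problems: list, pass_rates: list[float],
                        low: float = 0.1, high: float = 0.6) -> list:
    """Keep problems the model sometimes, but not always, solves."""
    return [p for p, r in zip(problems, pass_rates) if low <= r <= high]

kept = filter_by_pass_rate(["easy", "medium", "hard"], [0.95, 0.4, 0.02])
assert kept == ["medium"]  # too-easy and too-hard problems are dropped
```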
Callbacks: Automated Checkpoint Management
Open R1 includes a callback system for automating checkpoint management and evaluation:
PushToHubRevisionCallback handles model checkpoint publishing:
- After each save, pushes the checkpoint to a uniquely-named Hub branch (e.g., main-step-000001000)
- Excludes optimizer states (saving bandwidth, since they're not needed for inference)
- Optionally triggers benchmark evaluation via Slurm job submission
This enables continuous evaluation during training. Each checkpoint gets its own Hub branch, and benchmark jobs can run in parallel with training. You can track model progress across training by comparing benchmark scores at different steps.
The callback system is extensible—you can register custom callbacks for additional automation (logging, alerts, custom metrics).
Evaluation: Reproducing Benchmark Results
Open R1 integrates with LightEval for standardized evaluation. The primary benchmarks are:
| Benchmark | Description | Samples per Query |
|---|---|---|
| AIME 2024 | American math competition | 64 |
| AIME 2025 | American math competition (latest) | 64 |
| MATH-500 | Mathematical problem solving | 4 |
| GPQA Diamond | Graduate-level science QA | 8 |
| LiveCodeBench v5 | Code generation | 16 |
The high sample counts (especially 64 for AIME) reflect reasoning model evaluation methodology. Because inference involves sampling, results are stochastic. Averaging pass@1 over many samples reduces variance, giving a reliable estimate of single-attempt accuracy; pass@k, by contrast, estimates the probability that at least one of k samples is correct.
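The standard unbiased pass@k estimator from the Codex paper makes this concrete: generate n samples, count c correct ones, and compute the probability that a random draw of k samples contains at least one success.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n generations with c correct, passes."""
    if n - c < k:
        return 1.0  # not enough failures to fill an all-wrong draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1, the estimator reduces to the fraction of correct samples:
assert pass_at_k(64, 32, 1) == 0.5
# pass@8 is much higher than pass@1 for the same model:
assert pass_at_k(64, 32, 8) > pass_at_k(64, 32, 1)
```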
Open R1 also provides IOI24—a benchmark of very hard problems from international olympiads. This tests the ceiling of reasoning capabilities, where even R1-level models struggle.
Reproducing DeepSeek's Numbers
Open R1's evaluation reproduces DeepSeek's reported results within 1-3 standard deviations:
| Model | AIME 2024 (Open R1) | AIME 2024 (DeepSeek) |
|---|---|---|
| R1-Distill-Qwen-7B | 50.8 | 55.5 |
| R1-Distill-Qwen-32B | 69.7 | 72.6 |
The small differences likely reflect sampling variance and potentially different evaluation protocols (exact prompts, temperature settings). The key point is that results are in the same ballpark, validating that Open R1's implementation is correct.
Distributed Training: Scaling to Large Models
Open R1 provides multiple parallelism strategies through Accelerate configs.
DeepSpeed ZeRO Stages
ZeRO-2 shards optimizer states and gradients across GPUs. Good for models that fit in memory but need gradient accumulation. Lower communication overhead than ZeRO-3.
ZeRO-3 additionally shards model parameters, enabling training of models larger than single-GPU memory. Higher communication overhead but necessary for 32B+ models.
The choice depends on model size and available hardware:
- 1.5B models: ZeRO-2 or even DDP on 8xH100
- 7B models: ZeRO-2 with gradient checkpointing
- 32B+ models: ZeRO-3 required
Multi-Node GRPO Architecture
For GRPO specifically, multi-node training uses an N+1 architecture:
- 1 node runs the vLLM generation server
- N nodes run training
The Slurm script handles this coordination:
sbatch --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct \
--task grpo --config demo --accelerator zero2 --dp 8 --tp 1
This launches 2 nodes total: 1 for vLLM (with tensor parallelism if needed), 1 for training (with data parallelism).
Code Interpreter Integration: Training with Execution Feedback
A distinctive feature of Open R1 is integrated code execution during training. Models can generate code, have it executed, and receive reward based on test results—all within the training loop.
Sandbox Providers
Three execution providers are supported:
E2B - Fast cloud-based Python execution with good rate limits. Best for Python-focused training. Requires an API key.
Morph - Cloud-based sandboxes with broader language support (Python, JavaScript, C++, Rust). Better for competitive programming problems that require compiled languages.
Local - Direct subprocess execution on the training machine. No API costs, but requires careful security consideration since it executes untrusted code locally.
For production training, cloud providers (E2B or Morph) are recommended—they provide isolated environments that safely execute arbitrary generated programs without risking the training infrastructure.
Dataset Requirements for Code Training
Datasets for code reward training must include a verification_info column with test cases:
{
"language": "python",
"test_cases": [
{
"input": "5\n1 2 3 4 5\n",
"output": "15\n",
"type": "stdin_stdout"
}
]
}
The type field specifies how inputs/outputs are provided (stdin/stdout is most common for competitive programming). Multiple test cases per problem enable partial credit scoring.
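A local runner for this format can be sketched as follows. This mirrors the "local" provider idea only; it is a hypothetical helper, not the E2B/Morph API, and carries the same caveat about executing untrusted code:

```python
import subprocess
import sys

def run_stdin_stdout_case(program: str, case: dict, timeout: float = 10.0) -> bool:
    """Run a Python program against one stdin/stdout test case and
    check its output. Illustrative only: no sandboxing is applied."""
    proc = subprocess.run(
        [sys.executable, "-c", program],
        input=case["input"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0 and proc.stdout == case["output"]

case = {"input": "5\n1 2 3 4 5\n", "output": "15\n", "type": "stdin_stdout"}
program = "n = int(input()); print(sum(map(int, input().split())))"
assert run_stdin_stdout_case(program, case)
```

The pass rate over all of a problem's test cases then becomes the code reward described earlier.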
Execution Architecture
To handle rate limits and parallelism, Open R1 supports router services. A CPU node runs a router that manages execution requests:
Training Node → Router → E2B/Morph API
→ Router → E2B/Morph API
→ Router → E2B/Morph API
The router batches requests, manages API quotas, and distributes load. All training jobs can share a single router, ensuring coordinated resource usage.
Competitive Programming: IOI and CodeForces
For harder problems, Open R1 includes specialized reward functions:
ioi_code_reward handles International Olympiad in Informatics format:
- C++ code extraction with automatic header inclusion
- Subtask-based scoring (partial credit)
- Piston or Morph backends for execution
cf_code_reward handles CodeForces problems:
- Multiple scoring modes (pass/fail, partial, weighted sum)
- Test batching for efficiency (stop early on failure)
- Language detection from problem metadata
These enable training on the hardest competitive programming problems, where even R1-level models struggle.
Connection to Theoretical Foundations
Open R1 is the practical manifestation of concepts from the GRPO and reasoning training literature. Let me explicitly connect the implementation to theory.
GRPO: Theory to Practice
In my GRPO post, I explained how GRPO eliminates the critic by using group statistics:
PPO: advantage = reward - V(state), where V is a learned value function (the critic).

GRPO: advantage_i = (reward_i - mean(group rewards)) / std(group rewards), computed directly from the group of completions.
Open R1's num_generations: 16 parameter is the group size. TRL's GRPOTrainer generates 16 completions per prompt, computes rewards for each, and uses the mean as the baseline. No value function needed.
The reward_weights correspond to combining multiple reward signals. When config specifies reward_funcs: [accuracy, format, tag_count] with reward_weights: [1.0, 1.0, 1.0], the total reward is:
total = 1.0 * accuracy + 1.0 * format + 1.0 * tag_count
This weighted sum becomes the reward for advantage computation.
RLVR: Verifiable Rewards in Action
The RLVR (Reinforcement Learning with Verifiable Rewards) concept emphasizes training on domains where correctness can be automatically verified. Open R1's reward functions are precisely this:
- accuracy_reward - Mathematical verification via symbolic comparison
- code_reward - Code verification via test execution
- ioi_code_reward / cf_code_reward - Competitive programming verification
No learned reward model, no human preferences—just automated verification. This is what enabled R1-Zero to develop reasoning without explicit reasoning supervision.
Rule-Based Rewards: DeepSeek's Approach
DeepSeek's technical report emphasized simple rule-based rewards. Open R1 implements these directly:
- format_reward - Regex matching for tag structure
- tag_count_reward - Counting structural elements
- reasoning_steps_reward - Pattern matching for reasoning indicators
These provide training signal without any learned components. The model learns to produce well-structured reasoning because those patterns correlate with higher rewards.
Practical Usage: Getting Started
If you want to run Open R1, here's the minimal path:
Installation
# Create environment
uv venv openr1 --python 3.11 && source openr1/bin/activate
# Install vLLM and Flash Attention
uv pip install vllm==0.8.5.post1
uv pip install setuptools && uv pip install flash-attn --no-build-isolation
# Install Open R1
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
SFT Training
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
src/open_r1/sft.py \
--config recipes/OpenR1-Distill-7B/sft/config_distill.yaml
GRPO Training
For single-node with colocated vLLM:
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
src/open_r1/grpo.py \
--config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \
--vllm_mode colocate
Evaluation
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
lighteval vllm "model_name=$MODEL,dtype=bfloat16,max_model_length=32768" \
"lighteval|aime24|0|0" --use-chat-template
Related Articles
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.
GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.
Reasoning Models: A Brief Framework
Understanding o1, o3, DeepSeek R1, and the shift from pre-training scaling to inference-time and training-time scaling—the defining trend of 2025.
HuggingFace TRL: A Deep Dive into the Transformer Reinforcement Learning Library
In-depth exploration of HuggingFace TRL's architecture—examining its trainer ecosystem from SFT to GRPO, data collators, reward functions, vLLM integration, and the internals that power modern LLM fine-tuning workflows.
vLLM in Production: The Complete Guide to High-Performance LLM Serving
Hands-on guide to deploying vLLM in production—covering architecture internals, configuration tuning, Kubernetes deployment, monitoring, and troubleshooting.