RLVR: Reinforcement Learning with Verifiable Rewards
Understanding Reinforcement Learning with Verifiable Rewards (RLVR)—the technique behind DeepSeek R1's reasoning capabilities, process reward models, and when to use verifiable vs human feedback.
Beyond Human Feedback
Traditional RLHF relies on human preferences to guide learning. But for tasks with objectively correct answers—math, code, logic—we can do better: verify correctness directly.
The fundamental insight: In RLHF, you train a reward model to predict what humans would prefer, then optimize against that proxy. But proxies are imperfect—the reward model might learn spurious correlations (longer responses score higher) rather than true quality. RLVR sidesteps this entirely: for math, the answer is either 5 or it isn't. No proxy needed. This directness eliminates an entire class of reward hacking failure modes.
Why this matters for reasoning: Teaching models to reason is hard because reasoning is a process, not an output. You can't label "good reasoning" the way you label "helpful response." But you can check if the reasoning led to a correct answer. RLVR bets that if you reward correct answers strongly enough, good reasoning will emerge as the strategy that most reliably produces them. DeepSeek R1 proved this bet correct—reasoning behaviors emerged without any explicit reasoning demonstrations.
From research: "Reinforcement Learning with Verifiable Rewards (RLVR) is an approach to training AI systems, particularly large language models, by providing them with clear, objective feedback based on whether their outputs meet predefined correctness criteria."
RLVR has powered some of the most impressive reasoning advances, including DeepSeek R1.
What Makes RLVR Different
RLHF vs RLVR
| Aspect | RLHF | RLVR |
|---|---|---|
| Feedback source | Human preferences | Programmatic verification |
| Signal type | Relative (A > B) | Absolute (correct/incorrect) |
| Cost | Expensive (human labelers) | Cheap (automated) |
| Scalability | Limited by labeling throughput | Scales with compute |
| Domain | Any task | Verifiable tasks only |
| Noise | Human disagreement | Verifier errors |
From research: "RLVR replaces learned reward models with programmatic verifiers. This approach trades generality (RLHF works for any task) for efficiency (RLVR is 3x cheaper on verifiable tasks)."
The Binary Reward
From research: "Verifiable Rewards are simple functions that provide a clear-cut, binary ground truth signal—typically a '1' (correct) or '0' (incorrect)—to indicate whether a model's output meets a predefined correctness criterion."
The elegance of binary signals: You might think binary rewards are too sparse—how does the model learn from "wrong, wrong, wrong, wrong, correct"? The answer lies in sampling: generate many solutions (often 8-16 per problem), identify which ones are correct, and reinforce those. The model doesn't learn from each individual failure; it learns from the contrast between successful and unsuccessful attempts. This is why RLVR pairs naturally with GRPO, which compares responses within a group.
Why not partial credit: It's tempting to give partial rewards—0.5 for "close" answers, 0.8 for "mostly right." But partial credit requires judgment about what "close" means, reintroducing the problems RLVR avoids. Is "x = 4.9" close to "x = 5"? Depends on context. Binary rewards are unambiguous. In practice, partial credit often hurts training by rewarding confidently wrong answers that happen to be "close" to correct ones.
def math_reward(problem: str, solution: str) -> float:
    """Simple binary reward for math problems."""
    # get_ground_truth and extract_final_answer are dataset-specific helpers:
    # one looks up the reference answer, the other parses the model's final answer.
    expected_answer = get_ground_truth(problem)
    model_answer = extract_final_answer(solution)
    # Binary signal: 1.0 for a correct final answer, 0.0 otherwise
    return 1.0 if model_answer == expected_answer else 0.0
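To make the "learn from contrast" point concrete, here is a minimal sketch (illustrative only) of how a group of binary rewards becomes group-relative advantages, the same normalization GRPO uses and that the training loop later in this article implements:

def group_advantages(rewards: list) -> list:
    """Normalize binary rewards within a sampled group: correct solutions get a
    positive advantage, incorrect ones a negative advantage."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Eight sampled solutions to one problem, two verified correct:
print(group_advantages([0, 0, 1, 0, 0, 1, 0, 0]))
# The two correct samples get positive advantages (~1.73); the six failures get ~-0.58.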
RLVR in DeepSeek R1
DeepSeek used RLVR (with GRPO) to train R1's reasoning capabilities:
From research: "RLVR is among the leading training strategies for injecting learning signals into LLMs, successfully employed by models such as DeepSeek R1 and Tülu 3."
The R1-Zero experiment: DeepSeek's most surprising finding was R1-Zero: a model trained with only RLVR, no supervised fine-tuning on reasoning examples. The starting point was the pretrained base model (DeepSeek-V3-Base), with no instruction tuning and no explicit reasoning training. They applied GRPO with verifiable rewards on math and coding problems. What emerged was remarkable: the model spontaneously developed chain-of-thought reasoning, self-verification, and backtracking—behaviors no one taught it.
Why this works: The base model has latent reasoning capabilities from pretraining on math, code, and explanatory text. RLVR doesn't teach reasoning from scratch—it selects for reasoning. When the model stumbles upon an approach that works (thinking step by step), that approach gets reinforced. When it guesses without reasoning and fails, that behavior gets penalized. Over many iterations, reasoning strategies dominate because they're more reliable.
Training setup:
- Start with the pretrained base model (DeepSeek-V3-Base)
- No supervised fine-tuning initially (R1-Zero)
- Pure RL with verifiable rewards on math/code
- Model learns to reason through exploration
Key insight: Complex reasoning emerged without explicit demonstrations—just the signal of whether answers were correct. This suggests reasoning is more "discovered" through RL than "taught" through SFT.
Process Reward Models (PRMs)
Beyond Outcome Rewards
Basic RLVR rewards only the final answer. Process Reward Models (PRMs) reward intermediate reasoning steps.
The credit assignment problem with outcome rewards: Consider a 10-step proof where step 3 contains an error. With outcome rewards, all 10 steps receive the same signal: "the final answer was wrong." The model has no way to know that steps 1-2 were fine, step 3 was the problem, and steps 4-10 merely propagated it. It might even "learn" to avoid the perfectly valid technique used in step 2 because it appeared in a failed solution. PRMs solve this by evaluating each step independently.
How PRMs accelerate learning: With outcome rewards, the model must discover good reasoning through trial and error across many complete solutions. With PRMs, feedback is immediate: step 1 was valid, step 2 was valid, step 3 was invalid. The model can focus its learning on the actual error points. Research shows PRMs can achieve the same performance with 3-5x fewer training examples than outcome-only rewards.
From research: "The verifiable reward function must capture both outcome accuracy and process validity. For mathematical problem solving, this means verifying each step in the solution chain, not just the final numerical result."
Why Process Matters
A model might get the right answer through wrong reasoning (lucky guess). PRMs ensure the reasoning itself is valid.
The lucky guess problem: If you ask "What is 17 × 24?" and the model outputs "408" without showing work, did it actually multiply or did it happen to guess correctly? With outcome-only rewards, both paths get reinforced equally. But the model that guesses has learned nothing generalizable—it will fail on the next problem. The model that actually computed has learned a procedure it can reuse. PRMs can distinguish these cases by checking intermediate work.
def process_reward(problem: str, solution: str) -> float:
    """Reward that considers reasoning steps, not just the final answer."""
    # extract_reasoning_steps, extract_final_answer, get_ground_truth, and
    # verify_step_logic are dataset- and domain-specific helpers.
    steps = extract_reasoning_steps(solution)
    final_answer = extract_final_answer(solution)
    expected = get_ground_truth(problem)
    # Outcome reward: is the final answer correct?
    outcome_correct = (final_answer == expected)
    # Process reward: what fraction of the reasoning steps check out?
    step_scores = [1.0 if verify_step_logic(step) else 0.0 for step in steps]
    process_score = sum(step_scores) / len(step_scores) if step_scores else 0.0
    # Combined reward
    if outcome_correct:
        return 0.5 + 0.5 * process_score  # Correct answer + good process
    else:
        return 0.3 * process_score  # Wrong answer, partial credit for valid steps
In-Context Process Supervision
From research: "New techniques like process reward models (PRMs) and in-context process supervision are improving multi-step reasoning efficiency by identifying and revising flawed reasoning steps."
Rather than training a separate PRM, provide step verification in-context:
Problem: Solve 3x + 7 = 22
Step 1: Subtract 7 from both sides → 3x = 15 ✓
Step 2: Divide by 3 → x = 5 ✓
Final answer: x = 5 ✓
Reward: 1.0 (all steps verified)
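A minimal sketch of this kind of step verification for the algebra example above, assuming the model's steps have already been parsed into plain equation strings (verify_equation_steps is a hypothetical helper, not a standard API): each step counts as valid if it preserves the solution set of the original equation.

import sympy

def verify_equation_steps(original_eq: str, step_eqs: list) -> float:
    """Check that each rewritten equation preserves the original solution set;
    return the fraction of verified steps."""
    x = sympy.symbols("x")
    lhs, rhs = original_eq.split("=")
    target = set(sympy.solve(sympy.Eq(sympy.sympify(lhs), sympy.sympify(rhs)), x))
    verified = 0
    for step in step_eqs:
        lhs, rhs = step.split("=")
        eq = sympy.Eq(sympy.sympify(lhs), sympy.sympify(rhs))
        if set(sympy.solve(eq, x)) == target:
            verified += 1
    return verified / len(step_eqs) if step_eqs else 0.0

# Every step of the worked example preserves x = 5, so the reward is 1.0:
print(verify_equation_steps("3*x + 7 = 22", ["3*x = 15", "x = 5"]))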
Implementing RLVR
Math Verification
import sympy
from sympy.parsing.latex import parse_latex  # needs SymPy's optional LaTeX parser dependency (antlr4 runtime)

def normalize(answer: str) -> str:
    """Fallback normalization: strip whitespace and case for plain string comparison."""
    return answer.strip().lower().replace(" ", "")

def verify_math_answer(problem: str, model_answer: str, ground_truth: str) -> bool:
    """Verify mathematical equivalence."""
    try:
        model_expr = parse_latex(model_answer)
        truth_expr = parse_latex(ground_truth)
        # Check symbolic equivalence: the difference should simplify to zero
        return sympy.simplify(model_expr - truth_expr) == 0
    except Exception:
        # Fallback to normalized string comparison if parsing fails
        return normalize(model_answer) == normalize(ground_truth)
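A quick usage sketch: symbolic equivalence means answers written differently but mathematically equal still count as correct.

# Equivalent expressions pass even when formatted differently:
print(verify_math_answer("", r"\frac{2}{4}", r"\frac{1}{2}"))   # True
print(verify_math_answer("", r"x + x", r"2 \cdot x"))           # True
print(verify_math_answer("", r"3", r"5"))                       # False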
Code Verification
import subprocess
import tempfile

def run_sandboxed(code: str, test_input: str, timeout: int = 5) -> str:
    """Minimal placeholder 'sandbox': run the code in a subprocess with a timeout.
    A real deployment should use proper isolation (containers, seccomp, etc.)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)  # write the candidate program to a temp file
        path = f.name
    result = subprocess.run(
        ["python", path], input=test_input,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

def verify_code_solution(problem: str, code: str, test_cases: list) -> float:
    """Run code against test cases and return the pass rate."""
    passed = 0
    for test in test_cases:
        try:
            result = run_sandboxed(code, test["input"], timeout=5)
            if result == test["expected_output"]:
                passed += 1
        except Exception:
            pass  # Execution error or timeout = test failed
    return passed / len(test_cases) if test_cases else 0.0
Logic Verification
def verify_logic_problem(problem, solution: str) -> bool:
    """Verify logical reasoning problems.
    `problem` is a structured object exposing .type plus the fields needed for
    its type (.correct_answer, .premises, .conclusion, or .constraints)."""
    # Extract the claimed answer (dataset-specific parsing helper)
    answer = extract_answer(solution)
    # For multiple choice
    if problem.type == "multiple_choice":
        return answer == problem.correct_answer
    # For proof-based problems
    if problem.type == "proof":
        return check_proof_validity(solution, problem.premises, problem.conclusion)
    # For constraint satisfaction
    if problem.type == "constraint":
        return check_constraints_satisfied(answer, problem.constraints)
    # Unknown problem type: treat as unverified
    return False
RLVR Training Pipeline
Basic Training Loop
def rlvr_training_step(model, problems, verifier, optimizer):
    """Single RLVR training step (simplified REINFORCE-style sketch)."""
    optimizer.zero_grad()
    total_loss = 0
    for problem in problems:
        # Generate multiple solutions per problem (assumed generation API)
        solutions = model.generate(problem, n=8, temperature=0.8)
        # Verify each solution to get rewards
        rewards = [verifier(problem, sol) for sol in solutions]
        # Compute advantages (GRPO-style group normalization)
        mean_reward = sum(rewards) / len(rewards)
        std_reward = (sum((r - mean_reward) ** 2 for r in rewards) / len(rewards)) ** 0.5
        advantages = [(r - mean_reward) / (std_reward + 1e-8) for r in rewards]
        # Policy gradient loss: reinforce solutions with positive advantage
        for solution, advantage in zip(solutions, advantages):
            log_prob = model.log_prob(problem, solution)
            loss = -advantage * log_prob
            total_loss += loss
    # Update model
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
Handling Imperfect Verifiers
Real verifiers make mistakes:
From research: "RLVR replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to {0,1}, but imperfect verifiers inevitably introduce false negatives (rejecting correct answers) and false positives (accepting incorrect ones)."
Mitigation strategies:
def robust_verification(problem: str, solution: str, verifiers: list) -> float:
    """Use multiple verifiers for robustness."""
    votes = [v(problem, solution) for v in verifiers]
    # Strict majority vote: accept only if more than half of the verifiers accept
    return 1.0 if sum(votes) > len(votes) / 2 else 0.0

def filtered_verification(problem: str, solution: str) -> float:
    """Filter obvious verifier errors before running the main verifier."""
    # Basic sanity checks (contains_final_answer / contains_obvious_errors are
    # cheap heuristics, e.g. regex checks on the solution text)
    if not contains_final_answer(solution):
        return 0.0
    if contains_obvious_errors(solution):
        return 0.0
    # Run the main verifier only on solutions that pass the cheap checks
    return main_verifier(problem, solution)
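It also helps to know how error-prone a verifier actually is before training against it. A minimal sketch (the labeled-example format and helper name are illustrative) that estimates false positive and false negative rates against a small human-labeled set:

def measure_verifier_errors(verifier, labeled_examples: list) -> dict:
    """Estimate verifier error rates from human-labeled
    (problem, solution, is_correct) triples."""
    fp = fn = positives = negatives = 0
    for problem, solution, is_correct in labeled_examples:
        accepted = verifier(problem, solution) >= 0.5
        if is_correct:
            positives += 1
            if not accepted:
                fn += 1  # false negative: correct answer rejected
        else:
            negatives += 1
            if accepted:
                fp += 1  # false positive: incorrect answer accepted
    return {
        "false_negative_rate": fn / positives if positives else 0.0,
        "false_positive_rate": fp / negatives if negatives else 0.0,
    }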
Performance Results
Math Benchmarks
From research: "The success of RLVR was first established in domains with strong verifiability, notably mathematics and code. In mathematics, RLVR—using as little as one carefully chosen training example—can nearly double performance on challenging benchmarks such as MATH500 (e.g., raising Qwen2.5-Math-1.5B from 36.0% to 73.6% accuracy)."
Does RLVR Improve Reasoning?
This is debated:
From research: "A key paper from June 2025 demonstrates that RLVR can encourage correct reasoning even when rewards are based solely on answer correctness. The analysis of RLVR's training dynamics reveals that it incentivizes correct reasoning early in the process."
However:
From Promptfoo: "While RLVR promises to improve reasoning by allowing models to learn from free exploration, there remains debate over whether it truly enhances reasoning abilities or simply boosts sampling efficiency."
The title of one analysis: "Reinforcement Learning with Verifiable Rewards Makes Models Faster, Not Smarter"
When to Use RLVR
Good Fit
| Domain | Verifier Type | Example |
|---|---|---|
| Mathematics | Symbolic equality | "Is x=5 correct?" |
| Code | Test execution | "Does code pass tests?" |
| Logic puzzles | Constraint checking | "Is solution valid?" |
| Formal proofs | Proof checkers | "Is proof valid?" |
| Games | Win/loss | "Did agent win?" |
Poor Fit
From research: "RLVR works where ground truth exists. It fails for creative writing, brand voice, or nuanced argumentation. Human preference data remains superior for subjective quality."
| Domain | Why RLVR Fails |
|---|---|
| Creative writing | No objective "correct" answer |
| Conversation quality | Subjective preferences |
| Style/tone | No verifiable criteria |
| Open-ended research | Multiple valid approaches |
RLVR vs Other Methods
Comparison
| Method | Reward Source | Best For |
|---|---|---|
| RLHF (PPO) | Human preferences | Subjective quality |
| DPO | Preference pairs | Offline alignment |
| RLVR | Programmatic verification | Verifiable tasks |
| GRPO + RLVR | Group-normalized + verifiable | Reasoning models |
Hybrid Approaches
In practice, most tasks have both verifiable and subjective components. A math solution should be correct (verifiable) but also clearly explained (subjective). A code solution should pass tests (verifiable) but also be readable and maintainable (subjective).
Hybrid approaches combine RLVR with learned reward models to capture both:
def hybrid_reward(problem: str, solution: str) -> float:
    """Combine verifiable and preference rewards."""
    # Verifiable component (verifier / is_verifiable are placeholder components)
    if is_verifiable(problem):
        correctness = verifier(problem, solution)
    else:
        correctness = 0.5  # Neutral for non-verifiable problems
    # Quality component from a learned reward model (always applicable)
    quality = quality_model(problem, solution)
    # Weighted combination
    return 0.7 * correctness + 0.3 * quality
Understanding the weights (0.7 correctness, 0.3 quality):
The weights reflect your priorities:
- 0.7 for correctness: Getting the right answer is the primary goal. A beautifully written wrong answer is worse than an ugly right answer.
- 0.3 for quality: But among correct answers, prefer ones that are well-explained, properly formatted, and easy to understand.
Tuning weights for different domains:
| Domain | Correctness Weight | Quality Weight | Reasoning |
|---|---|---|---|
| Math competition | 0.9 | 0.1 | Only the answer matters |
| Math tutoring | 0.6 | 0.4 | Explanation quality is crucial |
| Code generation | 0.7 | 0.3 | Tests matter, but so does readability |
| Customer support | 0.5 | 0.5 | Accuracy and tone equally important |
The 0.5 neutral value for non-verifiable problems: When a problem can't be verified programmatically, using 0.5 (the midpoint) adds the same constant offset to every candidate response, so the verifiable component neither helps nor hurts any particular response—the model is effectively trained on the quality signal alone for these cases.
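These domain-specific weights translate directly into configuration. A minimal sketch reusing the placeholder components from hybrid_reward above (the domain keys and weights simply mirror the table):

# Correctness/quality weights per domain, mirroring the table above.
DOMAIN_WEIGHTS = {
    "math_competition": (0.9, 0.1),
    "math_tutoring": (0.6, 0.4),
    "code_generation": (0.7, 0.3),
    "customer_support": (0.5, 0.5),
}

def hybrid_reward_for_domain(domain: str, problem: str, solution: str) -> float:
    """Same structure as hybrid_reward above, with domain-specific weighting."""
    w_correct, w_quality = DOMAIN_WEIGHTS[domain]
    correctness = verifier(problem, solution) if is_verifiable(problem) else 0.5
    quality = quality_model(problem, solution)
    return w_correct * correctness + w_quality * quality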
Future Directions
Extending Verifiability
From research: "We will see more focus on RLVR next year. Right now, RLVR is primarily applied to math and code domains. The next logical step is to not only use the final answer's correctness as a reward signal but also judge the LLM's explanations during RLVR training."
Expanding Domains
Research is working to extend RLVR to:
- Scientific reasoning (verifiable experiments)
- Legal reasoning (statute compliance)
- Medical diagnosis (ground truth outcomes)
- Factual Q&A (knowledge graph verification)
Conclusion
RLVR represents a powerful training paradigm for verifiable domains:
- Binary rewards from programmatic verification
- Process rewards for reasoning quality
- 3x more efficient than RLHF for applicable tasks
- Powers reasoning models like DeepSeek R1
Use RLVR when you have objective correctness criteria. Use RLHF when quality is subjective.
Related Articles
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.
RL Algorithms for LLM Training: PPO, GRPO, GSPO, and Beyond
A comprehensive guide to reinforcement learning algorithms for LLM alignment—PPO, GRPO, GSPO, REINFORCE++, DPO, and their variants. Understanding the tradeoffs that power modern AI assistants.
GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.