
RLVR: Reinforcement Learning with Verifiable Rewards

Understanding Reinforcement Learning with Verifiable Rewards (RLVR)—the technique behind DeepSeek R1's reasoning capabilities, process reward models, and when to use verifiable vs human feedback.


Beyond Human Feedback

Traditional RLHF relies on human preferences to guide learning. But for tasks with objectively correct answers—math, code, logic—we can do better: verify correctness directly.

The fundamental insight: In RLHF, you train a reward model to predict what humans would prefer, then optimize against that proxy. But proxies are imperfect—the reward model might learn spurious correlations (longer responses score higher) rather than true quality. RLVR sidesteps this entirely: for math, the answer is either 5 or it isn't. No proxy needed. This directness eliminates an entire class of reward hacking failure modes.

Why this matters for reasoning: Teaching models to reason is hard because reasoning is a process, not an output. You can't label "good reasoning" the way you label "helpful response." But you can check if the reasoning led to a correct answer. RLVR bets that if you reward correct answers strongly enough, good reasoning will emerge as the strategy that most reliably produces them. DeepSeek R1 proved this bet correct—reasoning behaviors emerged without any explicit reasoning demonstrations.

From research: "Reinforcement Learning with Verifiable Rewards (RLVR) is an approach to training AI systems, particularly large language models, by providing them with clear, objective feedback based on whether their outputs meet predefined correctness criteria."

RLVR has powered some of the most impressive reasoning advances, including DeepSeek R1.

What Makes RLVR Different

RLHF vs RLVR

| Aspect | RLHF | RLVR |
| --- | --- | --- |
| Feedback source | Human preferences | Programmatic verification |
| Signal type | Relative (A > B) | Absolute (correct/incorrect) |
| Cost | Expensive (human labelers) | Cheap (automated) |
| Scalability | Limited by labeling | Unlimited |
| Domain | Any task | Verifiable tasks only |
| Noise | Human disagreement | Verifier errors |

From research: "RLVR replaces learned reward models with programmatic verifiers. This approach trades generality (RLHF works for any task) for efficiency (RLVR is 3x cheaper on verifiable tasks)."

The Binary Reward

From research: "Verifiable Rewards are simple functions that provide a clear-cut, binary ground truth signal—typically a '1' (correct) or '0' (incorrect)—to indicate whether a model's output meets a predefined correctness criterion."

The elegance of binary signals: You might think binary rewards are too sparse—how does the model learn from "wrong, wrong, wrong, wrong, correct"? The answer lies in sampling: generate many solutions (often 8-16 per problem), identify which ones are correct, and reinforce those. The model doesn't learn from each individual failure; it learns from the contrast between successful and unsuccessful attempts. This is why RLVR pairs naturally with GRPO, which compares responses within a group.

Why not partial credit: It's tempting to give partial rewards—0.5 for "close" answers, 0.8 for "mostly right." But partial credit requires judgment about what "close" means, reintroducing the problems RLVR avoids. Is "x = 4.9" close to "x = 5"? Depends on context. Binary rewards are unambiguous. In practice, partial credit often hurts training by rewarding confidently wrong answers that happen to be "close" to correct ones.

Python
def math_reward(problem: str, solution: str) -> float:
    """Simple binary reward for math problems."""
    expected_answer = get_ground_truth(problem)
    model_answer = extract_final_answer(solution)

    if model_answer == expected_answer:
        return 1.0
    else:
        return 0.0
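
To make the contrast idea concrete, here is a toy numerical sketch (the numbers are illustrative): with eight sampled solutions and two verified correct, subtracting the group mean turns sparse 0/1 rewards into positive advantages for the correct attempts and negative ones for the rest.

Python
# Illustrative only: 8 sampled solutions to one problem, 2 verified correct
rewards = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]

mean_r = sum(rewards) / len(rewards)        # 0.25
advantages = [r - mean_r for r in rewards]  # correct: +0.75, incorrect: -0.25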

RLVR in DeepSeek R1

DeepSeek used RLVR (with GRPO) to train R1's reasoning capabilities:

From research: "RLVR is among the leading training strategies for injecting learning signals into LLMs, successfully employed by models such as DeepSeek R1 and Tülu 3."

The R1-Zero experiment: DeepSeek's most surprising finding was R1-Zero: a model trained with only RLVR, no supervised fine-tuning on reasoning examples. The base model (DeepSeek-V3-Base) had broad knowledge from pretraining but no explicit reasoning training. They applied GRPO with verifiable rewards on math and coding problems. What emerged was remarkable: the model spontaneously developed chain-of-thought reasoning, self-verification, and backtracking—behaviors no one taught it.

Why this works: The base model has latent reasoning capabilities from pretraining on math, code, and explanatory text. RLVR doesn't teach reasoning from scratch—it selects for reasoning. When the model stumbles upon an approach that works (thinking step by step), that approach gets reinforced. When it guesses without reasoning and fails, that behavior gets penalized. Over many iterations, reasoning strategies dominate because they're more reliable.

Training setup:

  1. Start with the base model (DeepSeek-V3-Base)
  2. No supervised fine-tuning initially (R1-Zero)
  3. Pure RL with verifiable rewards on math/code
  4. Model learns to reason through exploration

Key insight: Complex reasoning emerged without explicit demonstrations—just the signal of whether answers were correct. This suggests reasoning is more "discovered" through RL than "taught" through SFT.

Process Reward Models (PRMs)

Beyond Outcome Rewards

Basic RLVR rewards only the final answer. Process Reward Models (PRMs) reward intermediate reasoning steps.

The credit assignment problem with outcome rewards: Consider a 10-step proof where step 3 contains an error. With outcome rewards, all 10 steps receive the same signal: "the final answer was wrong." The model has no way to know that steps 1-2 were fine, step 3 was the problem, and steps 4-10 merely inherited the mistake. It might "learn" to avoid a perfectly valid step because it was part of a failed solution. PRMs solve this by evaluating each step independently.

How PRMs accelerate learning: With outcome rewards, the model must discover good reasoning through trial and error across many complete solutions. With PRMs, feedback is immediate: step 1 was valid, step 2 was valid, step 3 was invalid. The model can focus its learning on the actual error points. Research shows PRMs can achieve the same performance with 3-5x fewer training examples than outcome-only rewards.
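
As a toy illustration of that credit-assignment difference (the values are illustrative and assume a hypothetical per-step verifier):

Python
# Outcome-only: every step of the 10-step solution inherits "final answer wrong"
outcome_signal = [0.0] * 10

# Process reward: each step is judged on its own, isolating the real error at step 3
process_signal = [1.0, 1.0, 0.0,                      # steps 1-2 valid, step 3 invalid
                  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # later steps inherit the error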

From research: "The verifiable reward function must capture both outcome accuracy and process validity. For mathematical problem solving, this means verifying each step in the solution chain, not just the final numerical result."

Why Process Matters

A model might get the right answer through wrong reasoning (lucky guess). PRMs ensure the reasoning itself is valid.

The lucky guess problem: If you ask "What is 17 × 24?" and the model outputs "408" without showing work, did it actually multiply or did it happen to guess correctly? With outcome-only rewards, both paths get reinforced equally. But the model that guesses has learned nothing generalizable—it will fail on the next problem. The model that actually computed has learned a procedure it can reuse. PRMs can distinguish these cases by checking intermediate work.

Python
def process_reward(problem: str, solution: str) -> float:
    """Reward that considers reasoning steps."""
    steps = extract_reasoning_steps(solution)
    final_answer = extract_final_answer(solution)
    expected = get_ground_truth(problem)

    # Outcome reward
    outcome_correct = (final_answer == expected)

    # Process reward (verify each step)
    step_scores = []
    for step in steps:
        if verify_step_logic(step):
            step_scores.append(1.0)
        else:
            step_scores.append(0.0)

    process_score = sum(step_scores) / len(step_scores) if step_scores else 0

    # Combined reward
    if outcome_correct:
        return 0.5 + 0.5 * process_score  # Correct answer + good process
    else:
        return 0.3 * process_score  # Wrong answer but partial credit for process

In-Context Process Supervision

From research: "New techniques like process reward models (PRMs) and in-context process supervision are improving multi-step reasoning efficiency by identifying and revising flawed reasoning steps."

Rather than training a separate PRM, provide step verification in-context:

Code
Problem: Solve 3x + 7 = 22

Step 1: Subtract 7 from both sides → 3x = 15 ✓
Step 2: Divide by 3 → x = 5 ✓
Final answer: x = 5 ✓

Reward: 1.0 (all steps verified)
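
One way to implement step checks like those above for simple equation rewriting is to test that each rewritten equation keeps the same solution set as the original. This is a sketch under that assumption, not the method from the quoted research:

Python
import sympy

def verify_equation_steps(steps: list, variable: str = "x") -> float:
    """Sketch: a step passes if it has the same solution set as the original equation."""
    x = sympy.Symbol(variable)
    expected = None
    for step in steps:
        lhs, rhs = step.split("=")
        solutions = set(sympy.solve(sympy.Eq(sympy.sympify(lhs), sympy.sympify(rhs)), x))
        if expected is None:
            expected = solutions   # solution set of the original equation
        elif solutions != expected:
            return 0.0             # this step changed the solution set: flawed rewrite
    return 1.0

# The worked example above: 3x + 7 = 22 -> 3x = 15 -> x = 5
verify_equation_steps(["3*x + 7 = 22", "3*x = 15", "x = 5"])  # 1.0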

Implementing RLVR

Math Verification

Python
import sympy
from sympy.parsing.latex import parse_latex

def verify_math_answer(problem: str, model_answer: str, ground_truth: str) -> bool:
    """Verify mathematical equivalence."""
    try:
        model_expr = parse_latex(model_answer)
        truth_expr = parse_latex(ground_truth)

        # Check symbolic equivalence
        return sympy.simplify(model_expr - truth_expr) == 0
    except Exception:
        # Fallback to string comparison if LaTeX parsing fails
        return normalize(model_answer) == normalize(ground_truth)

Code Verification

Python
import subprocess
import tempfile

def verify_code_solution(problem: str, code: str, test_cases: list) -> float:
    """Run code against test cases."""
    passed = 0

    for test in test_cases:
        try:
            # Execute in sandbox
            result = run_sandboxed(code, test["input"], timeout=5)

            if result == test["expected_output"]:
                passed += 1
        except Exception:
            pass  # Execution error = test failed

    return passed / len(test_cases)

Logic Verification

Python
def verify_logic_problem(problem, solution: str) -> bool:
    """Verify logical reasoning problems. `problem` is a structured object, not a string."""
    # Extract claimed answer
    answer = extract_answer(solution)

    # For multiple choice
    if problem.type == "multiple_choice":
        return answer == problem.correct_answer

    # For proof-based
    if problem.type == "proof":
        return check_proof_validity(solution, problem.premises, problem.conclusion)

    # For constraint satisfaction
    if problem.type == "constraint":
        return check_constraints_satisfied(answer, problem.constraints)

RLVR Training Pipeline

Basic Training Loop

Python
def rlvr_training_step(model, problems, verifier, optimizer):
    """Single RLVR training step."""
    total_loss = 0

    for problem in problems:
        # Generate multiple solutions
        solutions = model.generate(problem, n=8, temperature=0.8)

        # Verify each solution
        rewards = [verifier(problem, sol) for sol in solutions]

        # Compute advantages (GRPO-style)
        mean_reward = sum(rewards) / len(rewards)
        std_reward = (sum((r - mean_reward)**2 for r in rewards) / len(rewards)) ** 0.5
        advantages = [(r - mean_reward) / (std_reward + 1e-8) for r in rewards]

        # Policy gradient loss
        for solution, advantage in zip(solutions, advantages):
            log_prob = model.log_prob(problem, solution)
            loss = -advantage * log_prob
            total_loss += loss

    # Update model
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    return total_loss.item()

Handling Imperfect Verifiers

Real verifiers make mistakes:

From research: "RLVR replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to {0,1}, but imperfect verifiers inevitably introduce false negatives (rejecting correct answers) and false positives (accepting incorrect ones)."

Mitigation strategies:

Python
def robust_verification(problem: str, solution: str, verifiers: list) -> float:
    """Use multiple verifiers for robustness."""
    votes = [v(problem, solution) for v in verifiers]

    # Soft vote: fraction of verifiers that accept the solution
    return sum(votes) / len(votes)

def filtered_verification(problem: str, solution: str) -> float:
    """Filter obvious verifier errors."""
    # Basic sanity checks
    if not contains_final_answer(solution):
        return 0.0

    if contains_obvious_errors(solution):
        return 0.0

    # Run main verifier
    return main_verifier(problem, solution)

Performance Results

Math Benchmarks

From research: "The success of RLVR was first established in domains with strong verifiability, notably mathematics and code. In mathematics, RLVR—using as little as one carefully chosen training example—can nearly double performance on challenging benchmarks such as MATH500 (e.g., raising Qwen2.5-Math-1.5B from 36.0% to 73.6% accuracy)."

Does RLVR Improve Reasoning?

This is debated:

From research: "A key paper from June 2025 demonstrates that RLVR can encourage correct reasoning even when rewards are based solely on answer correctness. The analysis of RLVR's training dynamics reveals that it incentivizes correct reasoning early in the process."

However:

From Promptfoo: "While RLVR promises to improve reasoning by allowing models to learn from free exploration, there remains debate over whether it truly enhances reasoning abilities or simply boosts sampling efficiency."

The title of one analysis: "Reinforcement Learning with Verifiable Rewards Makes Models Faster, Not Smarter"

When to Use RLVR

Good Fit

| Domain | Verifier Type | Example |
| --- | --- | --- |
| Mathematics | Symbolic equality | "Is x=5 correct?" |
| Code | Test execution | "Does code pass tests?" |
| Logic puzzles | Constraint checking | "Is solution valid?" |
| Formal proofs | Proof checkers | "Is proof valid?" |
| Games | Win/loss | "Did agent win?" |
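
A sketch of how these domains might route to the verifiers sketched earlier in this post (their signatures differ slightly, so this dispatch is illustrative):

Python
# Illustrative routing from domain to the verifiers implemented earlier
VERIFIERS = {
    "math": verify_math_answer,       # symbolic equality
    "code": verify_code_solution,     # test execution
    "logic": verify_logic_problem,    # constraint / proof checking
}

def verify(domain: str, *args) -> float:
    return float(VERIFIERS[domain](*args))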

Poor Fit

From research: "RLVR works where ground truth exists. It fails for creative writing, brand voice, or nuanced argumentation. Human preference data remains superior for subjective quality."

| Domain | Why RLVR Fails |
| --- | --- |
| Creative writing | No objective "correct" answer |
| Conversation quality | Subjective preferences |
| Style/tone | No verifiable criteria |
| Open-ended research | Multiple valid approaches |

RLVR vs Other Methods

Comparison

| Method | Reward Source | Best For |
| --- | --- | --- |
| RLHF (PPO) | Human preferences | Subjective quality |
| DPO | Preference pairs | Offline alignment |
| RLVR | Programmatic verification | Verifiable tasks |
| GRPO + RLVR | Group-normalized + verifiable | Reasoning models |

Hybrid Approaches

In practice, most tasks have both verifiable and subjective components. A math solution should be correct (verifiable) but also clearly explained (subjective). A code solution should pass tests (verifiable) but also be readable and maintainable (subjective).

Hybrid approaches combine RLVR with learned reward models to capture both:

Python
def hybrid_reward(problem: str, solution: str) -> float:
    """Combine verifiable and preference rewards."""
    # Verifiable component (if applicable)
    if is_verifiable(problem):
        correctness = verifier(problem, solution)
    else:
        correctness = 0.5  # Neutral for non-verifiable

    # Quality component (always applicable)
    quality = quality_model(problem, solution)

    # Combine
    return 0.7 * correctness + 0.3 * quality

Understanding the weights (0.7 correctness, 0.3 quality):

The weights reflect your priorities:

  • 0.7 for correctness: Getting the right answer is the primary goal. A beautifully written wrong answer is worse than an ugly right answer.
  • 0.3 for quality: But among correct answers, prefer ones that are well-explained, properly formatted, and easy to understand.

Tuning weights for different domains:

| Domain | Correctness Weight | Quality Weight | Reasoning |
| --- | --- | --- | --- |
| Math competition | 0.9 | 0.1 | Only the answer matters |
| Math tutoring | 0.6 | 0.4 | Explanation quality is crucial |
| Code generation | 0.7 | 0.3 | Tests matter, but so does readability |
| Customer support | 0.5 | 0.5 | Accuracy and tone equally important |

The 0.5 neutral value for non-verifiable problems: When a problem can't be verified programmatically, using 0.5 (the midpoint) means the verifiable component neither helps nor hurts—the model is trained purely on the quality signal for these cases.
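
A sketch of domain-specific weighting, reusing the hypothetical verifier, quality_model, and is_verifiable helpers from hybrid_reward above (the weights mirror the table):

Python
# Hypothetical per-domain (correctness, quality) weights, mirroring the table above
DOMAIN_WEIGHTS = {
    "math_competition": (0.9, 0.1),
    "math_tutoring":    (0.6, 0.4),
    "code_generation":  (0.7, 0.3),
    "customer_support": (0.5, 0.5),
}

def domain_hybrid_reward(domain: str, problem: str, solution: str) -> float:
    """Sketch: hybrid reward with domain-tuned weights."""
    w_correct, w_quality = DOMAIN_WEIGHTS.get(domain, (0.7, 0.3))
    correctness = verifier(problem, solution) if is_verifiable(problem) else 0.5
    quality = quality_model(problem, solution)
    return w_correct * correctness + w_quality * quality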

Future Directions

Extending Verifiability

From research: "We will see more focus on RLVR next year. Right now, RLVR is primarily applied to math and code domains. The next logical step is to not only use the final answer's correctness as a reward signal but also judge the LLM's explanations during RLVR training."

Expanding Domains

Research is working to extend RLVR to:

  • Scientific reasoning (verifiable experiments)
  • Legal reasoning (statute compliance)
  • Medical diagnosis (ground truth outcomes)
  • Factual Q&A (knowledge graph verification)

Conclusion

RLVR represents a powerful training paradigm for verifiable domains:

  1. Binary rewards from programmatic verification
  2. Process rewards for reasoning quality
  3. 3x more efficient than RLHF for applicable tasks
  4. Powers reasoning models like DeepSeek R1

Use RLVR when you have objective correctness criteria. Use RLHF when quality is subjective.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
