RLVR: Reinforcement Learning with Verifiable Rewards
Understanding Reinforcement Learning with Verifiable Rewards (RLVR)—the technique behind DeepSeek R1's reasoning capabilities, process reward models, and when to use verifiable vs human feedback.
Beyond Human Feedback
Traditional RLHF relies on human preferences to guide learning. But for tasks with objectively correct answers—math, code, logic—we can do better: verify correctness directly.
The fundamental insight: In RLHF, you train a reward model to predict what humans would prefer, then optimize against that proxy. But proxies are imperfect—the reward model might learn spurious correlations (longer responses score higher) rather than true quality. RLVR sidesteps this entirely: for math, the answer is either 5 or it isn't. No proxy needed. This directness eliminates an entire class of reward hacking failure modes.
Why this matters for reasoning: Teaching models to reason is hard because reasoning is a process, not an output. You can't label "good reasoning" the way you label "helpful response." But you can check if the reasoning led to a correct answer. RLVR bets that if you reward correct answers strongly enough, good reasoning will emerge as the strategy that most reliably produces them. DeepSeek R1 proved this bet correct—reasoning behaviors emerged without any explicit reasoning demonstrations.
From research: "Reinforcement Learning with Verifiable Rewards (RLVR) is an approach to training AI systems, particularly large language models, by providing them with clear, objective feedback based on whether their outputs meet predefined correctness criteria."
RLVR has powered some of the most impressive reasoning advances, including DeepSeek R1.
What Makes RLVR Different
RLHF vs RLVR
| Aspect | RLHF | RLVR |
|---|---|---|
| Feedback source | Human preferences | Programmatic verification |
| Signal type | Relative (A > B) | Absolute (correct/incorrect) |
| Cost | Expensive (human labelers) | Cheap (automated) |
| Scalability | Limited by labeling throughput | Scales with compute |
| Domain | Any task | Verifiable tasks only |
| Noise | Human disagreement | Verifier errors |
From research: "RLVR replaces learned reward models with programmatic verifiers. This approach trades generality (RLHF works for any task) for efficiency (RLVR is 3x cheaper on verifiable tasks)."
The Binary Reward
From research: "Verifiable Rewards are simple functions that provide a clear-cut, binary ground truth signal—typically a '1' (correct) or '0' (incorrect)—to indicate whether a model's output meets a predefined correctness criterion."
The elegance of binary signals: You might think binary rewards are too sparse—how does the model learn from "wrong, wrong, wrong, wrong, correct"? The answer lies in sampling: generate many solutions (often 8-16 per problem), identify which ones are correct, and reinforce those. The model doesn't learn from each individual failure; it learns from the contrast between successful and unsuccessful attempts. This is why RLVR pairs naturally with GRPO, which compares responses within a group.
Why not partial credit: It's tempting to give partial rewards—0.5 for "close" answers, 0.8 for "mostly right." But partial credit requires judgment about what "close" means, reintroducing the problems RLVR avoids. Is "x = 4.9" close to "x = 5"? Depends on context. Binary rewards are unambiguous. In practice, partial credit often hurts training by rewarding confidently wrong answers that happen to be "close" to correct ones.
def math_reward(problem: str, solution: str) -> float:
    """Simple binary reward for math problems."""
    # get_ground_truth and extract_final_answer are dataset-specific helpers:
    # one looks up the reference answer, the other parses the model's final answer.
    expected_answer = get_ground_truth(problem)
    model_answer = extract_final_answer(solution)
    # Binary signal: 1.0 for a correct final answer, 0.0 otherwise
    return 1.0 if model_answer == expected_answer else 0.0
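To make the "learn from contrast" point concrete, here is a minimal sketch (illustrative only) of how a group of binary rewards becomes group-relative advantages, the same normalization GRPO uses and that the training loop later in this article implements:

def group_advantages(rewards: list) -> list:
    """Normalize binary rewards within a sampled group: correct solutions get a
    positive advantage, incorrect ones a negative advantage."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Eight sampled solutions to one problem, two verified correct:
print(group_advantages([0, 0, 1, 0, 0, 1, 0, 0]))
# The two correct samples get positive advantages (~1.73); the six failures get ~-0.58.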
RLVR in DeepSeek R1
DeepSeek used RLVR (with GRPO) to train R1's reasoning capabilities:
From research: "RLVR is among the leading training strategies for injecting learning signals into LLMs, successfully employed by models such as DeepSeek R1 and Tülu 3."
The R1-Zero experiment: DeepSeek's most surprising finding was R1-Zero: a model trained with only RLVR, no supervised fine-tuning on reasoning examples. The starting point was the pretrained base model (DeepSeek-V3-Base), with no instruction tuning and no explicit reasoning training. They applied GRPO with verifiable rewards on math and coding problems. What emerged was remarkable: the model spontaneously developed chain-of-thought reasoning, self-verification, and backtracking—behaviors no one taught it.
Why this works: The base model has latent reasoning capabilities from pretraining on math, code, and explanatory text. RLVR doesn't teach reasoning from scratch—it selects for reasoning. When the model stumbles upon an approach that works (thinking step by step), that approach gets reinforced. When it guesses without reasoning and fails, that behavior gets penalized. Over many iterations, reasoning strategies dominate because they're more reliable.
Training setup:
- Start with the pretrained base model (DeepSeek-V3-Base)
- No supervised fine-tuning initially (R1-Zero)
- Pure RL with verifiable rewards on math/code
- Model learns to reason through exploration
Key insight: Complex reasoning emerged without explicit demonstrations—just the signal of whether answers were correct. This suggests reasoning is more "discovered" through RL than "taught" through SFT.
Process Reward Models (PRMs)
Beyond Outcome Rewards
Basic RLVR rewards only the final answer. Process Reward Models (PRMs) reward intermediate reasoning steps.
The credit assignment problem with outcome rewards: Consider a 10-step proof where step 3 contains an error. With outcome rewards, all 10 steps receive the same signal: "the final answer was wrong." The model has no way to know that steps 1-2 were fine, step 3 was the problem, and steps 4-10 merely propagated it. It might even "learn" to avoid the perfectly valid technique used in step 2 because it appeared in a failed solution. PRMs solve this by evaluating each step independently.
How PRMs accelerate learning: With outcome rewards, the model must discover good reasoning through trial and error across many complete solutions. With PRMs, feedback is immediate: step 1 was valid, step 2 was valid, step 3 was invalid. The model can focus its learning on the actual error points. Research shows PRMs can achieve the same performance with 3-5x fewer training examples than outcome-only rewards.
From research: "The verifiable reward function must capture both outcome accuracy and process validity. For mathematical problem solving, this means verifying each step in the solution chain, not just the final numerical result."
Why Process Matters
A model might get the right answer through wrong reasoning (lucky guess). PRMs ensure the reasoning itself is valid.
The lucky guess problem: If you ask "What is 17 × 24?" and the model outputs "408" without showing work, did it actually multiply or did it happen to guess correctly? With outcome-only rewards, both paths get reinforced equally. But the model that guesses has learned nothing generalizable—it will fail on the next problem. The model that actually computed has learned a procedure it can reuse. PRMs can distinguish these cases by checking intermediate work.
def process_reward(problem: str, solution: str) -> float:
    """Reward that considers reasoning steps, not just the final answer."""
    # extract_reasoning_steps, extract_final_answer, get_ground_truth, and
    # verify_step_logic are dataset- and domain-specific helpers.
    steps = extract_reasoning_steps(solution)
    final_answer = extract_final_answer(solution)
    expected = get_ground_truth(problem)
    # Outcome reward: is the final answer correct?
    outcome_correct = (final_answer == expected)
    # Process reward: what fraction of the reasoning steps check out?
    step_scores = [1.0 if verify_step_logic(step) else 0.0 for step in steps]
    process_score = sum(step_scores) / len(step_scores) if step_scores else 0.0
    # Combined reward
    if outcome_correct:
        return 0.5 + 0.5 * process_score  # Correct answer + good process
    else:
        return 0.3 * process_score  # Wrong answer, partial credit for valid steps
In-Context Process Supervision
From research: "New techniques like process reward models (PRMs) and in-context process supervision are improving multi-step reasoning efficiency by identifying and revising flawed reasoning steps."
Rather than training a separate PRM, provide step verification in-context:
Problem: Solve 3x + 7 = 22
Step 1: Subtract 7 from both sides → 3x = 15 ✓
Step 2: Divide by 3 → x = 5 ✓
Final answer: x = 5 ✓
Reward: 1.0 (all steps verified)
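A minimal sketch of this kind of step verification for the algebra example above, assuming the model's steps have already been parsed into plain equation strings (verify_equation_steps is a hypothetical helper, not a standard API): each step counts as valid if it preserves the solution set of the original equation.

import sympy

def verify_equation_steps(original_eq: str, step_eqs: list) -> float:
    """Check that each rewritten equation preserves the original solution set;
    return the fraction of verified steps."""
    x = sympy.symbols("x")
    lhs, rhs = original_eq.split("=")
    target = set(sympy.solve(sympy.Eq(sympy.sympify(lhs), sympy.sympify(rhs)), x))
    verified = 0
    for step in step_eqs:
        lhs, rhs = step.split("=")
        eq = sympy.Eq(sympy.sympify(lhs), sympy.sympify(rhs))
        if set(sympy.solve(eq, x)) == target:
            verified += 1
    return verified / len(step_eqs) if step_eqs else 0.0

# Every step of the worked example preserves x = 5, so the reward is 1.0:
print(verify_equation_steps("3*x + 7 = 22", ["3*x = 15", "x = 5"]))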
Implementing RLVR
Math Verification
import sympy
from sympy.parsing.latex import parse_latex  # needs SymPy's optional LaTeX parser dependency (antlr4 runtime)

def normalize(answer: str) -> str:
    """Fallback normalization: strip whitespace and case for plain string comparison."""
    return answer.strip().lower().replace(" ", "")

def verify_math_answer(problem: str, model_answer: str, ground_truth: str) -> bool:
    """Verify mathematical equivalence."""
    try:
        model_expr = parse_latex(model_answer)
        truth_expr = parse_latex(ground_truth)
        # Check symbolic equivalence: the difference should simplify to zero
        return sympy.simplify(model_expr - truth_expr) == 0
    except Exception:
        # Fallback to normalized string comparison if parsing fails
        return normalize(model_answer) == normalize(ground_truth)
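A quick usage sketch: symbolic equivalence means answers written differently but mathematically equal still count as correct.

# Equivalent expressions pass even when formatted differently:
print(verify_math_answer("", r"\frac{2}{4}", r"\frac{1}{2}"))   # True
print(verify_math_answer("", r"x + x", r"2 \cdot x"))           # True
print(verify_math_answer("", r"3", r"5"))                       # False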
Code Verification
import subprocess
import tempfile

def run_sandboxed(code: str, test_input: str, timeout: int = 5) -> str:
    """Minimal placeholder 'sandbox': run the code in a subprocess with a timeout.
    A real deployment should use proper isolation (containers, seccomp, etc.)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)  # write the candidate program to a temp file
        path = f.name
    result = subprocess.run(
        ["python", path], input=test_input,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

def verify_code_solution(problem: str, code: str, test_cases: list) -> float:
    """Run code against test cases and return the pass rate."""
    passed = 0
    for test in test_cases:
        try:
            result = run_sandboxed(code, test["input"], timeout=5)
            if result == test["expected_output"]:
                passed += 1
        except Exception:
            pass  # Execution error or timeout = test failed
    return passed / len(test_cases) if test_cases else 0.0
Logic Verification
def verify_logic_problem(problem, solution: str) -> bool:
    """Verify logical reasoning problems.
    `problem` is a structured object exposing .type plus the fields needed for
    its type (.correct_answer, .premises, .conclusion, or .constraints)."""
    # Extract the claimed answer (dataset-specific parsing helper)
    answer = extract_answer(solution)
    # For multiple choice
    if problem.type == "multiple_choice":
        return answer == problem.correct_answer
    # For proof-based problems
    if problem.type == "proof":
        return check_proof_validity(solution, problem.premises, problem.conclusion)
    # For constraint satisfaction
    if problem.type == "constraint":
        return check_constraints_satisfied(answer, problem.constraints)
    # Unknown problem type: treat as unverified
    return False
RLVR Training Pipeline
Basic Training Loop
def rlvr_training_step(model, problems, verifier, optimizer):
    """Single RLVR training step (simplified REINFORCE-style sketch)."""
    optimizer.zero_grad()
    total_loss = 0
    for problem in problems:
        # Generate multiple solutions per problem (assumed generation API)
        solutions = model.generate(problem, n=8, temperature=0.8)
        # Verify each solution to get rewards
        rewards = [verifier(problem, sol) for sol in solutions]
        # Compute advantages (GRPO-style group normalization)
        mean_reward = sum(rewards) / len(rewards)
        std_reward = (sum((r - mean_reward) ** 2 for r in rewards) / len(rewards)) ** 0.5
        advantages = [(r - mean_reward) / (std_reward + 1e-8) for r in rewards]
        # Policy gradient loss: reinforce solutions with positive advantage
        for solution, advantage in zip(solutions, advantages):
            log_prob = model.log_prob(problem, solution)
            loss = -advantage * log_prob
            total_loss += loss
    # Update model
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
Handling Imperfect Verifiers
Real verifiers make mistakes:
From research: "RLVR replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to {0,1}, but imperfect verifiers inevitably introduce false negatives (rejecting correct answers) and false positives (accepting incorrect ones)."
Mitigation strategies:
def robust_verification(problem: str, solution: str, verifiers: list) -> float:
    """Use multiple verifiers for robustness."""
    votes = [v(problem, solution) for v in verifiers]
    # Strict majority vote: accept only if more than half of the verifiers accept
    return 1.0 if sum(votes) > len(votes) / 2 else 0.0

def filtered_verification(problem: str, solution: str) -> float:
    """Filter obvious verifier errors before running the main verifier."""
    # Basic sanity checks (contains_final_answer / contains_obvious_errors are
    # cheap heuristics, e.g. regex checks on the solution text)
    if not contains_final_answer(solution):
        return 0.0
    if contains_obvious_errors(solution):
        return 0.0
    # Run the main verifier only on solutions that pass the cheap checks
    return main_verifier(problem, solution)
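It also helps to know how error-prone a verifier actually is before training against it. A minimal sketch (the labeled-example format and helper name are illustrative) that estimates false positive and false negative rates against a small human-labeled set:

def measure_verifier_errors(verifier, labeled_examples: list) -> dict:
    """Estimate verifier error rates from human-labeled
    (problem, solution, is_correct) triples."""
    fp = fn = positives = negatives = 0
    for problem, solution, is_correct in labeled_examples:
        accepted = verifier(problem, solution) >= 0.5
        if is_correct:
            positives += 1
            if not accepted:
                fn += 1  # false negative: correct answer rejected
        else:
            negatives += 1
            if accepted:
                fp += 1  # false positive: incorrect answer accepted
    return {
        "false_negative_rate": fn / positives if positives else 0.0,
        "false_positive_rate": fp / negatives if negatives else 0.0,
    }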
Performance Results
Math Benchmarks
From research: "The success of RLVR was first established in domains with strong verifiability, notably mathematics and code. In mathematics, RLVR—using as little as one carefully chosen training example—can nearly double performance on challenging benchmarks such as MATH500 (e.g., raising Qwen2.5-Math-1.5B from 36.0% to 73.6% accuracy)."
Does RLVR Improve Reasoning?
This is debated:
From research: "A key paper from June 2025 demonstrates that RLVR can encourage correct reasoning even when rewards are based solely on answer correctness. The analysis of RLVR's training dynamics reveals that it incentivizes correct reasoning early in the process."
However:
From Promptfoo: "While RLVR promises to improve reasoning by allowing models to learn from free exploration, there remains debate over whether it truly enhances reasoning abilities or simply boosts sampling efficiency."
The title of one analysis: "Reinforcement Learning with Verifiable Rewards Makes Models Faster, Not Smarter"
When to Use RLVR
Good Fit
| Domain | Verifier Type | Example |
|---|---|---|
| Mathematics | Symbolic equality | "Is x=5 correct?" |
| Code | Test execution | "Does code pass tests?" |
| Logic puzzles | Constraint checking | "Is solution valid?" |
| Formal proofs | Proof checkers | "Is proof valid?" |
| Games | Win/loss | "Did agent win?" |
Poor Fit
From research: "RLVR works where ground truth exists. It fails for creative writing, brand voice, or nuanced argumentation. Human preference data remains superior for subjective quality."
| Domain | Why RLVR Fails |
|---|---|
| Creative writing | No objective "correct" answer |
| Conversation quality | Subjective preferences |
| Style/tone | No verifiable criteria |
| Open-ended research | Multiple valid approaches |
RLVR vs Other Methods
Comparison
| Method | Reward Source | Best For |
|---|---|---|
| RLHF (PPO) | Human preferences | Subjective quality |
| DPO | Preference pairs | Offline alignment |
| RLVR | Programmatic verification | Verifiable tasks |
| GRPO + RLVR | Group-normalized + verifiable | Reasoning models |
Hybrid Approaches
In practice, most tasks have both verifiable and subjective components. A math solution should be correct (verifiable) but also clearly explained (subjective). A code solution should pass tests (verifiable) but also be readable and maintainable (subjective).
Hybrid approaches combine RLVR with learned reward models to capture both:
def hybrid_reward(problem: str, solution: str) -> float:
    """Combine verifiable and preference rewards."""
    # Verifiable component (verifier / is_verifiable are placeholder components)
    if is_verifiable(problem):
        correctness = verifier(problem, solution)
    else:
        correctness = 0.5  # Neutral for non-verifiable problems
    # Quality component from a learned reward model (always applicable)
    quality = quality_model(problem, solution)
    # Weighted combination
    return 0.7 * correctness + 0.3 * quality
Understanding the weights (0.7 correctness, 0.3 quality):
The weights reflect your priorities:
- 0.7 for correctness: Getting the right answer is the primary goal. A beautifully written wrong answer is worse than an ugly right answer.
- 0.3 for quality: But among correct answers, prefer ones that are well-explained, properly formatted, and easy to understand.
Tuning weights for different domains:
| Domain | Correctness Weight | Quality Weight | Reasoning |
|---|---|---|---|
| Math competition | 0.9 | 0.1 | Only the answer matters |
| Math tutoring | 0.6 | 0.4 | Explanation quality is crucial |
| Code generation | 0.7 | 0.3 | Tests matter, but so does readability |
| Customer support | 0.5 | 0.5 | Accuracy and tone equally important |
The 0.5 neutral value for non-verifiable problems: When a problem can't be verified programmatically, using 0.5 (the midpoint) adds the same constant offset to every candidate response, so the verifiable component neither helps nor hurts any particular response—the model is effectively trained on the quality signal alone for these cases.
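These domain-specific weights translate directly into configuration. A minimal sketch reusing the placeholder components from hybrid_reward above (the domain keys and weights simply mirror the table):

# Correctness/quality weights per domain, mirroring the table above.
DOMAIN_WEIGHTS = {
    "math_competition": (0.9, 0.1),
    "math_tutoring": (0.6, 0.4),
    "code_generation": (0.7, 0.3),
    "customer_support": (0.5, 0.5),
}

def hybrid_reward_for_domain(domain: str, problem: str, solution: str) -> float:
    """Same structure as hybrid_reward above, with domain-specific weighting."""
    w_correct, w_quality = DOMAIN_WEIGHTS[domain]
    correctness = verifier(problem, solution) if is_verifiable(problem) else 0.5
    quality = quality_model(problem, solution)
    return w_correct * correctness + w_quality * quality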
Future Directions
Extending Verifiability
From research: "We will see more focus on RLVR next year. Right now, RLVR is primarily applied to math and code domains. The next logical step is to not only use the final answer's correctness as a reward signal but also judge the LLM's explanations during RLVR training."
Expanding Domains
Research is working to extend RLVR to:
- Scientific reasoning (verifiable experiments)
- Legal reasoning (statute compliance)
- Medical diagnosis (ground truth outcomes)
- Factual Q&A (knowledge graph verification)
Conclusion
RLVR represents a powerful training paradigm for verifiable domains:
- Binary rewards from programmatic verification
- Process rewards for reasoning quality
- 3x more efficient than RLHF for applicable tasks
- Powers reasoning models like DeepSeek R1
Use RLVR when you have objective correctness criteria. Use RLHF when quality is subjective.
Related Articles
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.
RL Algorithms for LLM Training: PPO, GRPO, GSPO, and Beyond
A comprehensive guide to reinforcement learning algorithms for LLM alignment—PPO, GRPO, GSPO, REINFORCE++, DPO, and their variants. Understanding the tradeoffs that power modern AI assistants.
GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.