Reasoning Models: A Brief Framework
Understanding o1, o3, DeepSeek R1, and the shift from pre-training scaling to inference-time and training-time scaling—the defining trend of 2025.
The New Frontier of AI Capabilities
2025 marks a fundamental shift in how we scale AI capabilities. For years, the dominant paradigm was simple: more parameters, more training data, more compute at training time. But two new paradigms have emerged that are changing everything:
- Test-Time Compute Scaling: Models improve by "thinking longer" at inference
- Novel Training Algorithms: GRPO, RLVR, and verifiable rewards enable reasoning to emerge
This guide provides an overview and links to deep dives on both topics.
The Two Pillars of Reasoning
Test-Time Compute (Inference)
At inference time, reasoning models use extra computation to improve outputs:
| Technique | Description | Compute |
|---|---|---|
| Chain-of-Thought | Step-by-step reasoning | Low |
| Self-Consistency | Multiple paths + voting | Medium |
| Tree of Thoughts | Branching exploration | High |
| MCTS | Monte Carlo Tree Search | Very High |
| Process Reward Models | Step-by-step guidance | High |
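To make the middle rows concrete, here is a minimal sketch of self-consistency: sample several chain-of-thought completions at nonzero temperature and take a majority vote over the extracted final answers. The `generate` callable and the answer-extraction regex are placeholders, not any specific vendor API.

```python
import re
from collections import Counter
from typing import Callable

def self_consistency(
    generate: Callable[[str, float], str],  # placeholder: any sampling-based LLM call
    question: str,
    n_samples: int = 5,
) -> str:
    """Sample several chain-of-thought completions and return the majority-vote answer."""
    prompt = f"{question}\nLet's think step by step, then give the final answer as 'Answer: <value>'."
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt, 0.8)  # temperature > 0 so the reasoning paths differ
        match = re.search(r"Answer:\s*(.+)", completion)
        if match:
            answers.append(match.group(1).strip())
    # Majority vote over the extracted final answers
    return Counter(answers).most_common(1)[0][0] if answers else ""
```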
Deep dive: Test-Time Compute Scaling: CoT, ToT, MCTS, and Search-Based Reasoning
Training-Time Innovations
Novel training algorithms enable reasoning to emerge:
| Algorithm | Key Innovation | Memory |
|---|---|---|
| PPO | Clipped surrogate + value function | 2x model |
| GRPO | Group statistics, no critic | 1x model |
| RLVR | Verifiable rewards only | 1x model |
| DPO | Direct preference optimization | 1x model |
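The key line in this table is GRPO dropping the critic: instead of a learned value function, the baseline is the mean reward of a group of completions sampled for the same prompt. A minimal sketch of that advantage computation (function and variable names are illustrative, not taken from any specific library):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: normalize each completion's reward against
    the mean and std of its own group (no value network needed)."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8  # avoid division by zero for uniform groups
    return (group_rewards - mean) / std

# Example: 4 completions sampled for one prompt, rewarded 1.0 if the answer was correct
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct completions get positive advantage, incorrect negative
```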
Deep dive: Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
Key Models
OpenAI o-Series
OpenAI's reasoning models use hidden chain-of-thought and test-time search:
- o1 (September 2024): First major reasoning model
- o3 (April 16, 2025): State-of-the-art reasoning with multimodal "thinking with images"
- o4-mini (April 16, 2025): Cost-efficient reasoning optimized for math and code
- o3-pro (June 10, 2025): Highest-level performance in the o-series
From OpenAI: "o3 and o4-mini are our first AI models that can 'think with images'—they integrate visual information directly into the reasoning chain."
Key benchmarks (2025):
| Model | AIME 2025 | SWE-bench Verified | GPQA Diamond |
|---|---|---|---|
| o3 | 88.9% | 69.1% | 83.3% |
| o4-mini | 92.7% | 68.1% | 81.4% |
From OpenAI: "o4-mini is the best-performing benchmarked model on AIME 2024 and 2025, achieving remarkable performance for its size and cost."
Note: Sam Altman indicated o3/o4-mini may be the last standalone reasoning models before GPT-5, which will unify traditional and reasoning models.
DeepSeek R1
DeepSeek's open-weight reasoning models popularized GRPO (first introduced in the DeepSeekMath work):
- R1-Zero: Pure RL without SFT—reasoning emerges naturally
- R1 (January 2025): Multi-stage training for better readability
- R1-0528 (May 2025): Improved reasoning, JSON output, function-calling
- V3.1 (August 2025): Hybrid model with switchable thinking/non-thinking modes
- V3.2-Speciale: Gold-medal performance on 2025 IMO and IOI
- Distilled variants: 1.5B to 70B parameters
From DeepSeek: "We directly applied reinforcement learning to the base model without relying on supervised fine-tuning. This approach allows the model to explore chain-of-thought for solving complex problems."
R1-0528 update: "Major improvements in inference and hallucination reduction. Performance now approaches o3 and Gemini 2.5 Pro."
V3.1 key feature: One model covers both use cases—switch between reasoning (like R1) and direct answers (like V3) via chat template, eliminating the need to choose between models.
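For V3.1, the toggle lives in the chat template rather than in separate model weights. Below is a hedged sketch using the Hugging Face `transformers` tokenizer; the `thinking` keyword argument is an assumption based on the published chat template and may differ from the actual interface.

```python
from transformers import AutoTokenizer

# Assumption: the published chat template accepts a `thinking` flag.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Reasoning mode (R1-style): the template inserts the thinking prefix.
reasoning_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=True
)

# Direct-answer mode (V3-style): same weights, no thinking prefix.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=False
)
```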
The Core Insight
The breakthrough is that reasoning can emerge from simple rewards. DeepSeek R1-Zero uses only:
- Correctness reward: Binary signal for right/wrong answer
- Format reward: Encourages thinking structure
No neural reward model. No human labels for reasoning steps. Just optimize for answer correctness, and the model learns to reason.
Why this is revolutionary: Previous approaches assumed you needed to teach reasoning explicitly—either through supervised fine-tuning on reasoning traces or through reward models trained on human judgments of reasoning quality. DeepSeek showed you can skip both. Just reward correct answers, and reasoning emerges as an instrumental strategy the model discovers to get more correct answers. This is analogous to how evolution doesn't explicitly design intelligence—it just rewards survival, and intelligence emerges as a useful adaptation.
The RL magic: When a model generates multiple reasoning paths and some lead to correct answers while others don't, the RL gradient points toward "more of what led to correct answers." Over many iterations, the model learns that certain reasoning patterns (breaking problems into steps, checking intermediate results, considering edge cases) correlate with correctness. It's not following explicit reasoning instructions—it's discovering that reasoning works.
Why format rewards help: Without format constraints, a model might discover correct answers through opaque or inconsistent processes. The format reward (encouraging <think>...</think> tags) provides structure that makes the reasoning readable and debuggable, while still letting the model learn the content of reasoning through outcome-based rewards.
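Here is a minimal sketch of the two R1-Zero-style rewards described above, assuming problems with a single extractable final answer. The `<think>` tags follow the convention mentioned here; the `<answer>` tags, the exact-match check, and the 0.5 format weight are illustrative assumptions, not DeepSeek's published implementation.

```python
import re

def correctness_reward(completion: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the extracted final answer matches, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def format_reward(completion: str) -> float:
    """Small bonus for the expected <think>...</think><answer>...</answer> structure."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 0.5 if re.match(pattern, completion, re.DOTALL) else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # No neural reward model, no step-level labels: just outcome plus structure.
    return correctness_reward(completion, ground_truth) + format_reward(completion)
```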
When to Use Reasoning Models
Understanding when reasoning models add value versus when they're overkill is crucial for cost-effective AI deployment.
Use reasoning models for:
- Complex mathematical problems
- Multi-step logical reasoning
- Code generation requiring planning
- Scientific analysis
- High-stakes decisions where accuracy matters
Use standard models for:
- Simple Q&A
- Creative writing
- Summarization
- High-throughput applications
- Cost-sensitive use cases
The decision framework: Ask yourself: "Would a human need to think carefully about this, or is it quick and intuitive?" Calculating a tip, answering "What's the capital of France?", or summarizing a paragraph don't require deep thought—standard models excel and are much cheaper. Proving a theorem, debugging a complex algorithm, or analyzing a multi-factor decision require careful reasoning—reasoning models shine here.
The hidden cost of reasoning: It's not just token cost. Reasoning models have higher latency because they generate long chains of thought before answering. A standard model might respond in 500ms; a reasoning model might take 5-30 seconds for complex problems. For interactive applications, this latency matters as much as cost.
Hybrid approaches work well: Use a lightweight classifier to route queries: simple questions go to fast standard models, complex questions go to reasoning models. This captures most of the accuracy gains while controlling costs. Some systems use reasoning models only for verification—generate with a standard model, verify with a reasoning model.
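A sketch of the routing idea, with a deliberately naive keyword-and-length heuristic standing in for whatever classifier you actually train; the model callables are placeholders for your own client code.

```python
from typing import Callable

REASONING_HINTS = ("prove", "derive", "debug", "optimize", "step by step", "why does")

def route(
    query: str,
    fast_model: Callable[[str], str],
    reasoning_model: Callable[[str], str],
) -> str:
    """Send obviously hard queries to the reasoning model, everything else to the cheap one."""
    looks_hard = len(query) > 400 or any(h in query.lower() for h in REASONING_HINTS)
    return reasoning_model(query) if looks_hard else fast_model(query)
```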
Cost Considerations
Reasoning models use significantly more tokens:
| Model Type | Typical Output | Cost Multiplier |
|---|---|---|
| Standard LLM | 100-500 tokens | 1x |
| Reasoning (simple) | 500-2000 tokens | 4-10x |
| Reasoning (complex) | 2000-10000+ tokens | 20-50x |
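A back-of-the-envelope comparison using the token counts above; the per-token price is a placeholder and only the relative multiplier matters (real reasoning models often also charge more per token, so this is conservative).

```python
# Illustrative only: assume $10 per million output tokens for both model types.
price_per_token = 10 / 1_000_000

standard_tokens = 300       # typical standard-LLM response
reasoning_tokens = 6_000    # typical complex reasoning trace plus answer

standard_cost = standard_tokens * price_per_token
reasoning_cost = reasoning_tokens * price_per_token
print(f"standard: ${standard_cost:.4f}, reasoning: ${reasoning_cost:.4f}, "
      f"multiplier: {reasoning_cost / standard_cost:.0f}x")  # ~20x at these sizes
```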
Learning Path
For a complete understanding of reasoning models:
- Start here: Overview of the landscape (this post)
- Inference techniques: Test-Time Compute Scaling - CoT, ToT, MCTS, PRMs, HuggingFace search-and-learn
- Training methods: Training Reasoning Models - PPO, GRPO, reward functions, distillation
The Future
By late 2025, reasoning capabilities will be standard, not optional. The combination of efficient training (GRPO) and effective inference (test-time search) is democratizing advanced AI reasoning.
Lightweight models (10-20B range) will achieve complex reasoning nearly on par with the largest models, making reasoning accessible for most applications.
Related Articles
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.
Test-Time Compute Scaling: CoT, ToT, MCTS, and Search-Based Reasoning
A comprehensive guide to inference-time scaling techniques—Chain of Thought, Tree of Thoughts, Monte Carlo Tree Search, Process Reward Models, and the HuggingFace search-and-learn framework.
GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.