Reasoning Models: A Brief Framework
Understanding o1, o3, DeepSeek R1, and the shift from pre-training scaling to inference-time and training-time scaling—the defining trend of 2025.
The New Frontier of AI Capabilities
2025 marks a fundamental shift in how we scale AI capabilities. For years, the dominant paradigm was simple: more parameters, more training data, more compute at training time. But two new paradigms have emerged that are changing everything:
- Test-Time Compute Scaling: Models improve by "thinking longer" at inference
- Novel Training Algorithms: GRPO, RLVR, and verifiable rewards enable reasoning to emerge
This guide provides an overview and links to deep dives on both topics.
The Two Pillars of Reasoning
Test-Time Compute (Inference)
At inference time, reasoning models use extra computation to improve outputs:
| Technique | Description | Compute |
|---|---|---|
| Chain-of-Thought | Step-by-step reasoning | Low |
| Self-Consistency | Multiple paths + voting | Medium |
| Tree of Thoughts | Branching exploration | High |
| MCTS | Monte Carlo Tree Search | Very High |
| Process Reward Models | Step-by-step guidance | High |
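To make the middle rows concrete, here is a minimal sketch of self-consistency: sample several chain-of-thought completions at nonzero temperature and take a majority vote over the extracted final answers. The `generate` callable and the answer-extraction regex are placeholders, not any specific vendor API.

```python
import re
from collections import Counter
from typing import Callable

def self_consistency(
    generate: Callable[[str, float], str],  # placeholder: any sampling-based LLM call
    question: str,
    n_samples: int = 5,
) -> str:
    """Sample several chain-of-thought completions and return the majority-vote answer."""
    prompt = f"{question}\nLet's think step by step, then give the final answer as 'Answer: <value>'."
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt, 0.8)  # temperature > 0 so the reasoning paths differ
        match = re.search(r"Answer:\s*(.+)", completion)
        if match:
            answers.append(match.group(1).strip())
    # Majority vote over the extracted final answers
    return Counter(answers).most_common(1)[0][0] if answers else ""
```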
Deep dive: Test-Time Compute Scaling: CoT, ToT, MCTS, and Search-Based Reasoning
Training-Time Innovations
Novel training algorithms enable reasoning to emerge:
| Algorithm | Key Innovation | Memory |
|---|---|---|
| PPO | Clipped surrogate + value function | 2x model |
| GRPO | Group statistics, no critic | 1x model |
| RLVR | Verifiable rewards only | 1x model |
| DPO | Direct preference optimization | 1x model |
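The key line in this table is GRPO dropping the critic: instead of a learned value function, the baseline is the mean reward of a group of completions sampled for the same prompt. A minimal sketch of that advantage computation (function and variable names are illustrative, not taken from any specific library):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: normalize each completion's reward against
    the mean and std of its own group (no value network needed)."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8  # avoid division by zero for uniform groups
    return (group_rewards - mean) / std

# Example: 4 completions sampled for one prompt, rewarded 1.0 if the answer was correct
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct completions get positive advantage, incorrect negative
```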
Deep dive: Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
Key Models
OpenAI o-Series
OpenAI's reasoning models use hidden chain-of-thought and test-time search:
- o1 (September 2024): First major reasoning model
- o3 (April 16, 2025): State-of-the-art reasoning with multimodal "thinking with images"
- o4-mini (April 16, 2025): Cost-efficient reasoning optimized for math and code
- o3-pro (June 10, 2025): Highest-level performance in the o-series
From OpenAI: "o3 and o4-mini are our first AI models that can 'think with images'—they integrate visual information directly into the reasoning chain."
Key benchmarks (2025):
| Model | AIME 2025 | SWE-bench Verified | GPQA Diamond |
|---|---|---|---|
| o3 | 88.9% | 69.1% | 83.3% |
| o4-mini | 92.7% | 68.1% | 81.4% |
From OpenAI: "o4-mini is the best-performing benchmarked model on AIME 2024 and 2025, achieving remarkable performance for its size and cost."
Note: Sam Altman indicated o3/o4-mini may be the last standalone reasoning models before GPT-5, which will unify traditional and reasoning models.
DeepSeek R1
DeepSeek's open-weight reasoning models popularized GRPO (first introduced in the DeepSeekMath work):
- R1-Zero: Pure RL without SFT—reasoning emerges naturally
- R1 (January 2025): Multi-stage training for better readability
- R1-0528 (May 2025): Improved reasoning, JSON output, function-calling
- V3.1 (August 2025): Hybrid model with switchable thinking/non-thinking modes
- V3.2-Speciale: Gold-medal performance on 2025 IMO and IOI
- Distilled variants: 1.5B to 70B parameters
From DeepSeek: "We directly applied reinforcement learning to the base model without relying on supervised fine-tuning. This approach allows the model to explore chain-of-thought for solving complex problems."
R1-0528 update: "Major improvements in inference and hallucination reduction. Performance now approaches o3 and Gemini 2.5 Pro."
V3.1 key feature: One model covers both use cases—switch between reasoning (like R1) and direct answers (like V3) via chat template, eliminating the need to choose between models.
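For V3.1, the toggle lives in the chat template rather than in separate model weights. Below is a hedged sketch using the Hugging Face `transformers` tokenizer; the `thinking` keyword argument is an assumption based on the published chat template and may differ from the actual interface.

```python
from transformers import AutoTokenizer

# Assumption: the published chat template accepts a `thinking` flag.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Reasoning mode (R1-style): the template inserts the thinking prefix.
reasoning_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=True
)

# Direct-answer mode (V3-style): same weights, no thinking prefix.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=False
)
```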
The Core Insight
The breakthrough is that reasoning can emerge from simple rewards. DeepSeek R1-Zero uses only:
- Correctness reward: Binary signal for right/wrong answer
- Format reward: Encourages thinking structure
No neural reward model. No human labels for reasoning steps. Just optimize for answer correctness, and the model learns to reason.
Why this is revolutionary: Previous approaches assumed you needed to teach reasoning explicitly—either through supervised fine-tuning on reasoning traces or through reward models trained on human judgments of reasoning quality. DeepSeek showed you can skip both. Just reward correct answers, and reasoning emerges as an instrumental strategy the model discovers to get more correct answers. This is analogous to how evolution doesn't explicitly design intelligence—it just rewards survival, and intelligence emerges as a useful adaptation.
The RL magic: When a model generates multiple reasoning paths and some lead to correct answers while others don't, the RL gradient points toward "more of what led to correct answers." Over many iterations, the model learns that certain reasoning patterns (breaking problems into steps, checking intermediate results, considering edge cases) correlate with correctness. It's not following explicit reasoning instructions—it's discovering that reasoning works.
Why format rewards help: Without format constraints, a model might discover correct answers through opaque or inconsistent processes. The format reward (encouraging <think>...</think> tags) provides structure that makes the reasoning readable and debuggable, while still letting the model learn the content of reasoning through outcome-based rewards.
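Here is a minimal sketch of the two R1-Zero-style rewards described above, assuming problems with a single extractable final answer. The `<think>` tags follow the convention mentioned here; the `<answer>` tags, the exact-match check, and the 0.5 format weight are illustrative assumptions, not DeepSeek's published implementation.

```python
import re

def correctness_reward(completion: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the extracted final answer matches, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def format_reward(completion: str) -> float:
    """Small bonus for the expected <think>...</think><answer>...</answer> structure."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 0.5 if re.match(pattern, completion, re.DOTALL) else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # No neural reward model, no step-level labels: just outcome plus structure.
    return correctness_reward(completion, ground_truth) + format_reward(completion)
```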
When to Use Reasoning Models
Understanding when reasoning models add value versus when they're overkill is crucial for cost-effective AI deployment.
Use reasoning models for:
- Complex mathematical problems
- Multi-step logical reasoning
- Code generation requiring planning
- Scientific analysis
- High-stakes decisions where accuracy matters
Use standard models for:
- Simple Q&A
- Creative writing
- Summarization
- High-throughput applications
- Cost-sensitive use cases
The decision framework: Ask yourself: "Would a human need to think carefully about this, or is it quick and intuitive?" Calculating a tip, answering "What's the capital of France?", or summarizing a paragraph don't require deep thought—standard models excel and are much cheaper. Proving a theorem, debugging a complex algorithm, or analyzing a multi-factor decision require careful reasoning—reasoning models shine here.
The hidden cost of reasoning: It's not just token cost. Reasoning models have higher latency because they generate long chains of thought before answering. A standard model might respond in 500ms; a reasoning model might take 5-30 seconds for complex problems. For interactive applications, this latency matters as much as cost.
Hybrid approaches work well: Use a lightweight classifier to route queries: simple questions go to fast standard models, complex questions go to reasoning models. This captures most of the accuracy gains while controlling costs. Some systems use reasoning models only for verification—generate with a standard model, verify with a reasoning model.
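A sketch of the routing idea, with a deliberately naive keyword-and-length heuristic standing in for whatever classifier you actually train; the model callables are placeholders for your own client code.

```python
from typing import Callable

REASONING_HINTS = ("prove", "derive", "debug", "optimize", "step by step", "why does")

def route(
    query: str,
    fast_model: Callable[[str], str],
    reasoning_model: Callable[[str], str],
) -> str:
    """Send obviously hard queries to the reasoning model, everything else to the cheap one."""
    looks_hard = len(query) > 400 or any(h in query.lower() for h in REASONING_HINTS)
    return reasoning_model(query) if looks_hard else fast_model(query)
```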
Cost Considerations
Reasoning models use significantly more tokens:
| Model Type | Typical Output | Cost Multiplier |
|---|---|---|
| Standard LLM | 100-500 tokens | 1x |
| Reasoning (simple) | 500-2000 tokens | 4-10x |
| Reasoning (complex) | 2000-10000+ tokens | 20-50x |
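A back-of-the-envelope comparison using the token counts above; the per-token price is a placeholder and only the relative multiplier matters (real reasoning models often also charge more per token, so this is conservative).

```python
# Illustrative only: assume $10 per million output tokens for both model types.
price_per_token = 10 / 1_000_000

standard_tokens = 300       # typical standard-LLM response
reasoning_tokens = 6_000    # typical complex reasoning trace plus answer

standard_cost = standard_tokens * price_per_token
reasoning_cost = reasoning_tokens * price_per_token
print(f"standard: ${standard_cost:.4f}, reasoning: ${reasoning_cost:.4f}, "
      f"multiplier: {reasoning_cost / standard_cost:.0f}x")  # ~20x at these sizes
```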
Learning Path
For a complete understanding of reasoning models:
- Start here: Overview of the landscape (this post)
- Inference techniques: Test-Time Compute Scaling - CoT, ToT, MCTS, PRMs, HuggingFace search-and-learn
- Training methods: Training Reasoning Models - PPO, GRPO, reward functions, distillation
The Future
By late 2025, reasoning capabilities will be standard, not optional. The combination of efficient training (GRPO) and effective inference (test-time search) is democratizing advanced AI reasoning.
Lightweight models (10-20B range) will achieve complex reasoning nearly on par with the largest models, making reasoning accessible for most applications.
Related Articles
Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR
A deep technical guide to training reasoning models like o1 and DeepSeek R1—covering PPO, GRPO, reward function design, RLVR, and distillation techniques.
Test-Time Compute Scaling: CoT, ToT, MCTS, and Search-Based Reasoning
A comprehensive guide to inference-time scaling techniques—Chain of Thought, Tree of Thoughts, Monte Carlo Tree Search, Process Reward Models, and the HuggingFace search-and-learn framework.
GRPO: Group Relative Policy Optimization Explained
Understanding Group Relative Policy Optimization—the technique behind DeepSeek's training efficiency and a simpler alternative to PPO-based RLHF.