Reasoning Models: A Brief Framework

Understanding o1, o3, DeepSeek R1, and the shift from pre-training scaling to inference-time and training-time scaling—the defining trend of 2025.

6 min read

The New Frontier of AI Capabilities

2025 marks a fundamental shift in how we scale AI capabilities. For years, the dominant paradigm was simple: more parameters, more training data, more compute at training time. But two new paradigms have emerged that are reshaping where capability gains come from:

  1. Test-Time Compute Scaling: Models improve by "thinking longer" at inference
  2. Novel Training Algorithms: GRPO, RLVR, and verifiable rewards enable reasoning to emerge

This guide provides an overview and links to deep dives on both topics.

The Two Pillars of Reasoning

Test-Time Compute (Inference)

At inference time, reasoning models use extra computation to improve outputs:

| Technique | Description | Compute |
|---|---|---|
| Chain-of-Thought | Step-by-step reasoning | Low |
| Self-Consistency | Multiple paths + voting | Medium |
| Tree of Thoughts | Branching exploration | High |
| MCTS | Monte Carlo Tree Search | Very High |
| Process Reward Models | Step-by-step guidance | High |
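
To make the middle of that spectrum concrete, here is a minimal self-consistency sketch in Python: sample several independent chain-of-thought completions and majority-vote the final answers. The `generate_with_cot` callable is a hypothetical stand-in for whatever LLM client you use, assumed to sample at a nonzero temperature so the reasoning paths actually differ.

```python
from collections import Counter
from typing import Callable

def self_consistency(
    generate_with_cot: Callable[[str], str],  # hypothetical LLM call returning a final answer
    question: str,
    n_samples: int = 8,
) -> str:
    """Sample several reasoning paths and return the majority-vote answer."""
    answers = [generate_with_cot(question) for _ in range(n_samples)]
    # The most common final answer wins; ties resolve to the first one encountered.
    return Counter(answers).most_common(1)[0][0]
```

Tree of Thoughts and MCTS replace this flat sampling with an explicit search over partial reasoning states, which is where the compute cost climbs.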

Deep dive: Test-Time Compute Scaling: CoT, ToT, MCTS, and Search-Based Reasoning

Training-Time Innovations

Novel training algorithms enable reasoning to emerge:

| Algorithm | Key Innovation | Memory |
|---|---|---|
| PPO | Clipped surrogate + value function | 2x model |
| GRPO | Group statistics, no critic | 1x model |
| RLVR | Verifiable rewards only | 1x model |
| DPO | Direct preference optimization | 1x model |
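
The "no critic" row is easiest to see in code. Below is a hedged sketch of GRPO's group-relative advantage: sample a group of completions for one prompt, score them, and normalize each reward against the group's own mean and standard deviation instead of a learned value function. The group size and reward values in the example are illustrative.

```python
import numpy as np

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """Advantage of each completion relative to its own sampled group (no value model)."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled answers to one prompt; two are correct (reward 1), two are wrong.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1., -1.,  1., -1.]
```

These advantages then feed a PPO-style clipped objective, but without maintaining a separate critic, hence the 1x memory footprint in the table.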

Deep dive: Training Reasoning Models: PPO, GRPO, Reward Functions, and RLVR

Key Models

OpenAI o-Series

OpenAI's reasoning models use hidden chain-of-thought and test-time search:

  • o1 (September 2024): First major reasoning model
  • o3 (April 16, 2025): State-of-the-art reasoning with multimodal "thinking with images"
  • o4-mini (April 16, 2025): Cost-efficient reasoning optimized for math and code
  • o3-pro (June 10, 2025): Highest-level performance in the o-series

From OpenAI: "o3 and o4-mini are our first AI models that can 'think with images'—they integrate visual information directly into the reasoning chain."

Key benchmarks (2025):

| Model | AIME 2025 | SWE-bench | GPQA Diamond |
|---|---|---|---|
| o3 | 88.9% | 69.1% | 83.3% |
| o4-mini | 92.7% | 68.1% | 81.4% |

From OpenAI: "o4-mini is the best-performing benchmarked model on AIME 2024 and 2025, achieving remarkable performance for its size and cost."

Note: Sam Altman indicated o3/o4-mini may be the last standalone reasoning models before GPT-5, which will unify traditional and reasoning models.

DeepSeek R1

DeepSeek's open-weight reasoning model introduces GRPO:

  • R1-Zero: Pure RL without SFT—reasoning emerges naturally
  • R1 (January 2025): Multi-stage training for better readability
  • R1-0528 (May 2025): Improved reasoning, JSON output, function-calling
  • V3.1 (August 2025): Hybrid model with switchable thinking/non-thinking modes
  • V3.2-Speciale: Gold-medal performance on 2025 IMO and IOI
  • Distilled variants: 1.5B to 70B parameters

From DeepSeek: "We directly applied reinforcement learning to the base model without relying on supervised fine-tuning. This approach allows the model to explore chain-of-thought for solving complex problems."

R1-0528 update: "Major improvements in inference and hallucination reduction. Performance now approaches O3 and Gemini 2.5 Pro."

V3.1 key feature: One model covers both use cases—switch between reasoning (like R1) and direct answers (like V3) via chat template, eliminating the need to choose between models.

The Core Insight

The breakthrough is that reasoning can emerge from simple rewards. DeepSeek R1-Zero uses only:

  1. Correctness reward: Binary signal for right/wrong answer
  2. Format reward: Encourages thinking structure

No neural reward model. No human labels for reasoning steps. Just optimize for answer correctness, and the model learns to reason.

Why this is revolutionary: Previous approaches assumed you needed to teach reasoning explicitly—either through supervised fine-tuning on reasoning traces or through reward models trained on human judgments of reasoning quality. DeepSeek showed you can skip both. Just reward correct answers, and reasoning emerges as an instrumental strategy the model discovers to get more correct answers. This is analogous to how evolution doesn't explicitly design intelligence—it just rewards survival, and intelligence emerges as a useful adaptation.

The RL magic: When a model generates multiple reasoning paths and some lead to correct answers while others don't, the RL gradient points toward "more of what led to correct answers." Over many iterations, the model learns that certain reasoning patterns (breaking problems into steps, checking intermediate results, considering edge cases) correlate with correctness. It's not following explicit reasoning instructions—it's discovering that reasoning works.

Why format rewards help: Without format constraints, a model might discover correct answers through opaque or inconsistent processes. The format reward (encouraging <think>...</think> tags) provides structure that makes the reasoning readable and debuggable, while still letting the model learn the content of reasoning through outcome-based rewards.
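
A minimal sketch of this reward design, assuming a <think>...</think> format and exact-match answer checking; the 0.1 format bonus and the string comparison are illustrative assumptions, not DeepSeek's exact implementation:

```python
import re

def reasoning_reward(completion: str, reference_answer: str) -> float:
    """Outcome-based reward: binary correctness plus a small format bonus."""
    # Format reward: did the model wrap its reasoning in <think>...</think>?
    has_think_block = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))
    format_reward = 0.1 if has_think_block else 0.0

    # Correctness reward: compare whatever follows the think block to the reference answer.
    final_answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    correctness_reward = 1.0 if final_answer == reference_answer.strip() else 0.0

    return correctness_reward + format_reward
```

In practice the correctness check is usually a verifier (a math-expression matcher, a unit-test run) rather than raw string equality, but the signal stays just as simple: no learned reward model, no per-step labels.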

When to Use Reasoning Models

Understanding when reasoning models add value versus when they're overkill is crucial for cost-effective AI deployment.

Use reasoning models for:

  • Complex mathematical problems
  • Multi-step logical reasoning
  • Code generation requiring planning
  • Scientific analysis
  • High-stakes decisions where accuracy matters

Use standard models for:

  • Simple Q&A
  • Creative writing
  • Summarization
  • High-throughput applications
  • Cost-sensitive use cases

The decision framework: Ask yourself, "Would a human need to think carefully about this, or is it quick and intuitive?" Calculating a tip, answering "What's the capital of France?", or summarizing a paragraph don't require deep thought; standard models excel and are much cheaper. Proving a theorem, debugging a complex algorithm, or analyzing a multi-factor decision require careful reasoning, and that is where reasoning models shine.

The hidden cost of reasoning: It's not just token cost. Reasoning models have higher latency because they generate long chains of thought before answering. A standard model might respond in 500ms; a reasoning model might take 5-30 seconds for complex problems. For interactive applications, this latency matters as much as cost.

Hybrid approaches work well: Use a lightweight classifier to route queries: simple questions go to fast standard models, complex questions go to reasoning models. This captures most of the accuracy gains while controlling costs. Some systems use reasoning models only for verification—generate with a standard model, verify with a reasoning model.
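
A minimal routing sketch under those assumptions: a keyword heuristic stands in for the lightweight classifier, and the two model clients are hypothetical callables you would replace with your actual API wrappers.

```python
from typing import Callable

COMPLEX_HINTS = ("prove", "debug", "derive", "optimize", "multi-step", "plan")

def needs_reasoning(query: str) -> bool:
    """Crude complexity check; swap in a small trained classifier in production."""
    q = query.lower()
    return len(q.split()) > 40 or any(hint in q for hint in COMPLEX_HINTS)

def route(
    query: str,
    standard_model: Callable[[str], str],   # fast, cheap client (assumed)
    reasoning_model: Callable[[str], str],  # slower, pricier reasoning client (assumed)
) -> str:
    return reasoning_model(query) if needs_reasoning(query) else standard_model(query)
```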

Cost Considerations

Reasoning models use significantly more tokens:

| Model Type | Typical Output | Cost Multiplier |
|---|---|---|
| Standard LLM | 100-500 tokens | 1x |
| Reasoning (simple) | 500-2,000 tokens | 4-10x |
| Reasoning (complex) | 2,000-10,000+ tokens | 20-50x |
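
A quick back-of-the-envelope version of that table; the per-token price here is a made-up placeholder you would replace with your provider's actual rates.

```python
def output_cost(output_tokens: int, price_per_1k_tokens: float = 0.01) -> float:
    """Dollar cost of one response's output tokens (illustrative pricing)."""
    return output_tokens / 1000 * price_per_1k_tokens

standard = output_cost(300)      # ~$0.003 per response
reasoning = output_cost(6_000)   # ~$0.06 per response, about 20x the standard call
print(f"standard = ${standard:.4f}, reasoning = ${reasoning:.4f}")
```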

Learning Path

For a complete understanding of reasoning models:

  1. Start here: Overview of the landscape (this post)
  2. Inference techniques: Test-Time Compute Scaling - CoT, ToT, MCTS, PRMs, HuggingFace search-and-learn
  3. Training methods: Training Reasoning Models - PPO, GRPO, reward functions, distillation

The Future

By late 2025, reasoning capabilities will be standard, not optional. The combination of efficient training (GRPO) and effective inference (test-time search) is democratizing advanced AI reasoning.

Lightweight models (10-20B range) will achieve complex reasoning nearly on par with the largest models, making reasoning accessible for most applications.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
