
Code LLMs: Architecture, Training, and the State of AI-Assisted Programming

A comprehensive guide to code-specialized language models—from DeepSeek Coder and Qwen Coder to CodeLlama and StarCoder. Understanding how code models are trained, what makes them different from general LLMs, and how to choose the right one for your application.


The Rise of Code-Specialized LLMs

Code generation has become one of the most impactful applications of large language models. What started as autocomplete suggestions has evolved into AI systems that can implement complete features, debug complex issues, and reason about codebases at a level that genuinely augments developer productivity.

The code LLM landscape in 2025 is remarkably competitive. Open-source models now match or exceed proprietary offerings on many benchmarks:

Qwen 2.5 Coder 7B scores 88.4% on HumanEval—surpassing GPT-4's 87.1% with a model a fraction of the size.

DeepSeek Coder V2 achieves GPT-4 Turbo-level performance on code tasks with only 21B active parameters (in a 236B total MoE architecture).

Open-source models on LiveCodeBench now rival closed models from just a year ago, with Qwen3-235B scoring 70.7 on code generation.

This guide explores what makes code models different, how they're trained, and how to choose the right one for your use case.

What Makes Code Different from Natural Language

Code is fundamentally different from prose, and these differences profoundly impact model architecture and training:

Structural Properties

Syntax matters absolutely: In prose, "the dog bit the man" and "the man bit the dog" are both grammatical. In code, a single missing semicolon or mismatched bracket is a hard error. Code models must learn rigid syntactic rules.

Long-range dependencies: A variable defined hundreds of lines earlier must be referenced correctly. A function signature must match its implementation. These dependencies span far longer distances than those in typical prose.

Hierarchical structure: Code has explicit structure—files contain classes, classes contain methods, methods contain blocks. Understanding this hierarchy is essential.

Multiple formalisms: A single codebase often mixes Python, JavaScript, SQL, YAML, JSON, Markdown, and shell scripts, each with its own syntax and semantics.

Semantic Properties

Execution semantics: Code does something. Understanding what code does (semantically) is different from understanding what code says (syntactically).

Type systems: Many languages have explicit types that constrain what's valid. Type-aware generation is more constrained than free-form text.

API knowledge: Good code generation requires knowing libraries, frameworks, and their conventions—extensive world knowledge specific to programming.

Cross-file context: Unlike a self-contained essay, code files reference other files, imported modules, and system APIs.

Pragmatic Properties

Correctness is testable: Unlike prose quality (subjective), code correctness can be verified by execution and tests.

Multiple valid solutions: Many different implementations can solve the same problem. Code models must learn that different doesn't mean wrong.

Style conventions: PEP 8 for Python, linting rules, project-specific conventions. Models should respect these.

Code Model Architectures

Code LLMs share architectural foundations with general language models but incorporate adaptations for code:

Decoder-Only Transformers

Most modern code LLMs use decoder-only transformer architectures (GPT-style), the same as general LLMs:

Why decoder-only for code?

  • Code completion is naturally left-to-right (fill-in-the-middle training, covered below, handles insertion mid-file)
  • Unified architecture simplifies training and inference
  • Strong performance on both completion and generation

Standard components:

  • Token embeddings + positional encoding
  • Multi-head self-attention layers
  • Feed-forward networks
  • Layer normalization
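
To make the stack concrete, here is a minimal sketch of one such decoder block in PyTorch. The dimensions and pre-norm layout are illustrative defaults, not the configuration of any particular code model:

Code
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal pre-norm decoder block: masked self-attention + feed-forward."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each token may only attend to itself and earlier tokens.
        seq_len = x.size(1)
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                  # residual connection around attention
        x = x + self.ff(self.norm2(x))    # residual connection around the FFN
        return x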

Extended Context Windows

Code requires longer context than typical prose:

  • CodeLlama: Extended to 100K tokens (from Llama 2's 4K)
  • DeepSeek Coder V2: 128K context window
  • Qwen Coder: Varies by model size, up to 128K

Why longer context matters for code:

  • Complete files often exceed 4K tokens
  • Cross-file context requires seeing multiple files
  • Long function implementations need full visibility

Context extension techniques:

  • RoPE scaling (position interpolation, NTK-aware scaling)
  • ALiBi (Attention with Linear Biases)
  • Continued pre-training on long sequences
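
As a rough sketch of position interpolation (one of the RoPE scaling tricks above): the rotary angles are computed with positions divided by a scale factor, so new, longer positions fall into the range the model saw during training. The dimensions and scale below are illustrative:

Code
import torch

def rope_frequencies(head_dim, positions, base=10000.0, scale=1.0):
    """Rotary embedding angles with position interpolation.

    Dividing positions by `scale` (e.g. 8 to stretch a 16K-trained model
    toward 128K) squeezes new positions into the range seen at training time.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled_pos = positions.float() / scale          # position interpolation
    angles = torch.outer(scaled_pos, inv_freq)      # (seq_len, head_dim/2)
    return torch.cos(angles), torch.sin(angles)

# e.g. positions up to 131072 with scale=8 behave like positions <= 16384
cos, sin = rope_frequencies(head_dim=128, positions=torch.arange(131072), scale=8.0)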

Mixture of Experts (MoE)

Several leading code models use MoE architectures:

DeepSeek Coder V2:

  • 236B total parameters
  • Only 21B active per token
  • Expert specialization may align with language/domain

Benefits for code:

  • Capacity for many programming languages without interference
  • Efficient inference despite large total size
  • Different experts can specialize in different languages/patterns
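
The routing idea can be sketched in a few lines of PyTorch. This is a toy top-2 router with illustrative dimensions, far simpler than DeepSeekMoE's shared-plus-routed expert design, but it shows how only a few experts run per token:

Code
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Simplified top-2 routed mixture-of-experts feed-forward layer."""

    def __init__(self, d_model=1024, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)    # pick 2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                     # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out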

Fill-in-the-Middle (FIM) Training

Standard left-to-right training can't handle code completion in the middle of a file. FIM training addresses this:

The FIM transformation:

  1. Split document into prefix, middle, and suffix
  2. Rearrange as: <PREFIX>{prefix}<SUFFIX>{suffix}<MIDDLE>{middle}
  3. Train model to predict the middle given prefix and suffix

Example:

Code
Original:
def add(a, b):
    return a + b

FIM format:
<PREFIX>def add(a, b):
    return <SUFFIX>

<MIDDLE>a + b

FIM enables:

  • Cursor-position completion (most common IDE use case)
  • Infilling between code blocks
  • Generating implementations that fit signatures

Most code models (CodeLlama, StarCoder, DeepSeek Coder, Qwen Coder) include FIM training.

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                 FILL-IN-THE-MIDDLE (FIM) TRAINING                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  STANDARD LEFT-TO-RIGHT TRAINING:                                           │
│  ─────────────────────────────────                                          │
│                                                                             │
│  "def fibonacci(n):\n    if n <= 1:\n        return n\n    return..."      │
│   ───────────────────────────────────────────────────────────────────►      │
│                                                                             │
│  Model predicts each token given all previous tokens                        │
│  Problem: Can't complete in the MIDDLE of existing code                     │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  FIM TRAINING:                                                              │
│  ─────────────                                                              │
│                                                                             │
│  Original code:                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ def fibonacci(n):                                                    │   │
│  │     if n <= 1:                                                       │   │
│  │         return n                          ← We want to generate this │   │
│  │     return fibonacci(n-1) + fibonacci(n-2)                          │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Split into three parts:                                                    │
│  PREFIX: "def fibonacci(n):\n    if n <= 1:\n        return "              │
│  MIDDLE: "n"                                                                │
│  SUFFIX: "\n    return fibonacci(n-1) + fibonacci(n-2)"                    │
│                                                                             │
│  FIM format for training:                                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ <PRE> def fibonacci(n):                                              │   │
│  │     if n <= 1:                                                       │   │
│  │         return <SUF>                                                 │   │
│  │     return fibonacci(n-1) + fibonacci(n-2) <MID> n                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Model learns to generate MIDDLE given PREFIX + SUFFIX context              │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  FIM MODES:                                                                 │
│                                                                             │
│  PSM (Prefix-Suffix-Middle): <PRE>...<SUF>...<MID>... (shown above)        │
│  SPM (Suffix-Prefix-Middle): <SUF>...<PRE>...<MID>...                       │
│                                                                             │
│  Both are used during training with random selection                        │
│  Typical FIM rate: 50% of training examples use FIM format                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
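
In code, the transformation above is roughly the following. The sentinel strings are placeholders, since each model family defines its own FIM tokens, and the 50% FIM rate and random PSM/SPM choice follow the typical recipe described in the diagram:

Code
import random

# Sentinel strings are placeholders; each model family defines its own FIM tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def apply_fim(code: str, fim_rate: float = 0.5) -> str:
    """Turn a training document into a FIM example with probability `fim_rate`."""
    if random.random() >= fim_rate:
        return code                                   # plain left-to-right example
    # Pick two random cut points to split into prefix / middle / suffix
    i, j = sorted(random.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    if random.random() < 0.5:                         # PSM ordering
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    else:                                             # SPM ordering
        return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"

print(apply_fim("def fibonacci(n):\n    if n <= 1:\n        return n\n"))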

Training Code Models

Training a code-specialized LLM involves several stages:

Stage 1: Code Pre-training Corpus

The foundation is a massive code corpus. Key considerations:

Sources:

  • GitHub (the dominant source)
  • GitLab, Bitbucket
  • Package repositories (PyPI, npm, crates.io)
  • Documentation sites
  • Stack Overflow
  • Technical blogs

Language distribution: DeepSeek Coder V2 supports 338 programming languages. However, distribution is highly skewed:

  • Python, JavaScript, Java, C/C++, Go dominate
  • Long-tail languages have limited data

License filtering: Many code datasets filter for permissive licenses (MIT, Apache, BSD) to avoid copyright issues. This is more conservative but legally safer.

Quality signals:

  • Stars/forks on repositories
  • Presence of tests
  • Documentation quality
  • Code style (linting passes)
  • Recency (not abandoned)
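
A toy sketch of how license filtering and quality signals might be combined when selecting repositories; the field names and thresholds are invented for illustration, and real pipelines use far richer signals:

Code
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}

def keep_repository(repo: dict) -> bool:
    """Toy filter combining license and quality signals (thresholds are illustrative)."""
    if repo.get("license", "").lower() not in PERMISSIVE_LICENSES:
        return False                      # license filtering
    if repo.get("stars", 0) < 5:
        return False                      # popularity signal
    if not repo.get("has_tests", False):
        return False                      # presence of tests
    if repo.get("years_since_last_commit", 0) > 3:
        return False                      # recency: skip abandoned projects
    return True

repos = [
    {"name": "useful-lib", "license": "MIT", "stars": 120,
     "has_tests": True, "years_since_last_commit": 0},
    {"name": "old-dump", "license": "GPL-3.0", "stars": 2,
     "has_tests": False, "years_since_last_commit": 6},
]
print([r["name"] for r in repos if keep_repository(r)])   # ['useful-lib']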

Stage 2: General Pre-training

Most code models start from a general language model checkpoint (or train jointly on code + text):

Why include natural language:

  • Documentation understanding
  • Reasoning in natural language
  • Instruction following
  • Code-comment relationships

Mixture strategies:

  • StarCoder: ~80% code, ~20% GitHub issues/docs
  • Llama 3: Up to 35% code in mixture
  • Code-specific models often use higher code ratios

Stage 3: Code-Specific Pre-training

Extended training on code-heavy data:

Curriculum considerations:

  • Start with high-quality repositories
  • Include diverse languages
  • Balance popular languages with long-tail
  • Include documentation alongside code

FIM integration: Typically 50% of examples use FIM format, 50% use standard left-to-right.

Stage 4: Instruction Tuning

Fine-tune on code-specific instructions:

Instruction types:

  • "Write a function that..."
  • "Fix this bug..."
  • "Explain this code..."
  • "Refactor to improve..."
  • "Add tests for..."
  • "Convert from Python to JavaScript..."

Data sources:

  • Human-written instruction-response pairs
  • LLM-generated (using GPT-4 or similar)
  • Derived from commits (commit message = instruction, diff = response)
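
As an illustration of the commit-derived route, a pair might be assembled roughly like this; the schema and field names are hypothetical:

Code
def commit_to_instruction_pair(commit: dict) -> dict:
    """Turn a commit into an instruction-tuning example (schema is illustrative)."""
    return {
        "instruction": f"{commit['message']}\n\nOriginal code:\n{commit['before']}",
        "response": commit["after"],      # or the unified diff itself
    }

example = commit_to_instruction_pair({
    "message": "Fix off-by-one error in pagination",
    "before": "pages = total // page_size",
    "after": "pages = (total + page_size - 1) // page_size",
})
print(example["instruction"])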

Stage 5: Alignment (Optional)

RLHF or DPO on code preferences:

Preference signals:

  • Correctness (does it pass tests?)
  • Style (follows conventions?)
  • Efficiency (reasonable complexity?)
  • Safety (avoids vulnerabilities?)
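
The correctness signal is often literal execution. A minimal sketch, assuming candidates are short self-contained Python snippets and ignoring the sandboxing a real pipeline would need:

Code
import subprocess
import sys
import tempfile

def passes_tests(solution: str, test_code: str, timeout: int = 10) -> bool:
    """Binary correctness signal: does the candidate pass its unit tests?

    Runs untrusted code in a subprocess; a real pipeline would add sandboxing.
    """
    program = solution + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

solution = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(solution, tests))   # True if the candidate is correct
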
Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CODE MODEL TRAINING PIPELINE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  STAGE 1: BASE MODEL                                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Option A: Start from general LLM (Llama, Qwen base)                 │   │
│  │ Option B: Train from scratch on code-heavy mixture                  │   │
│  │                                                                       │   │
│  │ Data: General text + code (varying ratios)                           │   │
│  │ Objective: Standard next-token prediction                            │   │
│  │ Duration: Weeks on large GPU clusters                                │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│                              ▼                                              │
│  STAGE 2: CODE-SPECIFIC PRE-TRAINING                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Continue training on code-heavy corpus                               │   │
│  │                                                                       │   │
│  │ Data:                                                                 │   │
│  │   - GitHub repositories (filtered by quality, license)              │   │
│  │   - 338+ programming languages                                       │   │
│  │   - Documentation and comments                                        │   │
│  │   - Technical Q&A (Stack Overflow)                                   │   │
│  │                                                                       │   │
│  │ Training modifications:                                               │   │
│  │   - 50% FIM (Fill-in-the-Middle) format                              │   │
│  │   - Extended context training (16K → 128K)                           │   │
│  │   - Repository-level context (multiple files)                        │   │
│  │                                                                       │   │
│  │ Duration: Days to weeks                                               │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│                              ▼                                              │
│  STAGE 3: INSTRUCTION TUNING                                                │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Fine-tune on instruction-following data                              │   │
│  │                                                                       │   │
│  │ Data:                                                                 │   │
│  │   - Code generation instructions                                     │   │
│  │   - Bug fixing examples                                               │   │
│  │   - Code explanation tasks                                            │   │
│  │   - Refactoring demonstrations                                        │   │
│  │   - Multi-turn coding conversations                                  │   │
│  │                                                                       │   │
│  │ Duration: Hours to days                                               │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│                              ▼                                              │
│  STAGE 4: ALIGNMENT (Optional)                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ RLHF or DPO on code preferences                                      │   │
│  │                                                                       │   │
│  │ Preference signals:                                                   │   │
│  │   - Correctness (execution, tests)                                   │   │
│  │   - Style and conventions                                             │   │
│  │   - Security (no vulnerabilities)                                    │   │
│  │   - Efficiency                                                        │   │
│  │                                                                       │   │
│  │ Duration: Hours to days                                               │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│                              ▼                                              │
│                    Production Code Model                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

The 2025 Code Model Landscape

DeepSeek Coder

DeepSeek Coder V2 from DeepSeek AI represents the current frontier of open-source code models:

Architecture:

  • Mixture of Experts (MoE) based on DeepSeekMoE
  • Two sizes: 16B (2.4B active) and 236B (21B active)
  • 128K context window (extended from 16K)
  • 338 programming languages supported

Training:

  • Pre-trained from DeepSeek-V2 intermediate checkpoint
  • Additional 6 trillion tokens of code-heavy training
  • Extensive FIM training

Performance:

  • Achieves GPT-4 Turbo-level performance on code benchmarks
  • Strong on HumanEval, MBPP, and real-world coding tasks
  • Efficient inference due to MoE architecture

Strengths: Frontier performance, efficient inference, massive context window
Considerations: Large model size; MoE requires compatible infrastructure

Qwen Coder

Qwen 2.5 Coder from Alibaba's Qwen team leads many code benchmarks:

Model sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B

Performance highlights:

  • 7B model scores 88.4% on HumanEval (exceeds GPT-4's 87.1%)
  • 82.0% on Spider SQL benchmark (vs. 76.6% for Codestral)
  • Strong real-world coding reports from developers

Strengths:

  • Excellent quality at smaller sizes (7B punches above its weight)
  • Strong SQL/database capabilities
  • Consistent, reliable code generation
  • Multilingual (code + natural language)

Developer feedback: "Qwen Coder had absolutely no problem producing code, completing tasks perfectly. Even variations of prompts multiple times showed that it wasn't a fluke—Qwen delivered every time."

CodeLlama

CodeLlama from Meta, based on Llama 2:

Variants:

  • CodeLlama (base): General code model
  • CodeLlama-Python: Python-specialized
  • CodeLlama-Instruct: Instruction-tuned

Sizes: 7B, 13B, 34B, 70B

Features:

  • Extended context to 100K tokens
  • FIM capability
  • Strong infilling performance

Status: Somewhat superseded by newer models but well-documented and widely deployed.

StarCoder / StarCoder2

StarCoder2 from BigCode (open collaboration):

Training data: The Stack v2 (67TB of permissively licensed code)

Sizes: 3B, 7B, 15B

Features:

  • Fully open (weights, data, training code)
  • Grouped Query Attention
  • FIM training

Strengths: Complete transparency, ethical data sourcing, strong community

Codestral

Codestral from Mistral AI:

Size: 22B parameters

Performance: 81.1% on HumanEval

Features:

  • 32K context window
  • Supports 80+ languages
  • FIM capability

Strengths: Good balance of size and capability, Mistral ecosystem integration

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CODE MODEL COMPARISON (2025)                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  MODEL            │ SIZE       │ HUMANEVAL │ CONTEXT │ KEY STRENGTH         │
│  ────────────────────────────────────────────────────────────────────────   │
│                                                                             │
│  Qwen 2.5 Coder   │ 7B-32B     │ 88.4%     │ 128K    │ Best small-model     │
│                   │            │ (7B)      │         │ perf, SQL strength   │
│                                                                             │
│  DeepSeek Coder   │ 16B/236B   │ GPT-4     │ 128K    │ Frontier quality,    │
│  V2               │ (MoE)      │ level     │         │ 338 languages        │
│                                                                             │
│  Codestral        │ 22B        │ 81.1%     │ 32K     │ Balanced, Mistral    │
│                   │            │           │         │ ecosystem            │
│                                                                             │
│  CodeLlama        │ 7B-70B     │ ~75%      │ 100K    │ Well-documented,     │
│                   │            │ (34B)     │         │ mature               │
│                                                                             │
│  StarCoder2       │ 3B-15B     │ ~70%      │ 16K     │ Fully open, ethical  │
│                   │            │ (15B)     │         │ data sourcing        │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  PROPRIETARY REFERENCES:                                                    │
│                                                                             │
│  GPT-4            │ Unknown    │ 87.1%     │ 128K    │ Strong reasoning     │
│  GPT-4o           │ Unknown    │ 90.2%     │ 128K    │ Current OpenAI best  │
│  Claude 3.5       │ Unknown    │ 92%+      │ 200K    │ Agentic coding       │
│  Sonnet           │            │           │         │                      │
│                                                                             │
│  Note: HumanEval is one benchmark; real-world performance varies.           │
│  LiveCodeBench and SWE-bench provide more realistic evaluation.             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Evaluation: Beyond HumanEval

HumanEval (synthesizing functions from docstrings) is the most cited benchmark, but it has limitations:

HumanEval Limitations

Too easy: Top models saturate HumanEval. It no longer differentiates frontier models.

Not realistic: Short, self-contained functions don't represent real coding work.

Narrow scope: Python-focused, doesn't test cross-file reasoning.
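
For context on how these percentages are computed: HumanEval results are reported as pass@k, typically via the unbiased estimator over n sampled completions per problem, of which c pass the tests. A minimal sketch:

Code
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0                     # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples for one problem, 37 of them correct
print(round(pass_at_k(n=200, c=37, k=1), 3))    # ~0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))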

Better Benchmarks

MBPP (Mostly Basic Python Problems): More problems, similar format to HumanEval.

LiveCodeBench: Real competitive programming problems, updated regularly to prevent contamination.

SWE-bench: Real GitHub issues. Models must navigate repositories, understand context, and produce correct patches. Much harder and more realistic.

Spider: SQL generation from natural language. Tests database understanding.

HumanEval+: Enhanced HumanEval with more test cases, catching edge case failures.

BigCodeBench: Tests function calls to libraries, more practical than self-contained algorithms.

Evaluation Dimensions

Beyond accuracy, consider:

Latency: How fast can the model generate code?

Cost: Token cost for your volume.

Context handling: Does it use long context effectively?

Multi-file reasoning: Can it navigate codebases?

Instruction following: Does it follow specific requirements?

Code style: Does it match your conventions?

Practical Deployment Considerations

Choosing a Model

For IDE autocomplete (latency-critical):

  • Smaller, faster models (Qwen 2.5 Coder 7B, StarCoder2 7B)
  • Local deployment if privacy matters

For complex generation (quality-critical):

  • Larger models (Qwen 2.5 Coder 32B, DeepSeek Coder V2)
  • Can tolerate higher latency

For agentic coding (multi-step reasoning):

  • Largest available models
  • Long context essential
  • Often Claude 3.5 Sonnet or GPT-4o for best results

For specialized domains:

  • Fine-tune on domain-specific code
  • Or use larger general models with domain context

Inference Optimization

Code models benefit from standard LLM optimization:

Quantization: 4-bit or 8-bit quantization for memory efficiency. AWQ, GPTQ, or GGUF formats.

Speculative decoding: Small draft model accelerates large model generation.

KV cache optimization: Essential for long context (vLLM's paged attention).

Batching: Continuous batching for multi-user deployments.
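
A minimal serving sketch that combines several of these: vLLM with an AWQ-quantized checkpoint and its paged KV cache handling long context. The model name and settings are illustrative; substitute whatever quantized checkpoint you actually deploy:

Code
from vllm import LLM, SamplingParams

# Model name and quantization setting are illustrative; use the AWQ/GPTQ
# checkpoint you actually deploy.
llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct-AWQ", quantization="awq",
          max_model_len=32768)

params = SamplingParams(temperature=0.2, max_tokens=256)
prompt = "Write a Python function that parses an ISO-8601 date string."
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)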

Context Management

Long context doesn't mean infinite context:

Retrieval augmentation: Retrieve relevant files/functions instead of stuffing everything.

Hierarchical context: Repository structure summary + detailed relevant files.

Sliding window: For very long files, maintain relevant window.
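
A toy sketch of hierarchical context assembly: a repository structure summary plus the few most relevant files, with keyword overlap standing in for a real retriever (embeddings, symbol indexes, and so on). Paths and scoring are illustrative:

Code
from pathlib import Path

def build_context(repo_root: str, query: str, max_files: int = 3) -> str:
    """Toy hierarchical context: repo file tree + the few most relevant files."""
    files = list(Path(repo_root).rglob("*.py"))
    tree = "\n".join(str(p.relative_to(repo_root)) for p in files)

    def score(path: Path) -> int:
        # Naive relevance: count query keywords in the file text.
        text = path.read_text(errors="ignore").lower()
        return sum(text.count(word) for word in query.lower().split())

    relevant = sorted(files, key=score, reverse=True)[:max_files]
    bodies = "\n\n".join(
        f"# {p.relative_to(repo_root)}\n{p.read_text(errors='ignore')}" for p in relevant
    )
    return f"Repository structure:\n{tree}\n\nRelevant files:\n{bodies}"

# context = build_context("path/to/repo", "pagination bug in list endpoint")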

The Future of Code Models

Several trends are shaping the future:

Repository-Level Understanding

Moving beyond file-level to understanding entire codebases:

  • Cross-file dependencies
  • Project architecture
  • Test-code relationships
  • CI/CD integration

Agentic Coding

Models that can:

  • Navigate codebases autonomously
  • Run tests and iterate
  • Use tools (linters, type checkers, search)
  • Plan multi-step implementations

SWE-bench is the benchmark pushing this frontier.

Formal Verification Integration

Combining LLMs with formal methods:

  • Type-directed generation
  • Proof-guided synthesis
  • Specification-to-code pipelines

Personalization

Models that learn individual/team preferences:

  • Coding style adaptation
  • Framework preferences
  • Architecture patterns

Real-Time Collaboration

Integration with development workflows:

  • PR review assistance
  • Continuous code improvement suggestions
  • Documentation generation


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
