
Code LLMs: Architecture, Training, and the State of AI-Assisted Programming

A comprehensive guide to code-specialized language models—from DeepSeek Coder and Qwen Coder to CodeLlama and StarCoder. Understanding how code models are trained, what makes them different from general LLMs, and how to choose the right one for your application.


The Rise of Code-Specialized LLMs

Code generation has become one of the most impactful applications of large language models. What started as autocomplete suggestions has evolved into AI systems that can implement complete features, debug complex issues, and reason about codebases at a level that genuinely augments developer productivity.

The code LLM landscape in 2025 is remarkably competitive. Open-source models now match or exceed proprietary offerings on many benchmarks:

Qwen 2.5 Coder 7B scores 88.4% on HumanEval—surpassing GPT-4's 87.1% with a model a fraction of the size.

DeepSeek Coder V2 achieves GPT-4 Turbo-level performance on code tasks with only 21B active parameters (in a 236B total MoE architecture).

Open-source models on LiveCodeBench now rival closed models from just a year ago, with Qwen3-235B scoring 70.7 on code generation.

This guide explores what makes code models different, how they're trained, and how to choose the right one for your use case.

What Makes Code Different from Natural Language

Code is fundamentally different from prose, and these differences profoundly impact model architecture and training:

Structural Properties

Syntax matters absolutely: In prose, "the dog bit the man" and "the man bit the dog" are both grammatical. In code, a single missing semicolon or mismatched bracket is a hard error. Code models must learn rigid syntactic rules.

Long-range dependencies: A variable defined hundreds of lines earlier must be referenced correctly. A function signature must match its implementation. These dependencies span far longer distances than those in typical prose.

Hierarchical structure: Code has explicit structure—files contain classes, classes contain methods, methods contain blocks. Understanding this hierarchy is essential.

Multiple formalisms: A single codebase often mixes Python, JavaScript, SQL, YAML, JSON, Markdown, and shell scripts, each with its own syntax and semantics.

Semantic Properties

Execution semantics: Code does something. Understanding what code does (semantically) is different from understanding what code says (syntactically).

Type systems: Many languages have explicit types that constrain what's valid. Type-aware generation is more constrained than free-form text.

API knowledge: Good code generation requires knowing libraries, frameworks, and their conventions—extensive world knowledge specific to programming.

Cross-file context: Unlike a self-contained essay, code files reference other files, imported modules, and system APIs.

Pragmatic Properties

Correctness is testable: Unlike prose quality (subjective), code correctness can be verified by execution and tests.

Multiple valid solutions: Many different implementations can solve the same problem. Code models must learn that different doesn't mean wrong.

Style conventions: PEP 8 for Python, linting rules, project-specific conventions. Models should respect these.

Code Model Architectures

Code LLMs share architectural foundations with general language models but incorporate adaptations for code:

Decoder-Only Transformers

Most modern code LLMs use decoder-only transformer architectures (GPT-style), the same as general LLMs:

Why decoder-only for code?

  • Code completion is naturally left-to-right (fill-in-the-middle training, covered below, handles insertion mid-file)
  • Unified architecture simplifies training and inference
  • Strong performance on both completion and generation

Standard components:

  • Token embeddings + positional encoding
  • Multi-head self-attention layers
  • Feed-forward networks
  • Layer normalization
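
To make the stack concrete, here is a minimal sketch of one such decoder block in PyTorch. The dimensions and pre-norm layout are illustrative defaults, not the configuration of any particular code model:

Code
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal pre-norm decoder block: masked self-attention + feed-forward."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each token may only attend to itself and earlier tokens.
        seq_len = x.size(1)
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                  # residual connection around attention
        x = x + self.ff(self.norm2(x))    # residual connection around the FFN
        return x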

Extended Context Windows

Code requires longer context than typical prose:

  • CodeLlama: Extended to 100K tokens (from Llama 2's 4K)
  • DeepSeek Coder V2: 128K context window
  • Qwen Coder: Varies by model size, up to 128K

Why longer context matters for code:

  • Complete files often exceed 4K tokens
  • Cross-file context requires seeing multiple files
  • Long function implementations need full visibility

Context extension techniques:

  • RoPE scaling (position interpolation, NTK-aware scaling)
  • ALiBi (Attention with Linear Biases)
  • Continued pre-training on long sequences
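
As a rough sketch of position interpolation (one of the RoPE scaling tricks above): the rotary angles are computed with positions divided by a scale factor, so new, longer positions fall into the range the model saw during training. The dimensions and scale below are illustrative:

Code
import torch

def rope_frequencies(head_dim, positions, base=10000.0, scale=1.0):
    """Rotary embedding angles with position interpolation.

    Dividing positions by `scale` (e.g. 8 to stretch a 16K-trained model
    toward 128K) squeezes new positions into the range seen at training time.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled_pos = positions.float() / scale          # position interpolation
    angles = torch.outer(scaled_pos, inv_freq)      # (seq_len, head_dim/2)
    return torch.cos(angles), torch.sin(angles)

# e.g. positions up to 131072 with scale=8 behave like positions <= 16384
cos, sin = rope_frequencies(head_dim=128, positions=torch.arange(131072), scale=8.0)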

Mixture of Experts (MoE)

Several leading code models use MoE architectures:

DeepSeek Coder V2:

  • 236B total parameters
  • Only 21B active per token
  • Expert specialization may align with language/domain

Benefits for code:

  • Capacity for many programming languages without interference
  • Efficient inference despite large total size
  • Different experts can specialize in different languages/patterns
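
The routing idea can be sketched in a few lines of PyTorch. This is a toy top-2 router with illustrative dimensions, far simpler than DeepSeekMoE's shared-plus-routed expert design, but it shows how only a few experts run per token:

Code
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Simplified top-2 routed mixture-of-experts feed-forward layer."""

    def __init__(self, d_model=1024, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)    # pick 2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                     # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out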

Fill-in-the-Middle (FIM) Training

Standard left-to-right training can't handle code completion in the middle of a file. FIM training addresses this:

The FIM transformation:

  1. Split document into prefix, middle, and suffix
  2. Rearrange as: <PREFIX>{prefix}<SUFFIX>{suffix}<MIDDLE>{middle}
  3. Train model to predict the middle given prefix and suffix

Example:

Code
Original:
def add(a, b):
    return a + b

FIM format:
<PREFIX>def add(a, b):
    return <SUFFIX>

<MIDDLE>a + b

FIM enables:

  • Cursor-position completion (most common IDE use case)
  • Infilling between code blocks
  • Generating implementations that fit signatures

Most code models (CodeLlama, StarCoder, DeepSeek Coder, Qwen Coder) include FIM training.

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                 FILL-IN-THE-MIDDLE (FIM) TRAINING                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  STANDARD LEFT-TO-RIGHT TRAINING:                                           │
│  ─────────────────────────────────                                          │
│                                                                             │
│  "def fibonacci(n):\n    if n <= 1:\n        return n\n    return..."      │
│   ───────────────────────────────────────────────────────────────────►      │
│                                                                             │
│  Model predicts each token given all previous tokens                        │
│  Problem: Can't complete in the MIDDLE of existing code                     │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  FIM TRAINING:                                                              │
│  ─────────────                                                              │
│                                                                             │
│  Original code:                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ def fibonacci(n):                                                    │   │
│  │     if n <= 1:                                                       │   │
│  │         return n                          ← We want to generate this │   │
│  │     return fibonacci(n-1) + fibonacci(n-2)                          │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Split into three parts:                                                    │
│  PREFIX: "def fibonacci(n):\n    if n <= 1:\n        return "              │
│  MIDDLE: "n"                                                                │
│  SUFFIX: "\n    return fibonacci(n-1) + fibonacci(n-2)"                    │
│                                                                             │
│  FIM format for training:                                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ <PRE> def fibonacci(n):                                              │   │
│  │     if n <= 1:                                                       │   │
│  │         return <SUF>                                                 │   │
│  │     return fibonacci(n-1) + fibonacci(n-2) <MID> n                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  Model learns to generate MIDDLE given PREFIX + SUFFIX context              │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  FIM MODES:                                                                 │
│                                                                             │
│  PSM (Prefix-Suffix-Middle): <PRE>...<SUF>...<MID>... (shown above)        │
│  SPM (Suffix-Prefix-Middle): <SUF>...<PRE>...<MID>...                       │
│                                                                             │
│  Both are used during training with random selection                        │
│  Typical FIM rate: 50% of training examples use FIM format                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
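
In code, the transformation above is roughly the following. The sentinel strings are placeholders, since each model family defines its own FIM tokens, and the 50% FIM rate and random PSM/SPM choice follow the typical recipe described in the diagram:

Code
import random

# Sentinel strings are placeholders; each model family defines its own FIM tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def apply_fim(code: str, fim_rate: float = 0.5) -> str:
    """Turn a training document into a FIM example with probability `fim_rate`."""
    if random.random() >= fim_rate:
        return code                                   # plain left-to-right example
    # Pick two random cut points to split into prefix / middle / suffix
    i, j = sorted(random.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    if random.random() < 0.5:                         # PSM ordering
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    else:                                             # SPM ordering
        return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"

print(apply_fim("def fibonacci(n):\n    if n <= 1:\n        return n\n"))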

Training Code Models

Training a code-specialized LLM involves several stages:

Stage 1: Code Pre-training Corpus

The foundation is a massive code corpus. Key considerations:

Sources:

  • GitHub (the dominant source)
  • GitLab, Bitbucket
  • Package repositories (PyPI, npm, crates.io)
  • Documentation sites
  • Stack Overflow
  • Technical blogs

Language distribution: DeepSeek Coder V2 supports 338 programming languages. However, distribution is highly skewed:

  • Python, JavaScript, Java, C/C++, Go dominate
  • Long-tail languages have limited data

License filtering: Many code datasets filter for permissive licenses (MIT, Apache, BSD) to avoid copyright issues. This is more conservative but legally safer.

Quality signals:

  • Stars/forks on repositories
  • Presence of tests
  • Documentation quality
  • Code style (linting passes)
  • Recency (not abandoned)
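
A toy sketch of how license filtering and quality signals might be combined when selecting repositories; the field names and thresholds are invented for illustration, and real pipelines use far richer signals:

Code
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}

def keep_repository(repo: dict) -> bool:
    """Toy filter combining license and quality signals (thresholds are illustrative)."""
    if repo.get("license", "").lower() not in PERMISSIVE_LICENSES:
        return False                      # license filtering
    if repo.get("stars", 0) < 5:
        return False                      # popularity signal
    if not repo.get("has_tests", False):
        return False                      # presence of tests
    if repo.get("years_since_last_commit", 0) > 3:
        return False                      # recency: skip abandoned projects
    return True

repos = [
    {"name": "useful-lib", "license": "MIT", "stars": 120,
     "has_tests": True, "years_since_last_commit": 0},
    {"name": "old-dump", "license": "GPL-3.0", "stars": 2,
     "has_tests": False, "years_since_last_commit": 6},
]
print([r["name"] for r in repos if keep_repository(r)])   # ['useful-lib']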

Stage 2: General Pre-training

Most code models start from a general language model checkpoint (or train jointly on code + text):

Why include natural language:

  • Documentation understanding
  • Reasoning in natural language
  • Instruction following
  • Code-comment relationships

Mixture strategies:

  • StarCoder: ~80% code, ~20% GitHub issues/docs
  • Llama 3: Up to 35% code in mixture
  • Code-specific models often use higher code ratios

Stage 3: Code-Specific Pre-training

Extended training on code-heavy data:

Curriculum considerations:

  • Start with high-quality repositories
  • Include diverse languages
  • Balance popular languages with long-tail
  • Include documentation alongside code

FIM integration: Typically 50% of examples use FIM format, 50% use standard left-to-right.

Stage 4: Instruction Tuning

Fine-tune on code-specific instructions:

Instruction types:

  • "Write a function that..."
  • "Fix this bug..."
  • "Explain this code..."
  • "Refactor to improve..."
  • "Add tests for..."
  • "Convert from Python to JavaScript..."

Data sources:

  • Human-written instruction-response pairs
  • LLM-generated (using GPT-4 or similar)
  • Derived from commits (commit message = instruction, diff = response)
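
As an illustration of the commit-derived route, a pair might be assembled roughly like this; the schema and field names are hypothetical:

Code
def commit_to_instruction_pair(commit: dict) -> dict:
    """Turn a commit into an instruction-tuning example (schema is illustrative)."""
    return {
        "instruction": f"{commit['message']}\n\nOriginal code:\n{commit['before']}",
        "response": commit["after"],      # or the unified diff itself
    }

example = commit_to_instruction_pair({
    "message": "Fix off-by-one error in pagination",
    "before": "pages = total // page_size",
    "after": "pages = (total + page_size - 1) // page_size",
})
print(example["instruction"])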

Stage 5: Alignment (Optional)

RLHF or DPO on code preferences:

Preference signals:

  • Correctness (does it pass tests?)
  • Style (follows conventions?)
  • Efficiency (reasonable complexity?)
  • Safety (avoids vulnerabilities?)
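
The correctness signal is often literal execution. A minimal sketch, assuming candidates are short self-contained Python snippets and ignoring the sandboxing a real pipeline would need:

Code
import subprocess
import sys
import tempfile

def passes_tests(solution: str, test_code: str, timeout: int = 10) -> bool:
    """Binary correctness signal: does the candidate pass its unit tests?

    Runs untrusted code in a subprocess; a real pipeline would add sandboxing.
    """
    program = solution + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

solution = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(solution, tests))   # True if the candidate is correct
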
Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CODE MODEL TRAINING PIPELINE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  STAGE 1: BASE MODEL                                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Option A: Start from general LLM (Llama, Qwen base)                 │   │
│  │ Option B: Train from scratch on code-heavy mixture                  │   │
│  │                                                                       │   │
│  │ Data: General text + code (varying ratios)                           │   │
│  │ Objective: Standard next-token prediction                            │   │
│  │ Duration: Weeks on large GPU clusters                                │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│                              ▼                                              │
│  STAGE 2: CODE-SPECIFIC PRE-TRAINING                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Continue training on code-heavy corpus                               │   │
│  │                                                                       │   │
│  │ Data:                                                                 │   │
│  │   - GitHub repositories (filtered by quality, license)              │   │
│  │   - 338+ programming languages                                       │   │
│  │   - Documentation and comments                                        │   │
│  │   - Technical Q&A (Stack Overflow)                                   │   │
│  │                                                                       │   │
│  │ Training modifications:                                               │   │
│  │   - 50% FIM (Fill-in-the-Middle) format                              │   │
│  │   - Extended context training (16K → 128K)                           │   │
│  │   - Repository-level context (multiple files)                        │   │
│  │                                                                       │   │
│  │ Duration: Days to weeks                                               │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│                              ▼                                              │
│  STAGE 3: INSTRUCTION TUNING                                                │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Fine-tune on instruction-following data                              │   │
│  │                                                                       │   │
│  │ Data:                                                                 │   │
│  │   - Code generation instructions                                     │   │
│  │   - Bug fixing examples                                               │   │
│  │   - Code explanation tasks                                            │   │
│  │   - Refactoring demonstrations                                        │   │
│  │   - Multi-turn coding conversations                                  │   │
│  │                                                                       │   │
│  │ Duration: Hours to days                                               │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│                              ▼                                              │
│  STAGE 4: ALIGNMENT (Optional)                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ RLHF or DPO on code preferences                                      │   │
│  │                                                                       │   │
│  │ Preference signals:                                                   │   │
│  │   - Correctness (execution, tests)                                   │   │
│  │   - Style and conventions                                             │   │
│  │   - Security (no vulnerabilities)                                    │   │
│  │   - Efficiency                                                        │   │
│  │                                                                       │   │
│  │ Duration: Hours to days                                               │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                              │
│                              ▼                                              │
│                    Production Code Model                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

The 2025 Code Model Landscape

DeepSeek Coder

DeepSeek Coder V2 from DeepSeek AI represents the current frontier of open-source code models:

Architecture:

  • Mixture of Experts (MoE) based on DeepSeekMoE
  • Two sizes: 16B (2.4B active) and 236B (21B active)
  • 128K context window (extended from 16K)
  • 338 programming languages supported

Training:

  • Pre-trained from DeepSeek-V2 intermediate checkpoint
  • Additional 6 trillion tokens of code-heavy training
  • Extensive FIM training

Performance:

  • Achieves GPT-4 Turbo-level performance on code benchmarks
  • Strong on HumanEval, MBPP, and real-world coding tasks
  • Efficient inference due to MoE architecture

Strengths: Frontier performance, efficient inference, massive context window
Considerations: Large model size; MoE requires compatible infrastructure

Qwen Coder

Qwen 2.5 Coder from Alibaba's Qwen team leads many code benchmarks:

Model sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B

Performance highlights:

  • 7B model scores 88.4% on HumanEval (exceeds GPT-4's 87.1%)
  • 82.0% on Spider SQL benchmark (vs. 76.6% for Codestral)
  • Strong real-world coding reports from developers

Strengths:

  • Excellent quality at smaller sizes (7B punches above its weight)
  • Strong SQL/database capabilities
  • Consistent, reliable code generation
  • Multilingual (code + natural language)

Developer feedback: "Qwen Coder had absolutely no problem producing code, completing tasks perfectly. Even variations of prompts multiple times showed that it wasn't a fluke—Qwen delivered every time."

CodeLlama

CodeLlama from Meta, based on Llama 2:

Variants:

  • CodeLlama (base): General code model
  • CodeLlama-Python: Python-specialized
  • CodeLlama-Instruct: Instruction-tuned

Sizes: 7B, 13B, 34B, 70B

Features:

  • Extended context to 100K tokens
  • FIM capability
  • Strong infilling performance

Status: Somewhat superseded by newer models but well-documented and widely deployed.

StarCoder / StarCoder2

StarCoder2 from BigCode (open collaboration):

Training data: The Stack v2 (67TB of permissively licensed code)

Sizes: 3B, 7B, 15B

Features:

  • Fully open (weights, data, training code)
  • Grouped Query Attention
  • FIM training

Strengths: Complete transparency, ethical data sourcing, strong community

Codestral

Codestral from Mistral AI:

Size: 22B parameters

Performance: 81.1% on HumanEval

Features:

  • 32K context window
  • Supports 80+ languages
  • FIM capability

Strengths: Good balance of size and capability, Mistral ecosystem integration

Code
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CODE MODEL COMPARISON (2025)                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  MODEL            │ SIZE       │ HUMANEVAL │ CONTEXT │ KEY STRENGTH         │
│  ────────────────────────────────────────────────────────────────────────   │
│                                                                             │
│  Qwen 2.5 Coder   │ 7B-32B     │ 88.4%     │ 128K    │ Best small-model     │
│                   │            │ (7B)      │         │ perf, SQL strength   │
│                                                                             │
│  DeepSeek Coder   │ 16B/236B   │ GPT-4     │ 128K    │ Frontier quality,    │
│  V2               │ (MoE)      │ level     │         │ 338 languages        │
│                                                                             │
│  Codestral        │ 22B        │ 81.1%     │ 32K     │ Balanced, Mistral    │
│                   │            │           │         │ ecosystem            │
│                                                                             │
│  CodeLlama        │ 7B-70B     │ ~75%      │ 100K    │ Well-documented,     │
│                   │            │ (34B)     │         │ mature               │
│                                                                             │
│  StarCoder2       │ 3B-15B     │ ~70%      │ 16K     │ Fully open, ethical  │
│                   │            │ (15B)     │         │ data sourcing        │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                             │
│  PROPRIETARY REFERENCES:                                                    │
│                                                                             │
│  GPT-4            │ Unknown    │ 87.1%     │ 128K    │ Strong reasoning     │
│  GPT-4o           │ Unknown    │ 90.2%     │ 128K    │ Current OpenAI best  │
│  Claude 3.5       │ Unknown    │ 92%+      │ 200K    │ Agentic coding       │
│  Sonnet           │            │           │         │                      │
│                                                                             │
│  Note: HumanEval is one benchmark; real-world performance varies.           │
│  LiveCodeBench and SWE-bench provide more realistic evaluation.             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Evaluation: Beyond HumanEval

HumanEval (synthesizing functions from docstrings) is the most cited benchmark, but it has limitations:

HumanEval Limitations

Too easy: Top models saturate HumanEval. It no longer differentiates frontier models.

Not realistic: Short, self-contained functions don't represent real coding work.

Narrow scope: Python-focused, doesn't test cross-file reasoning.
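
For context on how these percentages are computed: HumanEval results are reported as pass@k, typically via the unbiased estimator over n sampled completions per problem, of which c pass the tests. A minimal sketch:

Code
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0                     # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples for one problem, 37 of them correct
print(round(pass_at_k(n=200, c=37, k=1), 3))    # ~0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))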

Better Benchmarks

MBPP (Mostly Basic Python Problems): More problems, similar format to HumanEval.

LiveCodeBench: Real competitive programming problems, updated regularly to prevent contamination.

SWE-bench: Real GitHub issues. Models must navigate repositories, understand context, and produce correct patches. Much harder and more realistic.

Spider: SQL generation from natural language. Tests database understanding.

HumanEval+: Enhanced HumanEval with more test cases, catching edge case failures.

BigCodeBench: Tests function calls to libraries, more practical than self-contained algorithms.

Evaluation Dimensions

Beyond accuracy, consider:

Latency: How fast can the model generate code?

Cost: Token cost for your volume.

Context handling: Does it use long context effectively?

Multi-file reasoning: Can it navigate codebases?

Instruction following: Does it follow specific requirements?

Code style: Does it match your conventions?

Practical Deployment Considerations

Choosing a Model

For IDE autocomplete (latency-critical):

  • Smaller, faster models (Qwen 2.5 Coder 7B, StarCoder2 7B)
  • Local deployment if privacy matters

For complex generation (quality-critical):

  • Larger models (Qwen 2.5 Coder 32B, DeepSeek Coder V2)
  • Can tolerate higher latency

For agentic coding (multi-step reasoning):

  • Largest available models
  • Long context essential
  • Often Claude 3.5 Sonnet or GPT-4o for best results

For specialized domains:

  • Fine-tune on domain-specific code
  • Or use larger general models with domain context

Inference Optimization

Code models benefit from standard LLM optimization:

Quantization: 4-bit or 8-bit quantization for memory efficiency. AWQ, GPTQ, or GGUF formats.

Speculative decoding: Small draft model accelerates large model generation.

KV cache optimization: Essential for long context (vLLM's paged attention).

Batching: Continuous batching for multi-user deployments.
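
A minimal serving sketch that combines several of these: vLLM with an AWQ-quantized checkpoint and its paged KV cache handling long context. The model name and settings are illustrative; substitute whatever quantized checkpoint you actually deploy:

Code
from vllm import LLM, SamplingParams

# Model name and quantization setting are illustrative; use the AWQ/GPTQ
# checkpoint you actually deploy.
llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct-AWQ", quantization="awq",
          max_model_len=32768)

params = SamplingParams(temperature=0.2, max_tokens=256)
prompt = "Write a Python function that parses an ISO-8601 date string."
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)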

Context Management

Long context doesn't mean infinite context:

Retrieval augmentation: Retrieve relevant files/functions instead of stuffing everything.

Hierarchical context: Repository structure summary + detailed relevant files.

Sliding window: For very long files, maintain relevant window.
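
A toy sketch of hierarchical context assembly: a repository structure summary plus the few most relevant files, with keyword overlap standing in for a real retriever (embeddings, symbol indexes, and so on). Paths and scoring are illustrative:

Code
from pathlib import Path

def build_context(repo_root: str, query: str, max_files: int = 3) -> str:
    """Toy hierarchical context: repo file tree + the few most relevant files."""
    files = list(Path(repo_root).rglob("*.py"))
    tree = "\n".join(str(p.relative_to(repo_root)) for p in files)

    def score(path: Path) -> int:
        # Naive relevance: count query keywords in the file text.
        text = path.read_text(errors="ignore").lower()
        return sum(text.count(word) for word in query.lower().split())

    relevant = sorted(files, key=score, reverse=True)[:max_files]
    bodies = "\n\n".join(
        f"# {p.relative_to(repo_root)}\n{p.read_text(errors='ignore')}" for p in relevant
    )
    return f"Repository structure:\n{tree}\n\nRelevant files:\n{bodies}"

# context = build_context("path/to/repo", "pagination bug in list endpoint")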

The Future of Code Models

Several trends are shaping the future:

Repository-Level Understanding

Moving beyond file-level to understanding entire codebases:

  • Cross-file dependencies
  • Project architecture
  • Test-code relationships
  • CI/CD integration

Agentic Coding

Models that can:

  • Navigate codebases autonomously
  • Run tests and iterate
  • Use tools (linters, type checkers, search)
  • Plan multi-step implementations

SWE-bench is the benchmark pushing this frontier.

Formal Verification Integration

Combining LLMs with formal methods:

  • Type-directed generation
  • Proof-guided synthesis
  • Specification-to-code pipelines

Personalization

Models that learn individual/team preferences:

  • Coding style adaptation
  • Framework preferences
  • Architecture patterns

Real-Time Collaboration

Integration with development workflows:

  • PR review assistance
  • Continuous code improvement suggestions
  • Documentation generation


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
