Open-Source LLMs: The Complete 2025 Guide
A comprehensive guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.
The Open-Source Revolution
Open-source models have reached a critical inflection point. They now rival—and sometimes exceed—proprietary alternatives in many tasks while offering complete control over deployment, customization, and costs.
Why the gap closed so quickly: In 2023, open-source models were clearly inferior—GPT-4 seemed unreachable. Two factors changed this: (1) Training recipes became public. Llama 2's technical report, InstructGPT's RLHF paper, and countless fine-tuning guides democratized the "secret sauce." (2) Scale became accessible. Cloud providers offer H100 clusters at reasonable rates; training a 70B-class model costs on the order of $10-20M in rented compute. The frontier is still proprietary, but the gap is months, not years.
The strategic calculus has flipped: When open models were 2+ years behind, the choice was easy—use APIs for quality. Now the calculation is different. At 100K+ requests/day, self-hosting costs 10x less. For sensitive data, on-premises is the only option. For latency-critical applications, local inference wins. Open-source went from "not good enough" to "often the right choice."
From research: "Leading open-source models like Llama 4 Maverick, DeepSeek V3.2, Mistral Large 3, GPT-OSS-120B, Kimi K2 Thinking, and GLM-4.7 now match or exceed GPT-4o and Claude 3.5 Sonnet on most benchmarks, with some beating GPT-5 and Claude 4.5 on specific tasks."
2025 highlights:
- Llama 4 introduces MoE architecture with 10M context (Scout) and matches DeepSeek V3 at half the active parameters
- Qwen3 brings hybrid reasoning modes—"thinking" and "non-thinking"—under Apache 2.0, trained on 36T tokens in 119 languages
- DeepSeek V3.2 is reasoning-first for agents; V3.2-Speciale wins gold at IMO/IOI 2025
- Mistral Large 3 (675B MoE, 41B active) is the strongest Apache 2.0 model outside China with 256K context
- GPT-OSS-120B marks OpenAI's return to open-source—first open weights since GPT-2 (2019), Apache 2.0
- Kimi K2 (1T params, 32B active) from Moonshot AI sets new agentic benchmarks; K2 Thinking beats GPT-5 and Claude 4.5 Sonnet
- GLM-4.7 from Zhipu AI achieves the highest open-source SWE-bench Verified score (73.8%), with subscription access from $3/month
- Grok 2.5 from xAI open-sourced (270B), with Grok 3 expected to follow
- Falcon H1 (TII) brings hybrid Transformer+Mamba architecture with 262K context; 34B matches 70B models
This guide provides everything you need to choose, deploy, and optimize open-source LLMs for production use.
Why Open-Source Matters
The Strategic Case
| Benefit | Description | Impact |
|---|---|---|
| Cost at Scale | No per-token API fees | 10-100x cheaper at high volume |
| Data Privacy | Data never leaves your infrastructure | Regulatory compliance, IP protection |
| Customization | Full fine-tuning capability | Domain-specific optimization |
| Control | No vendor lock-in or API changes | Business continuity |
| Transparency | Inspect weights and behavior | Debugging, safety analysis |
| Latency | Local inference, no network round-trip | Real-time applications |
When to Use Open-Source
Strong fit:
- High-volume applications (>100K requests/day)
- Sensitive data that can't leave your infrastructure
- Need for fine-tuning on proprietary data
- Edge deployment or offline operation
- Cost-sensitive applications
- Applications requiring low latency (<100ms)
Consider proprietary APIs when:
- Prototyping and MVP development
- Variable, unpredictable load
- Need absolute frontier capabilities
- Limited ML engineering resources
- Rapid iteration more important than cost
Cost Comparison Example
For 1 million tokens/day (30M tokens/month):
| Option | Monthly Cost | Notes |
|---|---|---|
| GPT-4o API | ~$450 | $5/1M input, $15/1M output |
| Claude 3.5 Sonnet | ~$270 | $3/1M input, $15/1M output |
| Self-hosted Llama 70B | ~$150-300 | A100 cloud instance |
| Self-hosted on-prem | $50-100 | After hardware amortization |
At 10M tokens/day, self-hosting typically saves 5-10x.
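To sanity-check these numbers against your own workload, the break-even math is simple enough to script. A minimal sketch follows; the prices are illustrative placeholders, not vendor quotes.
# Rough API-vs-self-hosted break-even estimate (all prices are placeholders)
def monthly_api_cost(tokens_per_day: float, price_per_million: float) -> float:
    return tokens_per_day * 30 / 1_000_000 * price_per_million

def monthly_gpu_cost(hourly_rate: float, gpus: int = 1) -> float:
    return hourly_rate * gpus * 24 * 30   # assumes the instance runs around the clock

api = monthly_api_cost(10_000_000, price_per_million=15.0)   # 10M tokens/day at $15/1M
hosted = monthly_gpu_cost(hourly_rate=4.0)                   # one A100-class instance at ~$4/hr
print(f"API ~${api:,.0f}/mo vs self-hosted ~${hosted:,.0f}/mo")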
The Major Model Families
Llama (Meta)
Meta's Llama series has become the de facto standard for open-source LLMs, with the broadest ecosystem support.
Model Lineup
Llama 4 (April 2025): From Meta: "Llama 4 is our most intelligent model family, designed to enable people to build more personalized multimodal experiences."
The first Llama family using Mixture-of-Experts (MoE) architecture:
- Llama 4 Scout: 17B active / 109B total parameters (16 experts), 10M context window
- Llama 4 Maverick: 17B active / 400B total parameters (128 experts), 1M context window
- Llama 4 Behemoth: 288B active / ~2T total parameters (in training)
From Meta: "Scout is the best multimodal model in the world in its class, fitting on a single NVIDIA H100. Maverick beats GPT-4o and Gemini 2.0 Flash across benchmarks while achieving results comparable to DeepSeek V3 at less than half the active parameters."
⚠️ EU Restriction: Llama 4 is prohibited for users domiciled in the EU due to AI and data privacy regulations.
Llama 3.3 70B (December 2024): From Meta: "Offers performance comparable to the 405B parameter model at a fraction of the computational cost."
- 70B parameters
- 128K context window
- Instruction-tuned
- Multilingual support (8 languages)
Llama 3.2 (September 2024):
- Vision models: 11B and 90B with image understanding
- Edge models: 1B and 3B for mobile/edge deployment
- Multimodal capabilities integrated
Llama 3.1 (July 2024):
- 8B, 70B, and 405B parameter versions
- 405B was the largest open-weight model at release and remains the largest dense one
- 128K context window
- Improved multilingual and tool use
Benchmarks
| Benchmark | Llama 3.3 70B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMLU (5-shot) | 86.0 | 88.7 | 88.7 |
| HumanEval | 88.4 | 90.2 | 92.0 |
| GSM8K | 93.0 | 95.3 | 96.4 |
| MATH | 77.0 | 76.6 | 78.3 |
Strengths
- Ecosystem: Widest tool support (vLLM, llama.cpp, TensorRT-LLM, all major frameworks)
- Documentation: Best-documented training process in open-source
- Fine-tuning: Extensive fine-tuning ecosystem (Axolotl, LLaMA-Factory, TRL)
- Stability: Most battle-tested in production
- Community: Largest community, most available fine-tunes
Weaknesses
- License restrictions: 700M MAU limit for commercial use
- Multilingual: Good but not best-in-class (8 languages vs Qwen's 29+)
- Reasoning: Trails DeepSeek R1 on complex reasoning
License
Llama 3 Community License:
- Free for commercial use under 700M monthly active users
- Above threshold requires separate commercial license from Meta
- Allows modification and redistribution
- Requires attribution
Llama 4 License:
- Free for commercial use under 700M monthly active users
- EU users prohibited from using or distributing Llama 4 models
- Multimodal capabilities (Scout, Maverick) under same restrictions
Best For
- General-purpose chat applications
- Production deployments needing stability
- Teams wanting maximum ecosystem support
- RAG systems with good general knowledge
- Applications needing vision capabilities (Llama 3.2)
Qwen (Alibaba)
Alibaba's Qwen family excels in multilingual support and specialized variants, with fully permissive licensing.
Model Lineup
Qwen3 (April 2025): From Alibaba: "Qwen3 introduces hybrid reasoning—models can 'think' through complex problems or answer quickly, matching OpenAI's o3 reasoning capability."
- Qwen3-235B-A22B: 235B total, 22B active (MoE), 128K context
- Qwen3-32B: Dense model, 128K context
- Qwen3-14B, 8B, 4B, 1.7B, 0.6B: Smaller variants
- Trained on 36 trillion tokens in 119 languages
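The hybrid thinking mode described above is toggled per request through the chat template. A minimal sketch with Hugging Face Transformers follows; the enable_thinking flag reflects Qwen3's published usage, but check the model card for the variant you deploy.
# Qwen3 hybrid thinking: flip enable_thinking per request
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"   # smaller sibling used here for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are below 50?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # True: emit <think>...</think> reasoning; False: answer directly
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=512)[0], skip_special_tokens=True))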
Qwen3 Latest (Late 2025):
- Qwen3-2507: 1M token context support (August 2025)
- Qwen3-Max: Outperforms Claude 4 Opus non-thinking, DeepSeek V3.1 (September 2025)
- Qwen3-Next: Apache 2.0 licensed Instruct and Thinking models (September 2025)
- Qwen3-Coder-480B-A35B: 480B MoE with 35B active—most agentic code model
Qwen2.5 (September 2024):
- Qwen2.5-72B: Flagship dense model
- Qwen2.5-Coder: State-of-the-art code model
- Qwen2.5-Math: Mathematical reasoning specialist
- Trained on 29+ languages
Qwen2.5-VL (Vision-Language):
- 3B, 7B, 72B vision-language models
- Native image and video understanding
- Document/chart analysis
Benchmarks
| Benchmark | Qwen2.5-72B | Qwen3-235B | Llama 3.3 70B |
|---|---|---|---|
| MMLU | 86.1 | 88.2 | 86.0 |
| HumanEval | 86.6 | 89.5 | 88.4 |
| MATH | 83.1 | 86.7 | 77.0 |
| Multilingual Avg | 75.6 | 79.3 | 68.2 |
Strengths
- Multilingual: Best-in-class support for 29+ languages
- Specialized models: Coder, Math, VL variants
- MoE efficiency: 22B active parameters with 235B quality
- License: Fully Apache 2.0 (no restrictions)
- Asian languages: Best Chinese, Japanese, Korean support
Weaknesses
- Ecosystem: Smaller than Llama ecosystem
- Documentation: Less extensive than Meta's
- Community: Fewer available fine-tunes
License
Apache 2.0:
- Fully permissive commercial use
- No user limits
- Only obligation is retaining the license notice
- Can modify and redistribute freely
Best For
- Multilingual applications
- Coding assistants (Qwen2.5-Coder)
- Mathematical reasoning (Qwen2.5-Math)
- Asian market applications
- Teams wanting fully permissive licensing
- Vision-language applications (Qwen2.5-VL)
DeepSeek
Chinese AI lab that achieved breakthrough reasoning capabilities with innovative training approaches.
Model Lineup
DeepSeek-V3.2 (December 2025): From DeepSeek: "V3.2 is our reasoning-first model built for agents, performing comparably to GPT-5."
- 671B total parameters (37B active)
- DSA (DeepSeek Sparse Attention): Efficient attention for long-context scenarios
- First DeepSeek model to integrate thinking directly into tool-use
- Massive agent training: 1,800+ environments, 85k+ complex instructions
DeepSeek-V3.2-Speciale:
- High-compute variant surpassing GPT-5, on par with Gemini 3 Pro
- Gold medal in 2025 IMO and IOI, CMO, ICPC World Finals
- Designed exclusively for deep reasoning (no tool-calling)
- MIT License
DeepSeek-V3.1 (August 2025): From DeepSeek: "V3.1 combines the strengths of V3 and R1 into a single hybrid model with 'thinking' and 'non-thinking' modes."
- 671B total parameters (37B active)
- 128K context window
- Hybrid thinking mode: Switch between chain-of-thought reasoning (like R1) and direct answers (like V3) via chat template
- One model covers both general-purpose and reasoning-heavy use cases
DeepSeek-V3 (December 2024):
- 671B total parameters (MoE)
- 37B active parameters per token
- Pre-training cost: $5.6M (remarkably efficient)
- Strong general capabilities
DeepSeek-R1-0528 (May 2025): From DeepSeek: "R1-0528 features major improvements in inference and hallucination reduction, with performance approaching O3 and Gemini 2.5 Pro."
- Updated reasoning model with structured JSON output
- Built-in function-calling capabilities
- Improved math, code, and logic benchmarks
- 671B total, 37B active
DeepSeek-R1 (January 2025): From DeepSeek: "DeepSeek-R1 matches OpenAI-o1-0912 on math benchmarks with open weights."
- Reasoning model trained with pure RL (GRPO)
- Extended thinking capabilities
- Shows chain-of-thought reasoning in <think> tags
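Applications typically strip the <think> block before showing the final answer to users. A minimal, model-agnostic helper (it assumes the tags appear verbatim in the output text):
import re

def split_reasoning(text: str) -> tuple[str, str]:
    # Separate R1-style <think>...</think> reasoning from the final answer
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>49 is 7*7, so not prime...</think>There are 15 primes below 50.")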
DeepSeek-R1-Distill Series:
- Distilled versions at 1.5B, 7B, 8B, 14B, 32B, 70B
- Retain significant reasoning capability
- Run on consumer hardware
From research: "DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks."
DeepSeek-Coder-V2:
- 236B MoE (21B active)
- State-of-the-art code understanding
- 128K context window
Benchmarks
| Benchmark | DeepSeek-R1 | o1-preview | DeepSeek-V3 |
|---|---|---|---|
| AIME 2024 | 79.8% | 74.4% | 39.2% |
| MATH-500 | 97.3% | 96.4% | 90.2% |
| Codeforces | 2029 Elo | 1891 Elo | 1134 Elo |
| MMLU | 90.8 | 90.8 | 88.5 |
Strengths
- Reasoning: Best-in-class mathematical and logical reasoning
- Code: Excellent code generation and understanding
- Efficiency: Remarkably cost-effective training
- Transparency: Published detailed training methodology
- Distillation: High-quality smaller models available
Weaknesses
- Inference cost: Full R1 requires significant compute
- Latency: Reasoning models produce many tokens
- Language focus: Stronger in English/Chinese than other languages
License
DeepSeek License (MIT-like):
- Free for commercial use
- Some restrictions on specific use cases
- Open weights available
Best For
- Complex reasoning tasks
- Mathematical problem-solving
- Code generation and analysis
- Research and analysis applications
- Teams willing to trade latency for accuracy
Mistral
French AI company known for efficient, high-quality models with strong European presence. With Mistral 3, they've released the strongest fully open-weight model (Apache 2.0) developed outside China.
Model Lineup
Mistral Large 3 (December 2025): From Mistral: "Mistral Large 3 is our most capable model to date—a sparse mixture-of-experts trained with 41B active and 675B total parameters."
- 675B total parameters (MoE), 41B active
- 256K context window
- Native vision capabilities (image analysis)
- Multilingual: English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic
- Best-in-class agentic capabilities with native function calling and JSON output
- Apache 2.0 license (fully open)
- Trained from scratch on 3000 NVIDIA H200 GPUs
- Top open-source coding model on LMArena leaderboard
Mistral 3 Small Models (December 2025):
- Mistral 3 14B, 8B, 3B dense models
- All Apache 2.0 licensed
- Optimized for efficiency
Mistral Large 2 (July 2024):
- 123B parameters
- 128K context window
- Strong function calling
- Multilingual (dozens of languages)
Mistral Small 3 (January 2025): From Mistral: "Mistral Small 3 (24B) achieves state-of-the-art capabilities comparable to larger models."
- 24B parameters
- Optimized for efficiency
- Strong reasoning for size
Mixtral 8x22B:
- MoE architecture: 141B total, 39B active
- Strong general performance
- Good efficiency
Codestral (May 2024):
- 22B parameters
- Specialized for code
- 32K context window
Ministral (October 2024):
- 3B and 8B models
- Edge deployment focus
- Strong for size
Benchmarks
| Benchmark | Mistral Large 3 | Mistral Large 2 | Mistral Small 3 |
|---|---|---|---|
| MMLU | 88.2 | 84.0 | 81.0 |
| HumanEval | 94.3 | 92.1 | 84.5 |
| MATH | 82.1 | 76.5 | 69.4 |
| GSM8K | 95.8 | 91.0 | 86.2 |
Strengths
- Apache 2.0: Mistral Large 3 and Mistral 3 models are fully open (no restrictions)
- Function calling: Best-in-class structured output and agentic capabilities
- Efficiency: 41B active parameters with 675B-class quality
- European: GDPR considerations, EU-based, no regional restrictions
- Vision: Native multimodal capabilities in Large 3
- Production-ready: Focus on deployment, NVFP4 checkpoint available
Weaknesses
- Hardware requirements: Large 3 needs 8×H100 or Blackwell NVL72 for full precision
- Smaller ecosystem: Less community activity than Llama
- Legacy licensing: Older models (Codestral) have commercial restrictions
License
Mistral 3 Family: Apache 2.0 (fully permissive)
- Mistral Large 3
- Mistral 3 14B, 8B, 3B
Legacy models:
- Apache 2.0: Mistral 7B, Mixtral
- Commercial: Codestral, older Large versions
Best For
- European deployments (GDPR compliance, no regional restrictions)
- Function calling / structured output / agentic applications
- Teams wanting frontier performance with Apache 2.0 license
- Production systems needing vision + text capabilities
- Organizations requiring fully open weights outside China
Other Notable Models
Kimi K2 (Moonshot AI)
State-of-the-art agentic model from Beijing startup (Alibaba-backed). One of the most cost-effective frontier models available.
Kimi K2 (July 2025): From Moonshot: "Kimi K2 is designed for tool use, reasoning, and autonomous problem-solving."
- 1 trillion total parameters, 32B active (MoE)
- State-of-the-art among non-thinking models for knowledge, math, coding
- Trained with MuonClip optimizer on 15.5T tokens with zero instability
- Pre-trained at unprecedented 1T scale with novel optimization techniques
- SWE-bench Verified: 65.8% pass@1 (single-attempt patches)
- SWE-bench Multilingual: 47.3% pass@1
- Surpasses Claude Opus 4 on two benchmarks
- Better overall performance than GPT-4.1 on coding metrics
- Modified MIT license
- Pricing: $0.15/1M input, $2.50/1M output tokens (input roughly 100x cheaper than Claude Opus 4)
Kimi K2 Thinking (November 2025): From Moonshot: "The most powerful open source thinking model in the Kimi series to date."
- Outperforms GPT-5, Claude Sonnet 4.5 (Thinking), and Grok-4 on reasoning, coding, and agentic benchmarks
- Executes 200-300 sequential tool calls autonomously to complete long-horizon tasks
- Training cost: $4.6M
- Fully open-source despite beating proprietary competitors
- Top position in reasoning and coding evaluations
Model Variants:
- Kimi-K2-Base: Foundation model for researchers wanting full control for fine-tuning
- Kimi-K2-Instruct: Post-trained model for general-purpose chat and agentic experiences
GLM-4 Series (Zhipu AI / Z.ai)
China's answer to Claude Code, open-sourced from Tsinghua University spinoff. Rapidly evolving series with strong coding and vision capabilities.
GLM-4.7 (December 22, 2025): From Z.ai: "GLM-4.7 achieves the highest SWE-bench Verified score among open-source models at 73.8%."
- ~400B parameters, 200K context, 128K output
- Open-weight 32B and 9B variants (base, reasoning, rumination)
- SWE-bench Verified: 73.8% (highest open-source)
- LiveCodeBench: 84.9% (beats Claude Sonnet 4.5)
- AIME 2025: 95.7%
- HLE (Humanity's Last Exam): 42.8% (outperforms GPT-5.1)
- Code Arena: Rank #1 among open-source and Chinese models
- "Preserved Thinking": Maintains reasoning chains across multiple turns (addresses biggest frustration in AI-assisted coding)
- $3/month or free locally via HuggingFace/ModelScope with vLLM or SGLang
Agentic capabilities:
- BrowseComp: 67.5 (web tasks)
- τ²-Bench: 87.4 (interactive tool use)—new open-source SOTA, surpasses Claude Sonnet 4.5
GLM-4.6V (December 2025): From Z.ai: "A 128K context vision-language model with native tool calling."
- GLM-4.6V: 106B parameters for cloud-scale inference
- GLM-4.6V-Flash: 9B parameters for low-latency, local applications
- 128K context window
- Native multimodal function calling: Images, screenshots, and document pages pass directly as tool parameters
- Tools can return visual outputs (search grids, charts, web pages, product images)
- Model fuses visual outputs with text in the same reasoning chain
- Optimized for frontend automation and multimodal reasoning
GLM-4.5 (Mid-2025):
- 355B total parameters, 32B active (MoE)
- Supports reasoning, tool use, coding, and agentic behaviors
- GLM-4.5-Air: Smaller sibling for efficiency
- Fast despite large parameter count due to MoE
MiniMax
Chinese startup with innovative attention mechanisms. Pioneering hybrid-attention and long-context reasoning.
MiniMax M2.1 (December 25, 2025): From MiniMax: "Significantly enhanced multi-language programming, built for real-world complex tasks."
- Sparse MoE architecture with 10B active parameters
- 204,800 token context window
- First open-source model with Advanced Interleaved Thinking (separates reasoning from response)
- Multi-language programming: Rust, Java, Go, C++, Kotlin, Objective-C, TypeScript, JavaScript
- SWE-multilingual: 72.5%
- VIBE aggregate: 88.6 (VIBE-Web: 91.5, VIBE-Android: 89.7)
- Outperforms Claude Sonnet 4.5, approaches Claude Opus 4.5 in multilingual scenarios
- ~10% the price of Claude Sonnet
- 90 tokens/sec on RTX 5090 with vLLM and FP8 weights
- Runs on 4×A100 GPUs
- Modified-MIT license, weights on HuggingFace
- Inference: SGLang or vLLM (temp=1.0, top_p=0.95, top_k=40)
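Those recommended sampling settings map directly onto a vLLM call. A minimal sketch; the repository id below is a placeholder, so check MiniMax's HuggingFace page for the exact name.
# Serving MiniMax M2.1 with the vendor-recommended sampling settings
from vllm import LLM, SamplingParams

llm = LLM(model="MiniMaxAI/MiniMax-M2.1", tensor_parallel_size=4)   # placeholder repo id; 4 GPUs per the note above
params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=2048)
outputs = llm.generate(["Write a Rust function that parses a CSV line."], params)
print(outputs[0].outputs[0].text)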
MiniMax-M1 (June 2025):
- World's first large-scale hybrid-attention reasoning model
- 1M token context (8x DeepSeek R1)
- 80K token reasoning output
- CISPO: RL algorithm 2x faster than DAPO
MiniMax-01:
- 456B total parameters (45.9B active)
- 4M token context (20-32x other models)
Google Gemma
Open version of Gemini technology:
- Gemma 2 (2B, 9B, 27B)
- Strong for size class
- Good efficiency
- Research-friendly license
Microsoft Phi
Quality-focused small models:
- Phi-4 (14B): "Competitive with much larger models"
- Trained primarily on synthetic data
- Excellent for size
- MIT license
HuggingFace SmolLM
Fully transparent small models:
- SmolLM2 (135M, 360M, 1.7B)
- Complete training transparency
- Research-friendly
- Apache 2.0 license
Cohere Command R
Enterprise-focused:
- Command R+ (104B)
- Strong RAG capabilities
- Enterprise features
- Commercial focus
GPT-OSS (OpenAI)
OpenAI's first open-weight release since GPT-2 (2019). A landmark moment for open-source AI.
GPT-OSS-120B (August 2025): From OpenAI: "State-of-the-art open-weight language models that deliver strong real-world performance at low cost."
- 117B total parameters, 5.1B active (MoE)
- Fits on a single 80GB GPU (H100 or AMD MI300X)
- Trained using large-scale distillation and reinforcement learning
- Three reasoning effort levels: low, medium, high (trade-off latency vs performance)
- Full chain-of-thought access for debugging
- Native agentic capabilities: function calling, web browsing, Python execution, structured outputs
- Outperforms o3-mini, matches/exceeds o4-mini on MMLU (90%), GPQA (~80%), AIME 2024/2025
- Apache 2.0 license (fully permissive)
- Available on HuggingFace, GitHub, LM Studio, Ollama
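A minimal local test with vLLM, assuming a recent build with gpt-oss support and a single 80GB GPU; openai/gpt-oss-120b is the HuggingFace repository id.
# Loading GPT-OSS-120B locally (single 80GB GPU per OpenAI's guidance)
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b")
out = llm.generate(["Explain MoE routing in two sentences."], SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)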
GPT-OSS-20B:
- Smaller variant for edge/efficiency use cases
- Same Apache 2.0 license
From Artificial Analysis: "The most intelligent American open weights model."
Grok (xAI)
Elon Musk's xAI open-source releases:
Grok 2.5 (August 2025):
- ~270B parameters
- Trained on text-based reasoning tasks
- Model weights (~500 GB across 42 files) + tokenizer
- Custom license: Grok 2 Community License Agreement
- Allows commercial and non-commercial use
- Restriction: Cannot use to train other AI models
- Revocable license (less permissive than Apache 2.0)
- Grok 3 expected to follow in ~6 months
Grok-1 (March 2024):
- 314B parameter MoE
- Apache 2.0 license (fully permissive)
- Historical significance: Early frontier open-source release
Falcon 3 & H1 (TII)
UAE's Technology Innovation Institute. Focus on efficient models that run on consumer hardware.
Falcon 3 (December 2024):
- Model sizes: 1B, 3B, 7B, 10B (Base and Instruct variants)
- Trained on 14 trillion tokens (2.5x predecessor)
- 32K context (8K for 1B)
- #1 on HuggingFace leaderboard at launch (for size class)
- Beats Llama-3.1-8B, Qwen2.5-7B, Mistral NeMo-12B, Gemma2-9B
- TII Falcon License (Apache 2.0-based, permissive)
- Runs on laptops and light infrastructure
Falcon H1 (2025): From TII: "A family of hybrid-head language models redefining efficiency and performance."
- Model sizes: 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, 34B (+ Instruct variants)
- Hybrid architecture: Transformer attention + State Space Model (Mamba-2)
- 262K context window
- 18 languages: English, Chinese, Japanese, Korean, German, French, Spanish, etc.
- Falcon-H1-34B matches/outperforms 70B models (Qwen3-32B, Qwen2.5-72B, Llama3.3-70B)
- Falcon-H1-1.5B-Deep rivals current 7B-10B models
- Falcon-H1-0.5B performs like 2024's 7B models
- Native support in llama.cpp, axolotl, llama-factory, unsloth
- 55+ million downloads to date
- Permissive open-source license
Comprehensive Benchmark Comparison
General Capabilities
| Model | MMLU | HumanEval | GSM8K | MATH | HellaSwag |
|---|---|---|---|---|---|
| Llama 3.3 70B | 86.0 | 88.4 | 93.0 | 77.0 | 88.6 |
| Qwen2.5-72B | 86.1 | 86.6 | 91.6 | 83.1 | 88.5 |
| DeepSeek-V3 | 88.5 | 82.6 | 89.3 | 90.2 | 88.9 |
| Mistral Large 2 | 84.0 | 92.1 | 91.0 | 76.5 | 87.1 |
| DeepSeek-R1 | 90.8 | 86.7 | 97.3 | 97.3 | - |
| GPT-4o (reference) | 88.7 | 90.2 | 95.3 | 76.6 | 95.3 |
| Claude 3.5 Sonnet | 88.7 | 92.0 | 96.4 | 78.3 | 89.0 |
Coding Benchmarks
| Model | HumanEval | MBPP | LiveCodeBench | SWE-bench |
|---|---|---|---|---|
| DeepSeek-Coder-V2 | 90.2 | 80.4 | 43.4 | 22.0 |
| Qwen2.5-Coder-32B | 92.7 | 83.2 | 42.1 | 23.5 |
| Codestral | 81.1 | 78.2 | 36.2 | 18.3 |
| Llama 3.3 70B | 88.4 | 75.6 | 33.8 | 16.4 |
Reasoning Benchmarks
| Model | AIME 2024 | MATH-500 | GPQA | ARC-C |
|---|---|---|---|---|
| DeepSeek-R1 | 79.8% | 97.3% | 71.5 | 92.3 |
| Qwen3-235B | 54.7% | 86.7% | 59.1 | 89.4 |
| Llama 3.3 70B | 33.3% | 77.0% | 50.7 | 88.1 |
| o1-preview (reference) | 74.4% | 96.4% | 73.3 | - |
Chatbot Arena Rankings (Human Preference)
From LMSYS Chatbot Arena / LMArena (Late 2025). Rankings use 6M+ human votes with Elo-like scoring:
| Rank | Model | Arena Score | Notes |
|---|---|---|---|
| 1 | Claude Opus 4.5 | ~1450 | Proprietary frontier |
| 2 | GPT-5.2 | ~1440 | Proprietary frontier |
| 3 | Gemini 3 Pro | ~1430 | Proprietary frontier |
| 4 | Llama 4 Maverick | 1417 | Open-source |
| 5 | DeepSeek-V3.2 | ~1400 | Open-source |
| 6 | Mistral Large 3 | ~1395 | Open-source, top OS coding |
| 7 | Kimi K2 Thinking | ~1390 | Open-source |
| 8 | Qwen3-235B | ~1385 | Open-source |
| 9 | GLM-4.7 | ~1380 | Open-source |
Key insight: The gap between proprietary and open-source models has narrowed from 17.5 to just 0.3 percentage points on MMLU. Open-source models now achieve 85-90% of frontier performance.
Hardware Requirements
Memory Calculation
LLM memory requirements follow this formula:
Memory (GB) ≈ Parameters (B) × Bytes per Parameter
FP32: 4 bytes/param → 7B model = 28 GB
FP16: 2 bytes/param → 7B model = 14 GB
INT8: 1 byte/param → 7B model = 7 GB
INT4: 0.5 bytes/param → 7B model = 3.5 GB
Add ~20% overhead for KV cache and runtime.
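As a quick sanity check, the formula is easy to turn into a helper; the 20% overhead figure is the rule of thumb above, not a measured constant.
def estimate_memory_gb(params_billion: float, bytes_per_param: float, overhead: float = 0.20) -> float:
    # Weights-only footprint plus ~20% for KV cache and runtime buffers
    return params_billion * bytes_per_param * (1 + overhead)

print(estimate_memory_gb(70, 0.5))   # 70B at INT4 -> ~42 GB, in line with the table below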
Detailed Memory Requirements
| Model Size | FP16 | INT8 | INT4 (AWQ/GPTQ) | GGUF Q4_K_M |
|---|---|---|---|---|
| 1.5B | 3 GB | 1.5 GB | 1 GB | 1.1 GB |
| 7B | 14 GB | 7 GB | 4 GB | 4.4 GB |
| 8B | 16 GB | 8 GB | 4.5 GB | 5 GB |
| 13B | 26 GB | 13 GB | 7 GB | 8 GB |
| 32B | 64 GB | 32 GB | 16 GB | 19 GB |
| 70B | 140 GB | 70 GB | 35 GB | 40 GB |
| 405B | 810 GB | 405 GB | 203 GB | 230 GB |
GPU Recommendations by Use Case
Consumer Hardware
| GPU | VRAM | Can Run | Typical Use |
|---|---|---|---|
| RTX 3060 | 12 GB | 7B INT4, 3B FP16 | Development, small models |
| RTX 3090 | 24 GB | 13B INT4, 7B FP16 | Development, medium models |
| RTX 4090 | 24 GB | 32B INT4, 13B INT8 | Production-capable |
| 2x RTX 4090 | 48 GB | 70B INT4 | Production workloads |
Professional/Enterprise
| GPU | VRAM | Can Run | Typical Use |
|---|---|---|---|
| A10G | 24 GB | 32B INT4, 13B FP16 | Cloud inference |
| L40S | 48 GB | 70B INT4, 32B FP16 | Production inference |
| A100 40GB | 40 GB | 70B INT4, 32B FP16 | Training and inference |
| A100 80GB | 80 GB | 70B FP16 | Training and inference |
| H100 80GB | 80 GB | 70B FP16, optimized | High-throughput |
| 8x H100 | 640 GB | 405B FP16 | Frontier models |
From research: "With a 24 GB GPU (e.g., RTX 3090/4090), you can comfortably run 4-bit quantized versions of models up to ~40B parameters."
From research: "With Pro/Enterprise GPU (e.g., 48-80GB VRAM), you can comfortably run 70B models like Llama 3.1 and Qwen2 72B."
Apple Silicon
| Chip | Unified Memory | Can Run | Notes |
|---|---|---|---|
| M1/M2 (8GB) | 8 GB | 3B-7B Q4 | Basic use |
| M1/M2 Pro (16GB) | 16 GB | 13B Q4, 7B Q8 | Good development |
| M1/M2 Max (32GB) | 32 GB | 32B Q4, 13B Q8 | Serious development |
| M1/M2 Ultra (64GB) | 64 GB | 70B Q4 | Production-capable |
| M3 Max (128GB) | 128 GB | 70B Q8, 405B Q2 | High-end |
Deployment Options
Local Inference Engines
Ollama
Simplest local deployment:
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Run models
ollama pull llama3.3:70b
ollama run llama3.3:70b "Explain transformers"
# API access
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b",
"prompt": "Explain transformers"
}'
Pros: Simplest setup, automatic model management, good defaults. Cons: Less optimization control, Mac/Linux focus
llama.cpp
Maximum performance and control:
# Build (CMake is the supported build system; enable CUDA for NVIDIA GPUs)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Run
./build/bin/llama-cli \
-m models/llama-3.3-70b-instruct-q4_k_m.gguf \
-p "Explain transformers" \
-n 500 \
--temp 0.7 \
-ngl 99 # offload all layers to GPU
Pros: Best CPU performance, extensive quantization options, cross-platform. Cons: Manual model management, more complex setup
vLLM
Production serving with high throughput:
from vllm import LLM, SamplingParams
# Single GPU
llm = LLM(
model="meta-llama/Llama-3.3-70B-Instruct",
quantization="awq",
gpu_memory_utilization=0.9,
)
# Multi-GPU
llm = LLM(
model="meta-llama/Llama-3.3-70B-Instruct",
tensor_parallel_size=2, # 2 GPUs
quantization="awq",
)
# Generate
outputs = llm.generate(
["Explain transformers"],
SamplingParams(temperature=0.7, max_tokens=500)
)
Pros: Highest throughput, PagedAttention, production-ready. Cons: GPU-only, requires more resources
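vLLM also exposes an OpenAI-compatible HTTP server, which is how most production deployments consume it. A minimal sketch; the server flags mirror the Python arguments above.
# Start the server separately with:
#   vllm serve meta-llama/Llama-3.3-70B-Instruct --quantization awq --tensor-parallel-size 2
# then point any OpenAI-compatible client at it:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers"}],
    max_tokens=500,
)
print(resp.choices[0].message.content)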
TensorRT-LLM
Maximum NVIDIA performance:
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
outputs = llm.generate(
["Explain transformers"],
SamplingParams(max_new_tokens=500)
)
Pros: Best NVIDIA performance, optimized kernels. Cons: NVIDIA-only, complex setup
Comparison of Inference Engines
| Engine | Best For | Throughput | Ease of Use |
|---|---|---|---|
| Ollama | Local development | Medium | Very Easy |
| llama.cpp | CPU, Apple Silicon | Medium | Medium |
| vLLM | Production serving | High | Medium |
| TensorRT-LLM | Max NVIDIA perf | Highest | Complex |
| HF Transformers | Flexibility | Low | Easy |
Cloud Deployment
Inference Providers
| Provider | Specialty | Pricing Model |
|---|---|---|
| Together AI | Wide model selection | Per-token |
| Fireworks AI | Speed, function calling | Per-token |
| Anyscale | Scalability | Per-token |
| Replicate | Ease of use | Per-second |
| Modal | Serverless | Per-second |
| Groq | Speed (LPU) | Per-token |
Self-Hosted Cloud
AWS:
# SageMaker endpoint
Instance: ml.g5.12xlarge (4x A10G)
Model: Llama 3.3 70B AWQ
Throughput: ~50 tokens/sec
Cost: ~$7/hour
GCP:
# Vertex AI
Instance: a2-highgpu-1g (A100 40GB)
Model: Qwen2.5-72B GPTQ
Throughput: ~60 tokens/sec
Cost: ~$4/hour
Quantization Formats
| Format | Tool | Precision | Best For |
|---|---|---|---|
| GGUF | llama.cpp | Q2-Q8 | CPU, Apple Silicon, flexibility |
| AWQ | vLLM, TGI | INT4 | NVIDIA GPUs, production |
| GPTQ | vLLM, TGI | INT4 | NVIDIA GPUs, production |
| GGML | Legacy | Q4-Q8 | Deprecated, use GGUF |
| bitsandbytes | HuggingFace | INT8/INT4 | Quick prototyping |
| ExLlamaV2 | Custom | Various | Optimized inference |
Quantization Quality Comparison
Impact on Llama 3.1 70B quality:
| Quantization | MMLU Impact | Speed Gain | Memory Reduction |
|---|---|---|---|
| FP16 (baseline) | 0% | 1x | 0% |
| INT8 | -0.5% | 1.5x | 50% |
| AWQ INT4 | -1.2% | 2x | 75% |
| GPTQ INT4 | -1.5% | 2x | 75% |
| GGUF Q4_K_M | -1.8% | 2.5x | 75% |
| GGUF Q2_K | -5.0% | 3x | 87% |
Licensing Deep Dive
Fully Permissive (Apache 2.0 / MIT)
Apache 2.0:
- Qwen (all models)
- Mistral 3 family (Large 3, 14B, 8B, 3B), plus Mistral 7B and Mixtral
- GPT-OSS (120B and 20B)
- SmolLM
- Gemma (strictly its own Gemma Terms of Use, but broadly permissive)
- Phi (MIT)
What you can do:
- Use commercially without limits
- Modify and create derivatives
- Redistribute freely (retaining the license notice)
- No revenue sharing required
Permissive with Limits
Llama 3 Community License:
- Free for companies with <700M monthly active users
- Must request license above threshold
- Can modify and redistribute
- Must include attribution
DeepSeek License:
- Generally permissive
- Some restrictions on specific use cases
- Review terms for your use case
Research / Non-Commercial
Some models (check individual licenses):
- May restrict commercial use
- May require academic attribution
- May have geographic restrictions
License Selection Guide
| Use Case | Recommended Models |
|---|---|
| Startup/SMB | Qwen (Apache 2.0), Llama (under 700M MAU) |
| Enterprise | Qwen, licensed Llama, Mistral |
| Research | Any (check publication requirements) |
| Products in Asia | Qwen (best Asian language support) |
| EU compliance focus | Mistral (European company) |
Practical Recommendations
By Use Case
General Chat / Customer Service
Recommended: Llama 3.3 70B or Qwen2.5-72B
For general-purpose chat, you need models with strong instruction following and broad knowledge. AWQ quantization reduces memory 4x with minimal quality loss, enabling 70B models on 2 GPUs.
# vLLM serving
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.3-70B-Instruct",
quantization="awq",
tensor_parallel_size=2,
)
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=1024,
top_p=0.9,
)
user_message = "How do I update my shipping address?"  # example user turn
response = llm.generate([user_message], sampling_params)
Why: Best general capabilities, strong instruction following, extensive testing.
Coding Assistant
Recommended: DeepSeek-Coder-V2 or Qwen2.5-Coder-32B
# With fill-in-the-middle
prompt = f"""<fim_prefix>{code_before}<fim_suffix>{code_after}<fim_middle>"""
response = llm.generate(prompt, SamplingParams(
temperature=0.2, # Lower for code
max_tokens=512,
))
Why: Best code understanding, completion, and generation.
Complex Reasoning / Math
Recommended: DeepSeek-R1 or R1-Distill-32B
# Allow extended thinking
response = llm.generate(
"Solve this step by step: [problem]",
SamplingParams(
temperature=0.6,
max_tokens=8192, # Allow long reasoning
)
)
Why: Best-in-class reasoning, shows work, highest accuracy on complex problems.
Multilingual Applications
Recommended: Qwen2.5-72B
Why: 29+ language support, best non-English performance.
Edge / Mobile Deployment
Recommended: Llama 3.2 3B or Phi-4-mini
# llama.cpp on mobile
./llama-cli \
-m llama-3.2-3b-instruct-q4_k_m.gguf \
-ngl 0 \
-t 4 \
-c 2048
# -ngl 0 keeps everything on CPU, -t 4 uses four threads, -c 2048 trims context to save memory
Why: Best quality at small size, optimized for resource constraints.
RAG / Knowledge Base
Recommended: Llama 3.3 70B with good retrieval
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
# Use open embeddings too
embeddings = HuggingFaceEmbeddings(
model_name="BAAI/bge-large-en-v1.5"
)
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
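To complete the loop, retrieved chunks get stuffed into the prompt before generation. A minimal sketch continuing from the retriever above; the prompt wording is illustrative, and llm is a vLLM instance as in the serving example earlier.
from vllm import SamplingParams

question = "What is our refund policy for annual plans?"      # example query
docs = retriever.invoke(question)                              # retriever defined above
context = "\n\n".join(d.page_content for d in docs)
prompt = (
    "Answer using only the context below. If the answer is not there, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
response = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=512))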
Why: Strong instruction following, good at synthesizing retrieved context.
Architecture Recommendations
Single GPU (24GB)
Hardware: RTX 4090 or L40S
Model: Qwen2.5-32B-AWQ or Llama 3.1 8B FP16
Engine: vLLM or llama.cpp
Throughput: ~30-50 tokens/sec
Dual GPU (48GB)
Hardware: 2x RTX 4090 or 2x A10G
Model: Llama 3.3 70B AWQ
Engine: vLLM with tensor_parallel_size=2
Throughput: ~40-60 tokens/sec
Production Cluster
Hardware: 4x A100 80GB or 4x H100
Model: Llama 3.3 70B FP16 or DeepSeek-V3
Engine: vLLM or TensorRT-LLM
Throughput: ~200+ tokens/sec
Fine-Tuning Open Models
When to Fine-Tune
Good candidates:
- Domain-specific terminology/style
- Consistent output format requirements
- Task-specific performance improvement
- Reducing prompt length
Skip fine-tuning when:
- Few-shot prompting works well
- Data is limited (<1000 examples)
- Requirements change frequently
- General knowledge is sufficient
Fine-Tuning Approaches
LoRA (Low-Rank Adaptation)
LoRA freezes the base model and adds small trainable matrices to attention layers. Instead of updating billions of parameters, you train ~0.1% of them. This makes fine-tuning feasible on consumer GPUs while preserving most of the base model's knowledge.
The key parameters: r controls the rank (higher = more parameters = more expressivity), lora_alpha is a scaling factor (typically 2x r), and target_modules specifies which layers to adapt.
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)
# Trainable params: ~0.1% of total
Memory: ~16GB for a 7B model; a 70B model still needs its full 16-bit weights resident (~140GB+), so use multiple GPUs or QLoRA
QLoRA (Quantized LoRA)
QLoRA combines 4-bit quantization with LoRA, enabling 30B-class fine-tuning on a single 24GB GPU and 65-70B-class on a single 48GB GPU. The base model is loaded in 4-bit precision while LoRA adapters train in full precision. The "nf4" quantization type is specifically designed for normally-distributed weights.
This is the practical choice for most teams: fine-tune large models without expensive hardware, then merge adapters with the base model for deployment.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model_name = "meta-llama/Llama-3.3-70B-Instruct"  # example base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)
# Then apply LoRA as in the previous section
Memory: ~8GB for a 7B model, ~48GB for a 70B model
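After training, adapters are usually merged back into the base weights for serving. A minimal sketch with PEFT; the paths and model id are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "./lora-adapter").merge_and_unload()  # fold LoRA into base weights
merged.save_pretrained("./merged-model")  # deployable with vLLM, or convert to GGUF for llama.cpp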
Fine-Tuning Frameworks
| Framework | Best For | Complexity |
|---|---|---|
| Axolotl | Production fine-tuning | Medium |
| LLaMA-Factory | Quick experiments | Low |
| HuggingFace TRL | Research, flexibility | Medium |
| Unsloth | Speed, memory efficiency | Low |
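For teams reaching straight for HuggingFace TRL, a minimal LoRA SFT run looks roughly like the sketch below; the dataset path and hyperparameters are placeholders, and the SFTTrainer/SFTConfig API has shifted across TRL versions, so check your installed release.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # expects a "text" column
trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(output_dir="./sft-out", per_device_train_batch_size=2, num_train_epochs=1),
)
trainer.train()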
Staying Current
Benchmarks to Monitor
| Benchmark | What It Measures | Where to Find |
|---|---|---|
| LMSYS Chatbot Arena | Human preference | lmarena.ai (formerly chat.lmsys.org) |
| Open LLM Leaderboard | Automated benchmarks | HuggingFace |
| HumanEval / MBPP | Code generation | Papers/releases |
| MATH / GSM8K | Mathematical reasoning | Papers/releases |
| MMLU | General knowledge | Papers/releases |
Information Sources
- HuggingFace Model Hub: New model releases
- r/LocalLLaMA: Community testing and discussion
- Twitter/X: Researcher announcements
- Model provider blogs: Official benchmarks
- Papers With Code: Research tracking
Conclusion
Open-source LLMs have reached production maturity. The gap with proprietary models has narrowed from 17.5 to 0.3 percentage points—open-source now achieves 85-90% of frontier performance:
- Llama 4: MoE architecture, massive context (10M for Scout), best ecosystem—but EU-restricted
- Qwen3: Best multilingual (119 languages), hybrid reasoning, fully Apache 2.0, trained on 36T tokens
- DeepSeek V3.2: Reasoning-first for agents, V3.2-Speciale wins gold at IMO/IOI 2025
- Mistral Large 3: Strongest Apache 2.0 model outside China (675B MoE), 256K context, vision capabilities
- GPT-OSS-120B: OpenAI's return to open-source (Apache 2.0), most intelligent American open-weight model
- Kimi K2: Best value—1T params at $0.15/1M input; K2 Thinking beats GPT-5 and Claude 4.5
- GLM-4.7: Highest open-source SWE-bench (73.8%), $3/month, "preserved thinking" for coding
- MiniMax M2.1: Advanced interleaved thinking, 204K context, multilingual coding champion
- Grok 2.5: xAI's 270B open release, with Grok 3 coming soon
- Falcon H1: Hybrid Transformer+Mamba, 262K context, 34B matches 70B performance
For many applications, open-source is now the better choice for cost, privacy, and control.
Recommendation:
- General use: Llama 4 Maverick, Qwen3-235B, Mistral Large 3, or GPT-OSS-120B
- EU users: Qwen3, Mistral Large 3, GPT-OSS-120B, or GLM-4.7 (Llama 4 prohibited)
- Reasoning: DeepSeek V3.2-Speciale, Kimi K2 Thinking, GPT-OSS-120B (high effort), or DeepSeek-R1-0528
- Code: GLM-4.7 (73.8% SWE-bench), MiniMax M2.1, or Qwen3-Coder-480B
- Agentic: Kimi K2 (65.8% SWE-bench, tool use), GPT-OSS-120B, MiniMax M2.1, or GLM-4.7
- Vision: GLM-4.6V (native tool calling), Mistral Large 3, or Llama 4 Maverick
- Long context: MiniMax-01 (4M tokens), Falcon H1-34B (262K), MiniMax-M1 (1M), or Llama 4 Scout (10M)
- Edge/Efficient: Falcon H1 (0.5B-34B), Falcon 3 (1B-10B), GPT-OSS-20B, Llama 3.2 3B, or Qwen3-4B
- Apache 2.0 only: Qwen3, Mistral Large 3, GPT-OSS, or Falcon
Related Articles
LLM Inference Optimization: From Quantization to Speculative Decoding
A comprehensive guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.
Small Language Models: Edge Deployment and Knowledge Distillation
The rise of Small Language Models (SLMs)—from Llama 3.2 to Phi-4 to Qwen 2.5. Understanding knowledge distillation, quantization, and deploying AI at the edge.
Fine-Tuning vs Prompting: When to Use Each
A practical guide to deciding between fine-tuning and prompt engineering for your LLM application, based on real-world experience with both approaches.