
Open-Source LLMs: The Complete 2025 Guide

A comprehensive guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.


The Open-Source Revolution

Open-source models have reached a critical inflection point. They now rival—and sometimes exceed—proprietary alternatives in many tasks while offering complete control over deployment, customization, and costs.

Why the gap closed so quickly: In 2023, open-source models were clearly inferior—GPT-4 seemed unreachable. Two factors changed this: (1) Training recipes became public. Llama 2's technical report, InstructGPT's RLHF paper, and countless fine-tuning guides democratized the "secret sauce." (2) Scale became accessible. Cloud providers offer H100 clusters at reasonable rates; training a 70B model costs ~$1M instead of ~$100M. The frontier is still proprietary, but the gap is months, not years.

The strategic calculus has flipped: When open models were 2+ years behind, the choice was easy—use APIs for quality. Now the calculation is different. At 100K+ requests/day, self-hosting costs 10x less. For sensitive data, on-premises is the only option. For latency-critical applications, local inference wins. Open-source went from "not good enough" to "often the right choice."

From research: "Leading open-source models like Llama 4 Maverick, DeepSeek V3.2, Mistral Large 3, GPT-OSS-120B, Kimi K2 Thinking, and GLM-4.7 now match or exceed GPT-4o and Claude 3.5 Sonnet on most benchmarks, with some beating GPT-5 and Claude 4.5 on specific tasks."

2025 highlights:

  • Llama 4 introduces MoE architecture with 10M context (Scout) and matches DeepSeek V3 at half the active parameters
  • Qwen3 brings hybrid reasoning modes—"thinking" and "non-thinking"—under Apache 2.0, trained on 36T tokens in 119 languages
  • DeepSeek V3.2 is reasoning-first for agents, V3.2-Speciale wins gold at IMO/IOI 2025
  • Mistral Large 3 (675B MoE, 41B active) is the strongest Apache 2.0 model outside China with 256K context
  • GPT-OSS-120B marks OpenAI's return to open-source—first open weights since GPT-2 (2019), Apache 2.0
  • Kimi K2 (1T params, 32B active) from Moonshot AI sets new agentic benchmarks; K2 Thinking beats GPT-5 and Claude 4.5 Sonnet
  • GLM-4.7 from Zhipu AI achieves highest open-source SWE-bench (73.8%) and costs $3/month
  • Grok 2.5 from xAI open-sourced (270B), with Grok 3 expected to follow
  • Falcon H1 (TII) brings hybrid Transformer+Mamba architecture with 262K context; 34B matches 70B models

This guide provides everything you need to choose, deploy, and optimize open-source LLMs for production use.

Why Open-Source Matters

The Strategic Case

| Benefit | Description | Impact |
|---|---|---|
| Cost at Scale | No per-token API fees | 10-100x cheaper at high volume |
| Data Privacy | Data never leaves your infrastructure | Regulatory compliance, IP protection |
| Customization | Full fine-tuning capability | Domain-specific optimization |
| Control | No vendor lock-in or API changes | Business continuity |
| Transparency | Inspect weights and behavior | Debugging, safety analysis |
| Latency | Local inference, no network round-trip | Real-time applications |

When to Use Open-Source

Strong fit:

  • High-volume applications (>100K requests/day)
  • Sensitive data that can't leave your infrastructure
  • Need for fine-tuning on proprietary data
  • Edge deployment or offline operation
  • Cost-sensitive applications
  • Applications requiring low latency (<100ms)

Consider proprietary APIs when:

  • Prototyping and MVP development
  • Variable, unpredictable load
  • Need absolute frontier capabilities
  • Limited ML engineering resources
  • Rapid iteration more important than cost

Cost Comparison Example

For 1 million tokens/day (30M tokens/month):

| Option | Monthly Cost | Notes |
|---|---|---|
| GPT-4o API | ~$450 | $15/1M input tokens |
| Claude 3.5 Sonnet | ~$270 | $3/1M input, $15/1M output |
| Self-hosted Llama 70B | ~$150-300 | A100 cloud instance |
| Self-hosted on-prem | $50-100 | After hardware amortization |

At 10M tokens/day, self-hosting typically saves 5-10x.
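
To sanity-check these numbers for your own workload, a rough break-even calculation is enough. The sketch below uses illustrative prices; the $15/1M blended API rate and $4/hr A100 figure are assumptions taken from the tables above, not vendor quotes:

Python
def monthly_api_cost(tokens_per_day: float, usd_per_million: float) -> float:
    """API spend over a 30-day month at a blended per-token price."""
    return tokens_per_day * 30 / 1_000_000 * usd_per_million

def breakeven_tokens_per_day(gpu_usd_per_hour: float, usd_per_million: float) -> float:
    """Daily token volume at which a dedicated GPU matches API spend."""
    monthly_gpu = gpu_usd_per_hour * 24 * 30
    return monthly_gpu / (usd_per_million * 30 / 1_000_000)

print(monthly_api_cost(1_000_000, 15))      # ~$450/month, matching the table above
print(breakeven_tokens_per_day(4.0, 15))    # ~6.4M tokens/day to justify a $4/hr A100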

The Major Model Families

Llama (Meta)

Meta's Llama series has become the de facto standard for open-source LLMs, with the broadest ecosystem support.

Model Lineup

Llama 4 (April 2025): From Meta: "Llama 4 is our most intelligent model family, designed to enable people to build more personalized multimodal experiences."

The first Llama family using Mixture-of-Experts (MoE) architecture:

  • Llama 4 Scout: 17B active / 109B total parameters (16 experts), 10M context window
  • Llama 4 Maverick: 17B active / 400B total parameters (128 experts), 1M context window
  • Llama 4 Behemoth: 288B active / ~2T total parameters (in training)

From Meta: "Scout is the best multimodal model in the world in its class, fitting on a single NVIDIA H100. Maverick beats GPT-4o and Gemini 2.0 Flash across benchmarks while achieving results comparable to DeepSeek V3 at less than half the active parameters."

⚠️ EU Restriction: Llama 4 is prohibited for users domiciled in the EU due to AI and data privacy regulations.

Llama 3.3 70B (December 2024): From Meta: "Offers performance comparable to the 405B parameter model at a fraction of the computational cost."

  • 70B parameters
  • 128K context window
  • Instruction-tuned
  • Multilingual support (8 languages)

Llama 3.2 (September 2024):

  • Vision models: 11B and 90B with image understanding
  • Edge models: 1B and 3B for mobile/edge deployment
  • Multimodal capabilities integrated

Llama 3.1 (July 2024):

  • 8B, 70B, and 405B parameter versions
  • 405B is the largest open model available
  • 128K context window
  • Improved multilingual and tool use

Benchmarks

| Benchmark | Llama 3.3 70B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMLU (5-shot) | 86.0 | 88.7 | 88.7 |
| HumanEval | 88.4 | 90.2 | 92.0 |
| GSM8K | 93.0 | 95.3 | 96.4 |
| MATH | 77.0 | 76.6 | 78.3 |

Strengths

  • Ecosystem: Widest tool support (vLLM, llama.cpp, TensorRT-LLM, all major frameworks)
  • Documentation: Best-documented training process in open-source
  • Fine-tuning: Extensive fine-tuning ecosystem (Axolotl, LLaMA-Factory, TRL)
  • Stability: Most battle-tested in production
  • Community: Largest community, most available fine-tunes

Weaknesses

  • License restrictions: 700M MAU limit for commercial use
  • Multilingual: Good but not best-in-class (8 languages vs Qwen's 29+)
  • Reasoning: Trails DeepSeek R1 on complex reasoning

License

Llama 3 Community License:

  • Free for commercial use under 700M monthly active users
  • Above threshold requires separate commercial license from Meta
  • Allows modification and redistribution
  • Requires attribution

Llama 4 License:

  • Free for commercial use under 700M monthly active users
  • EU users prohibited from using or distributing Llama 4 models
  • Multimodal capabilities (Scout, Maverick) under same restrictions

Best For

  • General-purpose chat applications
  • Production deployments needing stability
  • Teams wanting maximum ecosystem support
  • RAG systems with good general knowledge
  • Applications needing vision capabilities (Llama 3.2)

Qwen (Alibaba)

Alibaba's Qwen family excels in multilingual support and specialized variants, with fully permissive licensing.

Model Lineup

Qwen3 (April 2025): From Alibaba: "Qwen3 introduces hybrid reasoning—models can 'think' through complex problems or answer quickly, matching OpenAI's o3 reasoning capability."

  • Qwen3-235B-A22B: 235B total, 22B active (MoE), 128K context
  • Qwen3-32B: Dense model, 128K context
  • Qwen3-14B, 8B, 4B, 1.7B, 0.6B: Smaller variants
  • Trained on 36 trillion tokens in 119 languages
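
The hybrid thinking mode described above is toggled per request through the chat template. A minimal sketch, assuming the `enable_thinking` template flag documented for Qwen3's tokenizer (the 8B model ID is just one of the variants):

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "What is 17 * 23?"}]

# Thinking mode (default): the model reasons in a <think>...</think> block first
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: direct answer, lower latency
prompt_fast = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)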

Qwen3 Latest (Late 2025):

  • Qwen3-2507: 1M token context support (August 2025)
  • Qwen3-Max: Outperforms Claude 4 Opus non-thinking, DeepSeek V3.1 (September 2025)
  • Qwen3-Next: Apache 2.0 licensed Instruct and Thinking models (September 2025)
  • Qwen3-Coder-480B-A35B: 480B MoE with 35B active—most agentic code model

Qwen2.5 (September 2024):

  • Qwen2.5-72B: Flagship dense model
  • Qwen2.5-Coder: State-of-the-art code model
  • Qwen2.5-Math: Mathematical reasoning specialist
  • Trained on 29+ languages

Qwen2.5-VL (Vision-Language):

  • 3B, 7B, 72B vision-language models
  • Native image and video understanding
  • Document/chart analysis

Benchmarks

| Benchmark | Qwen2.5-72B | Qwen3-235B | Llama 3.3 70B |
|---|---|---|---|
| MMLU | 86.1 | 88.2 | 86.0 |
| HumanEval | 86.6 | 89.5 | 88.4 |
| MATH | 83.1 | 86.7 | 77.0 |
| Multilingual Avg | 75.6 | 79.3 | 68.2 |

Strengths

  • Multilingual: Best-in-class support for 29+ languages
  • Specialized models: Coder, Math, VL variants
  • MoE efficiency: 22B active parameters with 235B quality
  • License: Fully Apache 2.0 (no restrictions)
  • Asian languages: Best Chinese, Japanese, Korean support

Weaknesses

  • Ecosystem: Smaller than Llama ecosystem
  • Documentation: Less extensive than Meta's
  • Community: Fewer available fine-tunes

License

Apache 2.0:

  • Fully permissive commercial use
  • No user limits
  • No attribution required
  • Can modify and redistribute freely

Best For

  • Multilingual applications
  • Coding assistants (Qwen2.5-Coder)
  • Mathematical reasoning (Qwen2.5-Math)
  • Asian market applications
  • Teams wanting fully permissive licensing
  • Vision-language applications (Qwen2.5-VL)

DeepSeek

Chinese AI lab that achieved breakthrough reasoning capabilities with innovative training approaches.

Model Lineup

DeepSeek-V3.2 (December 2025): From DeepSeek: "V3.2 is our reasoning-first model built for agents, performing comparably to GPT-5."

  • 671B total parameters (37B active)
  • DSA (DeepSeek Sparse Attention): Efficient attention for long-context scenarios
  • First DeepSeek model to integrate thinking directly into tool-use
  • Massive agent training: 1,800+ environments, 85k+ complex instructions

DeepSeek-V3.2-Speciale:

  • High-compute variant surpassing GPT-5, on par with Gemini 3 Pro
  • Gold medal in 2025 IMO and IOI, CMO, ICPC World Finals
  • Designed exclusively for deep reasoning (no tool-calling)
  • MIT License

DeepSeek-V3.1 (August 2025): From DeepSeek: "V3.1 combines the strengths of V3 and R1 into a single hybrid model with 'thinking' and 'non-thinking' modes."

  • 671B total parameters (37B active)
  • 128K context window
  • Hybrid thinking mode: Switch between chain-of-thought reasoning (like R1) and direct answers (like V3) via chat template
  • One model covers both general-purpose and reasoning-heavy use cases

DeepSeek-V3 (December 2024):

  • 671B total parameters (MoE)
  • 37B active parameters per token
  • Pre-training cost: $5.6M (remarkably efficient)
  • Strong general capabilities

DeepSeek-R1-0528 (May 2025): From DeepSeek: "R1-0528 features major improvements in inference and hallucination reduction, with performance approaching O3 and Gemini 2.5 Pro."

  • Updated reasoning model with structured JSON output
  • Built-in function-calling capabilities
  • Improved math, code, and logic benchmarks
  • 671B total, 37B active

DeepSeek-R1 (January 2025): From DeepSeek: "DeepSeek-R1 matches OpenAI-o1-0912 on math benchmarks with open weights."

  • Reasoning model trained with pure RL (GRPO)
  • Extended thinking capabilities
  • Shows chain-of-thought reasoning in <think> tags
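
Since the reasoning arrives inline, applications usually strip the <think> block before showing the final answer. A minimal parsing sketch; the tag format follows R1's documented output, and the sample string is purely illustrative:

Python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the <think>...</think> chain-of-thought from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

raw = "<think>17 * 23 = 391. Check: 17*20=340, 17*3=51, 340+51=391.</think>The answer is 391."
reasoning, answer = split_reasoning(raw)
print(answer)  # "The answer is 391."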

DeepSeek-R1-Distill Series:

  • Distilled versions at 1.5B, 7B, 8B, 14B, 32B, 70B
  • Retain significant reasoning capability
  • Run on consumer hardware

From research: "DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks."

DeepSeek-Coder-V2:

  • 236B MoE (21B active)
  • State-of-the-art code understanding
  • 128K context window

Benchmarks

| Benchmark | DeepSeek-R1 | o1-preview | DeepSeek-V3 |
|---|---|---|---|
| AIME 2024 | 79.8% | 74.4% | 39.2% |
| MATH-500 | 97.3% | 96.4% | 90.2% |
| Codeforces | 2029 Elo | 1891 Elo | 1134 Elo |
| MMLU | 90.8 | 90.8 | 88.5 |

Strengths

  • Reasoning: Best-in-class mathematical and logical reasoning
  • Code: Excellent code generation and understanding
  • Efficiency: Remarkably cost-effective training
  • Transparency: Published detailed training methodology
  • Distillation: High-quality smaller models available

Weaknesses

  • Inference cost: Full R1 requires significant compute
  • Latency: Reasoning models produce many tokens
  • Language focus: Stronger in English/Chinese than other languages

License

DeepSeek License (MIT-like):

  • Free for commercial use
  • Some restrictions on specific use cases
  • Open weights available

Best For

  • Complex reasoning tasks
  • Mathematical problem-solving
  • Code generation and analysis
  • Research and analysis applications
  • Teams willing to trade latency for accuracy

Mistral

French AI company known for efficient, high-quality models with strong European presence. With Mistral 3, they've released the strongest fully open-weight model (Apache 2.0) developed outside China.

Model Lineup

Mistral Large 3 (December 2025): From Mistral: "Mistral Large 3 is our most capable model to date—a sparse mixture-of-experts trained with 41B active and 675B total parameters."

  • 675B total parameters (MoE), 41B active
  • 256K context window
  • Native vision capabilities (image analysis)
  • Multilingual: English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic
  • Best-in-class agentic capabilities with native function calling and JSON output
  • Apache 2.0 license (fully open)
  • Trained from scratch on 3000 NVIDIA H200 GPUs
  • Top open-source coding model on LMArena leaderboard

Mistral 3 Small Models (December 2025):

  • Mistral 3 14B, 8B, 3B dense models
  • All Apache 2.0 licensed
  • Optimized for efficiency

Mistral Large 2 (July 2024):

  • 123B parameters
  • 128K context window
  • Strong function calling
  • Multilingual (dozens of languages)

Mistral Small 3 (January 2025): From Mistral: "Mistral Small 3 (24B) achieves state-of-the-art capabilities comparable to larger models."

  • 24B parameters
  • Optimized for efficiency
  • Strong reasoning for size

Mixtral 8x22B:

  • MoE architecture: 176B total, 39B active
  • Strong general performance
  • Good efficiency

Codestral (May 2024):

  • 22B parameters
  • Specialized for code
  • 32K context window

Ministral (October 2024):

  • 3B and 8B models
  • Edge deployment focus
  • Strong for size

Benchmarks

| Benchmark | Mistral Large 3 | Mistral Large 2 | Mistral Small 3 |
|---|---|---|---|
| MMLU | 88.2 | 84.0 | 81.0 |
| HumanEval | 94.3 | 92.1 | 84.5 |
| MATH | 82.1 | 76.5 | 69.4 |
| GSM8K | 95.8 | 91.0 | 86.2 |

Strengths

  • Apache 2.0: Mistral Large 3 and Mistral 3 models are fully open (no restrictions)
  • Function calling: Best-in-class structured output and agentic capabilities
  • Efficiency: 41B active parameters with 675B-class quality
  • European: GDPR considerations, EU-based, no regional restrictions
  • Vision: Native multimodal capabilities in Large 3
  • Production-ready: Focus on deployment, NVFP4 checkpoint available

Weaknesses

  • Hardware requirements: Large 3 needs 8×H100 or Blackwell NVL72 for full precision
  • Smaller ecosystem: Less community activity than Llama
  • Legacy licensing: Older models (Codestral) have commercial restrictions

License

Mistral 3 Family: Apache 2.0 (fully permissive)

  • Mistral Large 3
  • Mistral 3 14B, 8B, 3B

Legacy models:

  • Apache 2.0: Mistral 7B, Mixtral
  • Commercial: Codestral, older Large versions

Best For

  • European deployments (GDPR compliance, no regional restrictions)
  • Function calling / structured output / agentic applications
  • Teams wanting frontier performance with Apache 2.0 license
  • Production systems needing vision + text capabilities
  • Organizations requiring fully open weights outside China

Other Notable Models

Kimi K2 (Moonshot AI)

State-of-the-art agentic model from a Beijing startup (Alibaba-backed). One of the most cost-effective frontier models available.

Kimi K2 (July 2025): From Moonshot: "Kimi K2 is designed for tool use, reasoning, and autonomous problem-solving."

  • 1 trillion total parameters, 32B active (MoE)
  • State-of-the-art among non-thinking models for knowledge, math, coding
  • Trained with MuonClip optimizer on 15.5T tokens with zero instability
  • Pre-trained at unprecedented 1T scale with novel optimization techniques
  • SWE-bench Verified: 65.8% pass@1 (single-attempt patches)
  • SWE-bench Multilingual: 47.3% pass@1
  • Surpasses Claude Opus 4 on two benchmarks
  • Better overall performance than GPT-4.1 on coding metrics
  • Modified MIT license
  • Pricing: $0.15/1M input tokens, $2.50/1M output tokens (roughly 100x cheaper than Claude Opus 4 input)

Kimi K2 Thinking (November 2025): From Moonshot: "The most powerful open source thinking model in the Kimi series to date."

  • Outperforms GPT-5, Claude Sonnet 4.5 (Thinking), and Grok-4 on reasoning, coding, and agentic benchmarks
  • Automatically selects 200-300 tools for autonomous task completion
  • Training cost: $4.6M
  • Fully open-source despite beating proprietary competitors
  • Top position in reasoning and coding evaluations

Model Variants:

  • Kimi-K2-Base: Foundation model for researchers wanting full control for fine-tuning
  • Kimi-K2-Instruct: Post-trained model for general-purpose chat and agentic experiences

GLM-4 Series (Zhipu AI / Z.ai)

China's answer to Claude Code, open-sourced by a Tsinghua University spinoff. A rapidly evolving series with strong coding and vision capabilities.

GLM-4.7 (December 22, 2025): From Z.ai: "GLM-4.7 achieves the highest SWE-bench Verified score among open-source models at 73.8%."

  • ~400B parameters, 200K context, 128K output
  • Open-weight 32B and 9B variants (base, reasoning, rumination)
  • SWE-bench Verified: 73.8% (highest open-source)
  • LiveCodeBench: 84.9% (beats Claude Sonnet 4.5)
  • AIME 2025: 95.7%
  • HLE (Humanity's Last Exam): 42.8% (outperforms GPT-5.1)
  • Code Arena: Rank #1 among open-source and Chinese models
  • "Preserved Thinking": Maintains reasoning chains across multiple turns (addresses biggest frustration in AI-assisted coding)
  • $3/month or free locally via HuggingFace/ModelScope with vLLM or SGLang

Agentic capabilities:

  • BrowseComp: 67.5 (web tasks)
  • τ²-Bench: 87.4 (interactive tool use)—new open-source SOTA, surpasses Claude Sonnet 4.5

GLM-4.6V (December 2025): From Z.ai: "A 128K context vision-language model with native tool calling."

  • GLM-4.6V: 106B parameters for cloud-scale inference
  • GLM-4.6V-Flash: 9B parameters for low-latency, local applications
  • 128K context window
  • Native multimodal function calling: Images, screenshots, and document pages pass directly as tool parameters
  • Tools can return visual outputs (search grids, charts, web pages, product images)
  • Model fuses visual outputs with text in the same reasoning chain
  • Optimized for frontend automation and multimodal reasoning

GLM-4.5 (Mid-2025):

  • 355B total parameters, 32B active (MoE)
  • Supports reasoning, tool use, coding, and agentic behaviors
  • GLM-4.5-Air: Smaller sibling for efficiency
  • Fast despite large parameter count due to MoE

MiniMax

Chinese startup with innovative attention mechanisms. Pioneering hybrid-attention and long-context reasoning.

MiniMax M2.1 (December 25, 2025): From MiniMax: "Significantly enhanced multi-language programming, built for real-world complex tasks."

  • Sparse MoE architecture with 10B active parameters
  • 204,800 token context window
  • First open-source model with Advanced Interleaved Thinking (separates reasoning from response)
  • Multi-language programming: Rust, Java, Go, C++, Kotlin, Objective-C, TypeScript, JavaScript
  • SWE-multilingual: 72.5%
  • VIBE aggregate: 88.6 (VIBE-Web: 91.5, VIBE-Android: 89.7)
  • Outperforms Claude Sonnet 4.5, approaches Claude Opus 4.5 in multilingual scenarios
  • ~10% the price of Claude Sonnet
  • 90 tokens/sec on RTX5090 with vLLM and FP8 weights
  • Runs on 4×A100 GPUs
  • Modified-MIT license, weights on HuggingFace
  • Inference: SGLang or vLLM (temp=1.0, top_p=0.95, top_k=40)

MiniMax-M1 (June 2025):

  • World's first large-scale hybrid-attention reasoning model
  • 1M token context (8x DeepSeek R1)
  • 80K token reasoning output
  • CISPO: RL algorithm 2x faster than DAPO

MiniMax-01:

  • 456B total parameters (45.9B active)
  • 4M token context (20-32x other models)

Google Gemma

Open version of Gemini technology:

  • Gemma 2 (2B, 9B, 27B)
  • Strong for size class
  • Good efficiency
  • Research-friendly license

Microsoft Phi

Quality-focused small models:

  • Phi-4 (14B): "Competitive with much larger models"
  • Trained primarily on synthetic data
  • Excellent for size
  • MIT license

HuggingFace SmolLM

Fully transparent small models:

  • SmolLM2 (135M, 360M, 1.7B)
  • Complete training transparency
  • Research-friendly
  • Apache 2.0 license

Cohere Command R

Enterprise-focused:

  • Command R+ (104B)
  • Strong RAG capabilities
  • Enterprise features
  • Commercial focus

GPT-OSS (OpenAI)

OpenAI's first open-weight release since GPT-2 (2019). A landmark moment for open-source AI.

GPT-OSS-120B (August 2025): From OpenAI: "State-of-the-art open-weight language models that deliver strong real-world performance at low cost."

  • 117B total parameters, 5.1B active (MoE)
  • Fits on a single 80GB GPU (H100 or AMD MI300X)
  • Trained using large-scale distillation and reinforcement learning
  • Three reasoning effort levels: low, medium, high (trade-off latency vs performance)
  • Full chain-of-thought access for debugging
  • Native agentic capabilities: function calling, web browsing, Python execution, structured outputs
  • Outperforms o3-mini, matches/exceeds o4-mini on MMLU (90%), GPQA (~80%), AIME 2024/2025
  • Apache 2.0 license (fully permissive)
  • Available on HuggingFace, GitHub, LM Studio, Ollama
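
A minimal sketch of selecting the reasoning effort through an OpenAI-compatible endpoint. The local URL, model ID, and the "Reasoning: high" system-prompt convention are assumptions here; check the gpt-oss model card for the exact format your serving stack expects:

Python
from openai import OpenAI

# Assumes gpt-oss-120b is already served locally (e.g. via vLLM or Ollama)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        # Reasoning effort (low / medium / high) is commonly set in the system prompt
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Prove that the sum of two odd numbers is even."},
    ],
)
print(response.choices[0].message.content)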

GPT-OSS-20B:

  • Smaller variant for edge/efficiency use cases
  • Same Apache 2.0 license

From Artificial Analysis: "The most intelligent American open weights model."

Grok (xAI)

Elon Musk's xAI open-source releases:

Grok 2.5 (August 2025):

  • ~270B parameters
  • Trained on text-based reasoning tasks
  • Model weights (~500 GB across 42 files) + tokenizer
  • Custom license: Grok 2 Community License Agreement
    • Allows commercial and non-commercial use
    • Restriction: Cannot use to train other AI models
    • Revocable license (less permissive than Apache 2.0)
  • Grok 3 expected to follow in ~6 months

Grok-1 (March 2024):

  • 314B parameter MoE
  • Apache 2.0 license (fully permissive)
  • Historical significance: Early frontier open-source release

Falcon 3 & H1 (TII)

UAE's Technology Innovation Institute. Focus on efficient models that run on consumer hardware.

Falcon 3 (December 2024):

  • Model sizes: 1B, 3B, 7B, 10B (Base and Instruct variants)
  • Trained on 14 trillion tokens (2.5x predecessor)
  • 32K context (8K for 1B)
  • #1 on HuggingFace leaderboard at launch (for size class)
  • Beats Llama-3.1-8B, Qwen2.5-7B, Mistral NeMo-12B, Gemma2-9B
  • TII Falcon License (Apache 2.0-based, permissive)
  • Runs on laptops and light infrastructure

Falcon H1 (2025): From TII: "A family of hybrid-head language models redefining efficiency and performance."

  • Model sizes: 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, 34B (+ Instruct variants)
  • Hybrid architecture: Transformer attention + State Space Model (Mamba-2)
  • 262K context window
  • 18 languages: English, Chinese, Japanese, Korean, German, French, Spanish, etc.
  • Falcon-H1-34B matches/outperforms 70B models (Qwen3-32B, Qwen2.5-72B, Llama3.3-70B)
  • Falcon-H1-1.5B-Deep rivals current 7B-10B models
  • Falcon-H1-0.5B performs like 2024's 7B models
  • Native support in llama.cpp, axolotl, llama-factory, unsloth
  • 55+ million downloads to date
  • Permissive open-source license

Comprehensive Benchmark Comparison

General Capabilities

| Model | MMLU | HumanEval | GSM8K | MATH | HellaSwag |
|---|---|---|---|---|---|
| Llama 3.3 70B | 86.0 | 88.4 | 93.0 | 77.0 | 88.6 |
| Qwen2.5-72B | 86.1 | 86.6 | 91.6 | 83.1 | 88.5 |
| DeepSeek-V3 | 88.5 | 82.6 | 89.3 | 90.2 | 88.9 |
| Mistral Large 2 | 84.0 | 92.1 | 91.0 | 76.5 | 87.1 |
| DeepSeek-R1 | 90.8 | 86.7 | 97.3 | 97.3 | - |
| GPT-4o (reference) | 88.7 | 90.2 | 95.3 | 76.6 | 95.3 |
| Claude 3.5 Sonnet | 88.7 | 92.0 | 96.4 | 78.3 | 89.0 |

Coding Benchmarks

| Model | HumanEval | MBPP | LiveCodeBench | SWE-bench |
|---|---|---|---|---|
| DeepSeek-Coder-V2 | 90.2 | 80.4 | 43.4 | 22.0 |
| Qwen2.5-Coder-32B | 92.7 | 83.2 | 42.1 | 23.5 |
| Codestral | 81.1 | 78.2 | 36.2 | 18.3 |
| Llama 3.3 70B | 88.4 | 75.6 | 33.8 | 16.4 |

Reasoning Benchmarks

| Model | AIME 2024 | MATH-500 | GPQA | ARC-C |
|---|---|---|---|---|
| DeepSeek-R1 | 79.8% | 97.3% | 71.5 | 92.3 |
| Qwen3-235B | 54.7% | 86.7% | 59.1 | 89.4 |
| Llama 3.3 70B | 33.3% | 77.0% | 50.7 | 88.1 |
| o1-preview (reference) | 74.4% | 96.4% | 73.3 | - |

Chatbot Arena Rankings (Human Preference)

From LMSYS Chatbot Arena / LMArena (Late 2025). Rankings use 6M+ human votes with Elo-like scoring:

| Rank | Model | Arena Score | Notes |
|---|---|---|---|
| 1 | Claude Opus 4.5 | ~1450 | Proprietary frontier |
| 2 | GPT-5.2 | ~1440 | Proprietary frontier |
| 3 | Gemini 3 Pro | ~1430 | Proprietary frontier |
| 4 | Llama 4 Maverick | 1417 | Open-source |
| 5 | DeepSeek-V3.2 | ~1400 | Open-source |
| 6 | Mistral Large 3 | ~1395 | Open-source, top OS coding |
| 7 | Kimi K2 Thinking | ~1390 | Open-source |
| 8 | Qwen3-235B | ~1385 | Open-source |
| 9 | GLM-4.7 | ~1380 | Open-source |

Key insight: The gap between proprietary and open-source models has narrowed from 17.5 to just 0.3 percentage points on MMLU. Open-source models now achieve 85-90% of frontier performance.

Hardware Requirements

Memory Calculation

LLM memory requirements follow this formula:

Code
Memory (GB) ≈ Parameters (B) × Bytes per Parameter

FP32: 4 bytes/param → 7B model = 28 GB
FP16: 2 bytes/param → 7B model = 14 GB
INT8: 1 byte/param → 7B model = 7 GB
INT4: 0.5 bytes/param → 7B model = 3.5 GB

Add ~20% overhead for KV cache and runtime.
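
The same formula as a small helper, useful when sizing GPUs for a candidate model. The 20% overhead is the rule of thumb stated above, not a measured value:

Python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion: float, precision: str = "fp16",
                     overhead: float = 0.20) -> float:
    """Weights footprint plus ~20% for KV cache and runtime."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return round(weights_gb * (1 + overhead), 1)

print(estimate_vram_gb(70, "int4"))   # ~42 GB -> fits across 2x 24 GB GPUs
print(estimate_vram_gb(8, "fp16"))    # ~19.2 GB -> fits on a single 24 GB GPU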

Detailed Memory Requirements

| Model Size | FP16 | INT8 | INT4 (AWQ/GPTQ) | GGUF Q4_K_M |
|---|---|---|---|---|
| 1.5B | 3 GB | 1.5 GB | 1 GB | 1.1 GB |
| 7B | 14 GB | 7 GB | 4 GB | 4.4 GB |
| 8B | 16 GB | 8 GB | 4.5 GB | 5 GB |
| 13B | 26 GB | 13 GB | 7 GB | 8 GB |
| 32B | 64 GB | 32 GB | 16 GB | 19 GB |
| 70B | 140 GB | 70 GB | 35 GB | 40 GB |
| 405B | 810 GB | 405 GB | 203 GB | 230 GB |

GPU Recommendations by Use Case

Consumer Hardware

| GPU | VRAM | Can Run | Typical Use |
|---|---|---|---|
| RTX 3060 | 12 GB | 7B INT4, 3B FP16 | Development, small models |
| RTX 3090 | 24 GB | 13B INT4, 7B FP16 | Development, medium models |
| RTX 4090 | 24 GB | 32B INT4, 13B INT8 | Production-capable |
| 2x RTX 4090 | 48 GB | 70B INT4 | Production workloads |

Professional/Enterprise

| GPU | VRAM | Can Run | Typical Use |
|---|---|---|---|
| A10G | 24 GB | 32B INT4, 13B FP16 | Cloud inference |
| L40S | 48 GB | 70B INT4, 32B FP16 | Production inference |
| A100 40GB | 40 GB | 70B INT4, 32B FP16 | Training and inference |
| A100 80GB | 80 GB | 70B FP16 | Training and inference |
| H100 80GB | 80 GB | 70B FP16, optimized | High-throughput |
| 8x H100 | 640 GB | 405B FP16 | Frontier models |

From research: "With a 24 GB GPU (e.g., RTX 3090/4090), you can comfortably run 4-bit quantized versions of models up to ~40B parameters."

From research: "With Pro/Enterprise GPU (e.g., 48-80GB VRAM), you can comfortably run 70B models like Llama 3.1 and Qwen2 72B."

Apple Silicon

| Chip | Unified Memory | Can Run | Notes |
|---|---|---|---|
| M1/M2 (8GB) | 8 GB | 3B-7B Q4 | Basic use |
| M1/M2 Pro (16GB) | 16 GB | 13B Q4, 7B Q8 | Good development |
| M1/M2 Max (32GB) | 32 GB | 32B Q4, 13B Q8 | Serious development |
| M1/M2 Ultra (64GB) | 64 GB | 70B Q4 | Production-capable |
| M3 Max (128GB) | 128 GB | 70B Q8, 405B Q2 | High-end |

Deployment Options

Local Inference Engines

Ollama

Simplest local deployment:

Bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run models
ollama pull llama3.3:70b
ollama run llama3.3:70b "Explain transformers"

# API access
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b",
  "prompt": "Explain transformers"
}'
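
The same local server can also be called from application code. A minimal sketch using the `ollama` Python client package; the response structure may differ slightly between client versions:

Python
import ollama

response = ollama.chat(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain transformers"}],
)
print(response["message"]["content"])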

Pros: Simplest setup, automatic model management, good defaults
Cons: Less optimization control, Mac/Linux focus

llama.cpp

Maximum performance and control:

Bash
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j LLAMA_CUDA=1  # For NVIDIA GPUs

# Run
./llama-cli \
  -m models/llama-3.3-70b-instruct-q4_k_m.gguf \
  -p "Explain transformers" \
  -n 500 \
  --temp 0.7 \
  -ngl 99  # GPU layers

Pros: Best CPU performance, extensive quantization options, cross-platform
Cons: Manual model management, more complex setup

vLLM

Production serving with high throughput:

Python
from vllm import LLM, SamplingParams

# Single GPU
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    quantization="awq",
    gpu_memory_utilization=0.9,
)

# Multi-GPU
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=2,  # 2 GPUs
    quantization="awq",
)

# Generate
outputs = llm.generate(
    ["Explain transformers"],
    SamplingParams(temperature=0.7, max_tokens=500)
)

Pros: Highest throughput, PagedAttention, production-ready
Cons: GPU-only, requires more resources
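
In production, vLLM is usually run as an OpenAI-compatible server rather than embedded in-process, so existing OpenAI client code works unchanged. A minimal sketch; the local URL and served model name are placeholders:

Python
# Start the server separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 2
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers"}],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)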

TensorRT-LLM

Maximum NVIDIA performance:

Python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
outputs = llm.generate(
    ["Explain transformers"],
    SamplingParams(max_new_tokens=500)
)

Pros: Best NVIDIA performance, optimized kernels
Cons: NVIDIA-only, complex setup

Comparison of Inference Engines

| Engine | Best For | Throughput | Ease of Use |
|---|---|---|---|
| Ollama | Local development | Medium | Very Easy |
| llama.cpp | CPU, Apple Silicon | Medium | Medium |
| vLLM | Production serving | High | Medium |
| TensorRT-LLM | Max NVIDIA perf | Highest | Complex |
| HF Transformers | Flexibility | Low | Easy |

Cloud Deployment

Inference Providers

| Provider | Specialty | Pricing Model |
|---|---|---|
| Together AI | Wide model selection | Per-token |
| Fireworks AI | Speed, function calling | Per-token |
| Anyscale | Scalability | Per-token |
| Replicate | Ease of use | Per-second |
| Modal | Serverless | Per-second |
| Groq | Speed (LPU) | Per-token |

Self-Hosted Cloud

AWS:

YAML
# SageMaker endpoint
Instance: ml.g5.12xlarge (4x A10G)
Model: Llama 3.3 70B AWQ
Throughput: ~50 tokens/sec
Cost: ~$7/hour

GCP:

YAML
# Vertex AI
Instance: a2-highgpu-1g (A100 40GB)
Model: Qwen2.5-72B GPTQ
Throughput: ~60 tokens/sec
Cost: ~$4/hour

Quantization Formats

| Format | Tool | Precision | Best For |
|---|---|---|---|
| GGUF | llama.cpp | Q2-Q8 | CPU, Apple Silicon, flexibility |
| AWQ | vLLM, TGI | INT4 | NVIDIA GPUs, production |
| GPTQ | vLLM, TGI | INT4 | NVIDIA GPUs, production |
| GGML | Legacy | Q4-Q8 | Deprecated, use GGUF |
| bitsandbytes | HuggingFace | INT8/INT4 | Quick prototyping |
| ExLlamaV2 | Custom | Various | Optimized inference |
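
For quick prototyping, bitsandbytes lets you load a full-precision checkpoint in 4-bit on the fly, with no pre-quantized artifact needed. A minimal sketch; the 7B model ID is just an example:

Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Explain transformers", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))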

Quantization Quality Comparison

Impact on Llama 3.1 70B quality:

| Quantization | MMLU Impact | Speed Gain | Memory Reduction |
|---|---|---|---|
| FP16 (baseline) | 0% | 1x | 0% |
| INT8 | -0.5% | 1.5x | 50% |
| AWQ INT4 | -1.2% | 2x | 75% |
| GPTQ INT4 | -1.5% | 2x | 75% |
| GGUF Q4_K_M | -1.8% | 2.5x | 75% |
| GGUF Q2_K | -5.0% | 3x | 87% |

Licensing Deep Dive

Fully Permissive (Apache 2.0 / MIT)

Apache 2.0:

  • Qwen (all models)
  • Mistral (7B, Mixtral)
  • Gemma
  • SmolLM
  • Phi (MIT)

What you can do:

  • Use commercially without limits
  • Modify and create derivatives
  • Distribute without attribution
  • No revenue sharing required

Permissive with Limits

Llama 3 Community License:

  • Free for companies with <700M monthly active users
  • Must request license above threshold
  • Can modify and redistribute
  • Must include attribution

DeepSeek License:

  • Generally permissive
  • Some restrictions on specific use cases
  • Review terms for your use case

Research / Non-Commercial

Some models (check individual licenses):

  • May restrict commercial use
  • May require academic attribution
  • May have geographic restrictions

License Selection Guide

| Use Case | Recommended Models |
|---|---|
| Startup/SMB | Qwen (Apache 2.0), Llama (under 700M MAU) |
| Enterprise | Qwen, licensed Llama, Mistral |
| Research | Any (check publication requirements) |
| Products in Asia | Qwen (best Asian language support) |
| EU compliance focus | Mistral (European company) |

Practical Recommendations

By Use Case

General Chat / Customer Service

Recommended: Llama 3.3 70B or Qwen2.5-72B

For general-purpose chat, you need models with strong instruction following and broad knowledge. AWQ quantization reduces memory 4x with minimal quality loss, enabling 70B models on 2 GPUs.

Python
# vLLM serving
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
    top_p=0.9,
)

response = llm.generate([user_message], sampling_params)

Why: Best general capabilities, strong instruction following, extensive testing.

Coding Assistant

Recommended: DeepSeek-Coder-V2 or Qwen2.5-Coder-32B

Python
# With fill-in-the-middle
prompt = f"""<fim_prefix>{code_before}<fim_suffix>{code_after}<fim_middle>"""

response = llm.generate(prompt, SamplingParams(
    temperature=0.2,  # Lower for code
    max_tokens=512,
))

Why: Best code understanding, completion, and generation.

Complex Reasoning / Math

Recommended: DeepSeek-R1 or R1-Distill-32B

Python
# Allow extended thinking
response = llm.generate(
    "Solve this step by step: [problem]",
    SamplingParams(
        temperature=0.6,
        max_tokens=8192,  # Allow long reasoning
    )
)

Why: Best-in-class reasoning, shows work, highest accuracy on complex problems.

Multilingual Applications

Recommended: Qwen2.5-72B

Why: 29+ language support, best non-English performance.

Edge / Mobile Deployment

Recommended: Llama 3.2 3B or Phi-4-mini

Bash
# llama.cpp on mobile
# CPU only (-ngl 0), 4 threads, shorter context to save memory
./llama-cli \
  -m llama-3.2-3b-instruct-q4_k_m.gguf \
  -ngl 0 \
  -t 4 \
  -c 2048

Why: Best quality at small size, optimized for resource constraints.

RAG / Knowledge Base

Recommended: Llama 3.3 70B with good retrieval

Python
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Use open embeddings too
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5"
)

vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

Why: Strong instruction following, good at synthesizing retrieved context.
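
To complete the loop, the retrieved chunks are stuffed into the prompt before generation. A minimal continuation of the snippet above, reusing the vLLM `llm` and `SamplingParams` from the general-chat example; `retriever.invoke` follows recent LangChain, while older versions use `get_relevant_documents`:

Python
question = "How do transformers handle long-range dependencies?"
docs = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in docs)

prompt = (
    "Answer using only the context below. If the answer is not there, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
response = llm.generate([prompt], SamplingParams(temperature=0.3, max_tokens=512))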

Architecture Recommendations

Single GPU (24GB)

YAML
Hardware: RTX 4090 or L40S
Model: Qwen2.5-32B-AWQ or Llama 3.1 8B FP16
Engine: vLLM or llama.cpp
Throughput: ~30-50 tokens/sec

Dual GPU (48GB)

YAML
Hardware: 2x RTX 4090 or 2x A10G
Model: Llama 3.3 70B AWQ
Engine: vLLM with tensor_parallel_size=2
Throughput: ~40-60 tokens/sec

Production Cluster

YAML
Hardware: 4x A100 80GB or 4x H100
Model: Llama 3.3 70B FP16 or DeepSeek-V3
Engine: vLLM or TensorRT-LLM
Throughput: ~200+ tokens/sec

Fine-Tuning Open Models

When to Fine-Tune

Good candidates:

  • Domain-specific terminology/style
  • Consistent output format requirements
  • Task-specific performance improvement
  • Reducing prompt length

Skip fine-tuning when:

  • Few-shot prompting works well
  • Data is limited (<1000 examples)
  • Requirements change frequently
  • General knowledge is sufficient

Fine-Tuning Approaches

LoRA (Low-Rank Adaptation)

LoRA freezes the base model and adds small trainable matrices to attention layers. Instead of updating billions of parameters, you train ~0.1% of them. This makes fine-tuning feasible on consumer GPUs while preserving most of the base model's knowledge.

The key parameters: r controls the rank (higher = more parameters = more expressivity), lora_alpha is a scaling factor (typically 2x r), and target_modules specifies which layers to adapt.

Python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, lora_config)
# Trainable params: ~0.1% of total

Memory: ~16GB for 7B model, ~48GB for 70B model

QLoRA (Quantized LoRA)

QLoRA combines 4-bit quantization with LoRA, enabling 70B model fine-tuning on a single 24GB GPU. The base model is loaded in 4-bit precision while LoRA adapters train in full precision. The "nf4" quantization type is specifically designed for normally-distributed weights.

This is the practical choice for most teams: fine-tune large models without expensive hardware, then merge adapters with the base model for deployment.

Python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)
# Then apply LoRA

Memory: ~8GB for 7B model, ~24GB for 70B model

Fine-Tuning Frameworks

| Framework | Best For | Complexity |
|---|---|---|
| Axolotl | Production fine-tuning | Medium |
| LLaMA-Factory | Quick experiments | Low |
| HuggingFace TRL | Research, flexibility | Medium |
| Unsloth | Speed, memory efficiency | Low |
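
A minimal end-to-end sketch with HuggingFace TRL's SFTTrainer, combining a LoRA config like the one above with a local JSONL dataset. Exact argument names shift between TRL versions, and the dataset path and model ID here are placeholders:

Python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # passed as a string; TRL loads it
    train_dataset=dataset,
    peft_config=lora_config,
    args=SFTConfig(
        output_dir="llama-3.1-8b-lora",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
trainer.save_model("llama-3.1-8b-lora")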

Staying Current

Benchmarks to Monitor

| Benchmark | What It Measures | Where to Find |
|---|---|---|
| LMSYS Chatbot Arena | Human preference | chat.lmsys.org |
| Open LLM Leaderboard | Automated benchmarks | HuggingFace |
| HumanEval / MBPP | Code generation | Papers/releases |
| MATH / GSM8K | Mathematical reasoning | Papers/releases |
| MMLU | General knowledge | Papers/releases |

Information Sources

  • HuggingFace Model Hub: New model releases
  • r/LocalLLaMA: Community testing and discussion
  • Twitter/X: Researcher announcements
  • Model provider blogs: Official benchmarks
  • Papers With Code: Research tracking

Conclusion

Open-source LLMs have reached production maturity. The gap with proprietary models has narrowed from 17.5 to 0.3 percentage points—open-source now achieves 85-90% of frontier performance:

  1. Llama 4: MoE architecture, massive context (10M for Scout), best ecosystem—but EU-restricted
  2. Qwen3: Best multilingual (119 languages), hybrid reasoning, fully Apache 2.0, trained on 36T tokens
  3. DeepSeek V3.2: Reasoning-first for agents, V3.2-Speciale wins gold at IMO/IOI 2025
  4. Mistral Large 3: Strongest Apache 2.0 model outside China (675B MoE), 256K context, vision capabilities
  5. GPT-OSS-120B: OpenAI's return to open-source (Apache 2.0), most intelligent American open-weight model
  6. Kimi K2: Best value—1T params at $0.15/1M input; K2 Thinking beats GPT-5 and Claude 4.5
  7. GLM-4.7: Highest open-source SWE-bench (73.8%), $3/month, "preserved thinking" for coding
  8. MiniMax M2.1: Advanced interleaved thinking, 204K context, multilingual coding champion
  9. Grok 2.5: xAI's 270B open release, with Grok 3 coming soon
  10. Falcon H1: Hybrid Transformer+Mamba, 262K context, 34B matches 70B performance

For many applications, open-source is now the better choice for cost, privacy, and control.

Recommendation:

  • General use: Llama 4 Maverick, Qwen3-235B, Mistral Large 3, or GPT-OSS-120B
  • EU users: Qwen3, Mistral Large 3, GPT-OSS-120B, or GLM-4.7 (Llama 4 prohibited)
  • Reasoning: DeepSeek V3.2-Speciale, Kimi K2 Thinking, GPT-OSS-120B (high effort), or DeepSeek-R1-0528
  • Code: GLM-4.7 (73.8% SWE-bench), MiniMax M2.1, or Qwen3-Coder-480B
  • Agentic: Kimi K2 (65.8% SWE-bench, tool use), GPT-OSS-120B, MiniMax M2.1, or GLM-4.7
  • Vision: GLM-4.6V (native tool calling), Mistral Large 3, or Llama 4 Maverick
  • Long context: MiniMax-01 (4M tokens), Falcon H1-34B (262K), MiniMax-M1 (1M), or Llama 4 Scout (10M)
  • Edge/Efficient: Falcon H1 (0.5B-34B), Falcon 3 (1B-10B), GPT-OSS-20B, Llama 3.2 3B, or Qwen3-4B
  • Apache 2.0 only: Qwen3, Mistral Large 3, GPT-OSS, or Falcon


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
