Open-Source LLMs: The Complete 2025 Guide
A comprehensive guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.
The Open-Source Revolution
Open-source models have reached a critical inflection point. They now rival—and sometimes exceed—proprietary alternatives in many tasks while offering complete control over deployment, customization, and costs.
Why the gap closed so quickly: In 2023, open-source models were clearly inferior—GPT-4 seemed unreachable. Two factors changed this: (1) Training recipes became public. Llama 2's technical report, InstructGPT's RLHF paper, and countless fine-tuning guides democratized the "secret sauce." (2) Scale became accessible. Cloud providers offer H100 clusters at reasonable rates; training a 70B-class model costs on the order of $10-20M in rented compute. The frontier is still proprietary, but the gap is months, not years.
The strategic calculus has flipped: When open models were 2+ years behind, the choice was easy—use APIs for quality. Now the calculation is different. At 100K+ requests/day, self-hosting costs 10x less. For sensitive data, on-premises is the only option. For latency-critical applications, local inference wins. Open-source went from "not good enough" to "often the right choice."
From research: "Leading open-source models like Llama 4 Maverick, DeepSeek V3.2, Mistral Large 3, GPT-OSS-120B, Kimi K2 Thinking, and GLM-4.7 now match or exceed GPT-4o and Claude 3.5 Sonnet on most benchmarks, with some beating GPT-5 and Claude 4.5 on specific tasks."
2025 highlights:
- Llama 4 introduces MoE architecture with 10M context (Scout) and matches DeepSeek V3 at half the active parameters
- Qwen3 brings hybrid reasoning modes—"thinking" and "non-thinking"—under Apache 2.0, trained on 36T tokens in 119 languages
- DeepSeek V3.2 is reasoning-first for agents; V3.2-Speciale wins gold at IMO/IOI 2025
- Mistral Large 3 (675B MoE, 41B active) is the strongest Apache 2.0 model outside China with 256K context
- GPT-OSS-120B marks OpenAI's return to open-source—first open weights since GPT-2 (2019), Apache 2.0
- Kimi K2 (1T params, 32B active) from Moonshot AI sets new agentic benchmarks; K2 Thinking beats GPT-5 and Claude 4.5 Sonnet
- GLM-4.7 from Zhipu AI achieves the highest open-source SWE-bench Verified score (73.8%), with subscription access from $3/month
- Grok 2.5 from xAI open-sourced (270B), with Grok 3 expected to follow
- Falcon H1 (TII) brings hybrid Transformer+Mamba architecture with 262K context; 34B matches 70B models
This guide provides everything you need to choose, deploy, and optimize open-source LLMs for production use.
Why Open-Source Matters
The Strategic Case
| Benefit | Description | Impact |
|---|---|---|
| Cost at Scale | No per-token API fees | 10-100x cheaper at high volume |
| Data Privacy | Data never leaves your infrastructure | Regulatory compliance, IP protection |
| Customization | Full fine-tuning capability | Domain-specific optimization |
| Control | No vendor lock-in or API changes | Business continuity |
| Transparency | Inspect weights and behavior | Debugging, safety analysis |
| Latency | Local inference, no network round-trip | Real-time applications |
When to Use Open-Source
Strong fit:
- High-volume applications (>100K requests/day)
- Sensitive data that can't leave your infrastructure
- Need for fine-tuning on proprietary data
- Edge deployment or offline operation
- Cost-sensitive applications
- Applications requiring low latency (<100ms)
Consider proprietary APIs when:
- Prototyping and MVP development
- Variable, unpredictable load
- Need absolute frontier capabilities
- Limited ML engineering resources
- Rapid iteration more important than cost
Cost Comparison Example
For 1 million tokens/day (30M tokens/month):
| Option | Monthly Cost | Notes |
|---|---|---|
| GPT-4o API | ~$450 | $5/1M input, $15/1M output |
| Claude 3.5 Sonnet | ~$270 | $3/1M input, $15/1M output |
| Self-hosted Llama 70B | ~$150-300 | A100 cloud instance |
| Self-hosted on-prem | $50-100 | After hardware amortization |
At 10M tokens/day, self-hosting typically saves 5-10x.
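To sanity-check these numbers against your own workload, the break-even math is simple enough to script. A minimal sketch follows; the prices are illustrative placeholders, not vendor quotes.
# Rough API-vs-self-hosted break-even estimate (all prices are placeholders)
def monthly_api_cost(tokens_per_day: float, price_per_million: float) -> float:
    return tokens_per_day * 30 / 1_000_000 * price_per_million

def monthly_gpu_cost(hourly_rate: float, gpus: int = 1) -> float:
    return hourly_rate * gpus * 24 * 30   # assumes the instance runs around the clock

api = monthly_api_cost(10_000_000, price_per_million=15.0)   # 10M tokens/day at $15/1M
hosted = monthly_gpu_cost(hourly_rate=4.0)                   # one A100-class instance at ~$4/hr
print(f"API ~${api:,.0f}/mo vs self-hosted ~${hosted:,.0f}/mo")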
The Major Model Families
Llama (Meta)
Meta's Llama series has become the de facto standard for open-source LLMs, with the broadest ecosystem support.
Model Lineup
Llama 4 (April 2025): From Meta: "Llama 4 is our most intelligent model family, designed to enable people to build more personalized multimodal experiences."
The first Llama family using Mixture-of-Experts (MoE) architecture:
- Llama 4 Scout: 17B active / 109B total parameters (16 experts), 10M context window
- Llama 4 Maverick: 17B active / 400B total parameters (128 experts), 1M context window
- Llama 4 Behemoth: 288B active / ~2T total parameters (in training)
From Meta: "Scout is the best multimodal model in the world in its class, fitting on a single NVIDIA H100. Maverick beats GPT-4o and Gemini 2.0 Flash across benchmarks while achieving results comparable to DeepSeek V3 at less than half the active parameters."
⚠️ EU Restriction: Llama 4 is prohibited for users domiciled in the EU due to AI and data privacy regulations.
Llama 3.3 70B (December 2024): From Meta: "Offers performance comparable to the 405B parameter model at a fraction of the computational cost."
- 70B parameters
- 128K context window
- Instruction-tuned
- Multilingual support (8 languages)
Llama 3.2 (September 2024):
- Vision models: 11B and 90B with image understanding
- Edge models: 1B and 3B for mobile/edge deployment
- Multimodal capabilities integrated
Llama 3.1 (July 2024):
- 8B, 70B, and 405B parameter versions
- 405B was the largest open-weight model at release and remains the largest dense one
- 128K context window
- Improved multilingual and tool use
Benchmarks
| Benchmark | Llama 3.3 70B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMLU (5-shot) | 86.0 | 88.7 | 88.7 |
| HumanEval | 88.4 | 90.2 | 92.0 |
| GSM8K | 93.0 | 95.3 | 96.4 |
| MATH | 77.0 | 76.6 | 78.3 |
Strengths
- Ecosystem: Widest tool support (vLLM, llama.cpp, TensorRT-LLM, all major frameworks)
- Documentation: Best-documented training process in open-source
- Fine-tuning: Extensive fine-tuning ecosystem (Axolotl, LLaMA-Factory, TRL)
- Stability: Most battle-tested in production
- Community: Largest community, most available fine-tunes
Weaknesses
- License restrictions: 700M MAU limit for commercial use
- Multilingual: Good but not best-in-class (8 languages vs Qwen's 29+)
- Reasoning: Trails DeepSeek R1 on complex reasoning
License
Llama 3 Community License:
- Free for commercial use under 700M monthly active users
- Above threshold requires separate commercial license from Meta
- Allows modification and redistribution
- Requires attribution
Llama 4 License:
- Free for commercial use under 700M monthly active users
- EU users prohibited from using or distributing Llama 4 models
- Multimodal capabilities (Scout, Maverick) under same restrictions
Best For
- General-purpose chat applications
- Production deployments needing stability
- Teams wanting maximum ecosystem support
- RAG systems with good general knowledge
- Applications needing vision capabilities (Llama 3.2)
Qwen (Alibaba)
Alibaba's Qwen family excels in multilingual support and specialized variants, with fully permissive licensing.
Model Lineup
Qwen3 (April 2025): From Alibaba: "Qwen3 introduces hybrid reasoning—models can 'think' through complex problems or answer quickly, matching OpenAI's o3 reasoning capability."
- Qwen3-235B-A22B: 235B total, 22B active (MoE), 128K context
- Qwen3-32B: Dense model, 128K context
- Qwen3-14B, 8B, 4B, 1.7B, 0.6B: Smaller variants
- Trained on 36 trillion tokens in 119 languages
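The hybrid thinking mode described above is toggled per request through the chat template. A minimal sketch with Hugging Face Transformers follows; the enable_thinking flag reflects Qwen3's published usage, but check the model card for the variant you deploy.
# Qwen3 hybrid thinking: flip enable_thinking per request
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"   # smaller sibling used here for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are below 50?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # True: emit <think>...</think> reasoning; False: answer directly
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=512)[0], skip_special_tokens=True))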
Qwen3 Latest (Late 2025):
- Qwen3-2507: 1M token context support (August 2025)
- Qwen3-Max: Outperforms Claude 4 Opus non-thinking, DeepSeek V3.1 (September 2025)
- Qwen3-Next: Apache 2.0 licensed Instruct and Thinking models (September 2025)
- Qwen3-Coder-480B-A35B: 480B MoE with 35B active—most agentic code model
Qwen2.5 (September 2024):
- Qwen2.5-72B: Flagship dense model
- Qwen2.5-Coder: State-of-the-art code model
- Qwen2.5-Math: Mathematical reasoning specialist
- Trained on 29+ languages
Qwen2.5-VL (Vision-Language):
- 3B, 7B, 72B vision-language models
- Native image and video understanding
- Document/chart analysis
Benchmarks
| Benchmark | Qwen2.5-72B | Qwen3-235B | Llama 3.3 70B |
|---|---|---|---|
| MMLU | 86.1 | 88.2 | 86.0 |
| HumanEval | 86.6 | 89.5 | 88.4 |
| MATH | 83.1 | 86.7 | 77.0 |
| Multilingual Avg | 75.6 | 79.3 | 68.2 |
Strengths
- Multilingual: Best-in-class support for 29+ languages
- Specialized models: Coder, Math, VL variants
- MoE efficiency: 22B active parameters with 235B quality
- License: Fully Apache 2.0 (no restrictions)
- Asian languages: Best Chinese, Japanese, Korean support
Weaknesses
- Ecosystem: Smaller than Llama ecosystem
- Documentation: Less extensive than Meta's
- Community: Fewer available fine-tunes
License
Apache 2.0:
- Fully permissive commercial use
- No user limits
- Only obligation is retaining the license notice
- Can modify and redistribute freely
Best For
- Multilingual applications
- Coding assistants (Qwen2.5-Coder)
- Mathematical reasoning (Qwen2.5-Math)
- Asian market applications
- Teams wanting fully permissive licensing
- Vision-language applications (Qwen2.5-VL)
DeepSeek
Chinese AI lab that achieved breakthrough reasoning capabilities with innovative training approaches.
Model Lineup
DeepSeek-V3.2 (December 2025): From DeepSeek: "V3.2 is our reasoning-first model built for agents, performing comparably to GPT-5."
- 671B total parameters (37B active)
- DSA (DeepSeek Sparse Attention): Efficient attention for long-context scenarios
- First DeepSeek model to integrate thinking directly into tool-use
- Massive agent training: 1,800+ environments, 85k+ complex instructions
DeepSeek-V3.2-Speciale:
- High-compute variant surpassing GPT-5, on par with Gemini 3 Pro
- Gold medal in 2025 IMO and IOI, CMO, ICPC World Finals
- Designed exclusively for deep reasoning (no tool-calling)
- MIT License
DeepSeek-V3.1 (August 2025): From DeepSeek: "V3.1 combines the strengths of V3 and R1 into a single hybrid model with 'thinking' and 'non-thinking' modes."
- 671B total parameters (37B active)
- 128K context window
- Hybrid thinking mode: Switch between chain-of-thought reasoning (like R1) and direct answers (like V3) via chat template
- One model covers both general-purpose and reasoning-heavy use cases
DeepSeek-V3 (December 2024):
- 671B total parameters (MoE)
- 37B active parameters per token
- Pre-training cost: $5.6M (remarkably efficient)
- Strong general capabilities
DeepSeek-R1-0528 (May 2025): From DeepSeek: "R1-0528 features major improvements in inference and hallucination reduction, with performance approaching O3 and Gemini 2.5 Pro."
- Updated reasoning model with structured JSON output
- Built-in function-calling capabilities
- Improved math, code, and logic benchmarks
- 671B total, 37B active
DeepSeek-R1 (January 2025): From DeepSeek: "DeepSeek-R1 matches OpenAI-o1-0912 on math benchmarks with open weights."
- Reasoning model trained with pure RL (GRPO)
- Extended thinking capabilities
- Shows chain-of-thought reasoning in <think> tags
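Applications typically strip the <think> block before showing the final answer to users. A minimal, model-agnostic helper (it assumes the tags appear verbatim in the output text):
import re

def split_reasoning(text: str) -> tuple[str, str]:
    # Separate R1-style <think>...</think> reasoning from the final answer
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>49 is 7*7, so not prime...</think>There are 15 primes below 50.")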
DeepSeek-R1-Distill Series:
- Distilled versions at 1.5B, 7B, 8B, 14B, 32B, 70B
- Retain significant reasoning capability
- Run on consumer hardware
From research: "DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks."
DeepSeek-Coder-V2:
- 236B MoE (21B active)
- State-of-the-art code understanding
- 128K context window
Benchmarks
| Benchmark | DeepSeek-R1 | o1-preview | DeepSeek-V3 |
|---|---|---|---|
| AIME 2024 | 79.8% | 74.4% | 39.2% |
| MATH-500 | 97.3% | 96.4% | 90.2% |
| Codeforces | 2029 Elo | 1891 Elo | 1134 Elo |
| MMLU | 90.8 | 90.8 | 88.5 |
Strengths
- Reasoning: Best-in-class mathematical and logical reasoning
- Code: Excellent code generation and understanding
- Efficiency: Remarkably cost-effective training
- Transparency: Published detailed training methodology
- Distillation: High-quality smaller models available
Weaknesses
- Inference cost: Full R1 requires significant compute
- Latency: Reasoning models produce many tokens
- Language focus: Stronger in English/Chinese than other languages
License
DeepSeek License (MIT-like):
- Free for commercial use
- Some restrictions on specific use cases
- Open weights available
Best For
- Complex reasoning tasks
- Mathematical problem-solving
- Code generation and analysis
- Research and analysis applications
- Teams willing to trade latency for accuracy
Mistral
French AI company known for efficient, high-quality models with strong European presence. With Mistral 3, they've released the strongest fully open-weight model (Apache 2.0) developed outside China.
Model Lineup
Mistral Large 3 (December 2025): From Mistral: "Mistral Large 3 is our most capable model to date—a sparse mixture-of-experts trained with 41B active and 675B total parameters."
- 675B total parameters (MoE), 41B active
- 256K context window
- Native vision capabilities (image analysis)
- Multilingual: English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic
- Best-in-class agentic capabilities with native function calling and JSON output
- Apache 2.0 license (fully open)
- Trained from scratch on 3000 NVIDIA H200 GPUs
- Top open-source coding model on LMArena leaderboard
Mistral 3 Small Models (December 2025):
- Mistral 3 14B, 8B, 3B dense models
- All Apache 2.0 licensed
- Optimized for efficiency
Mistral Large 2 (July 2024):
- 123B parameters
- 128K context window
- Strong function calling
- Multilingual (dozens of languages)
Mistral Small 3 (January 2025): From Mistral: "Mistral Small 3 (24B) achieves state-of-the-art capabilities comparable to larger models."
- 24B parameters
- Optimized for efficiency
- Strong reasoning for size
Mixtral 8x22B:
- MoE architecture: 141B total, 39B active
- Strong general performance
- Good efficiency
Codestral (May 2024):
- 22B parameters
- Specialized for code
- 32K context window
Ministral (October 2024):
- 3B and 8B models
- Edge deployment focus
- Strong for size
Benchmarks
| Benchmark | Mistral Large 3 | Mistral Large 2 | Mistral Small 3 |
|---|---|---|---|
| MMLU | 88.2 | 84.0 | 81.0 |
| HumanEval | 94.3 | 92.1 | 84.5 |
| MATH | 82.1 | 76.5 | 69.4 |
| GSM8K | 95.8 | 91.0 | 86.2 |
Strengths
- Apache 2.0: Mistral Large 3 and Mistral 3 models are fully open (no restrictions)
- Function calling: Best-in-class structured output and agentic capabilities
- Efficiency: 41B active parameters with 675B-class quality
- European: GDPR considerations, EU-based, no regional restrictions
- Vision: Native multimodal capabilities in Large 3
- Production-ready: Focus on deployment, NVFP4 checkpoint available
Weaknesses
- Hardware requirements: Large 3 needs 8×H100 or Blackwell NVL72 for full precision
- Smaller ecosystem: Less community activity than Llama
- Legacy licensing: Older models (Codestral) have commercial restrictions
License
Mistral 3 Family: Apache 2.0 (fully permissive)
- Mistral Large 3
- Mistral 3 14B, 8B, 3B
Legacy models:
- Apache 2.0: Mistral 7B, Mixtral
- Commercial: Codestral, older Large versions
Best For
- European deployments (GDPR compliance, no regional restrictions)
- Function calling / structured output / agentic applications
- Teams wanting frontier performance with Apache 2.0 license
- Production systems needing vision + text capabilities
- Organizations requiring fully open weights outside China
Other Notable Models
Kimi K2 (Moonshot AI)
State-of-the-art agentic model from Beijing startup (Alibaba-backed). One of the most cost-effective frontier models available.
Kimi K2 (July 2025): From Moonshot: "Kimi K2 is designed for tool use, reasoning, and autonomous problem-solving."
- 1 trillion total parameters, 32B active (MoE)
- State-of-the-art among non-thinking models for knowledge, math, coding
- Trained with MuonClip optimizer on 15.5T tokens with zero instability
- Pre-trained at unprecedented 1T scale with novel optimization techniques
- SWE-bench Verified: 65.8% pass@1 (single-attempt patches)
- SWE-bench Multilingual: 47.3% pass@1
- Surpasses Claude Opus 4 on two benchmarks
- Better overall performance than GPT-4.1 on coding metrics
- Modified MIT license
- Pricing: $0.15/1M input, $2.50/1M output tokens (input roughly 100x cheaper than Claude Opus 4)
Kimi K2 Thinking (November 2025): From Moonshot: "The most powerful open source thinking model in the Kimi series to date."
- Outperforms GPT-5, Claude Sonnet 4.5 (Thinking), and Grok-4 on reasoning, coding, and agentic benchmarks
- Executes 200-300 sequential tool calls autonomously to complete long-horizon tasks
- Training cost: $4.6M
- Fully open-source despite beating proprietary competitors
- Top position in reasoning and coding evaluations
Model Variants:
- Kimi-K2-Base: Foundation model for researchers wanting full control for fine-tuning
- Kimi-K2-Instruct: Post-trained model for general-purpose chat and agentic experiences
GLM-4 Series (Zhipu AI / Z.ai)
China's answer to Claude Code, open-sourced from Tsinghua University spinoff. Rapidly evolving series with strong coding and vision capabilities.
GLM-4.7 (December 22, 2025): From Z.ai: "GLM-4.7 achieves the highest SWE-bench Verified score among open-source models at 73.8%."
- ~400B parameters, 200K context, 128K output
- Open-weight 32B and 9B variants (base, reasoning, rumination)
- SWE-bench Verified: 73.8% (highest open-source)
- LiveCodeBench: 84.9% (beats Claude Sonnet 4.5)
- AIME 2025: 95.7%
- HLE (Humanity's Last Exam): 42.8% (outperforms GPT-5.1)
- Code Arena: Rank #1 among open-source and Chinese models
- "Preserved Thinking": Maintains reasoning chains across multiple turns (addresses biggest frustration in AI-assisted coding)
- $3/month or free locally via HuggingFace/ModelScope with vLLM or SGLang
Agentic capabilities:
- BrowseComp: 67.5 (web tasks)
- τ²-Bench: 87.4 (interactive tool use)—new open-source SOTA, surpasses Claude Sonnet 4.5
GLM-4.6V (December 2025): From Z.ai: "A 128K context vision-language model with native tool calling."
- GLM-4.6V: 106B parameters for cloud-scale inference
- GLM-4.6V-Flash: 9B parameters for low-latency, local applications
- 128K context window
- Native multimodal function calling: Images, screenshots, and document pages pass directly as tool parameters
- Tools can return visual outputs (search grids, charts, web pages, product images)
- Model fuses visual outputs with text in the same reasoning chain
- Optimized for frontend automation and multimodal reasoning
GLM-4.5 (Mid-2025):
- 355B total parameters, 32B active (MoE)
- Supports reasoning, tool use, coding, and agentic behaviors
- GLM-4.5-Air: Smaller sibling for efficiency
- Fast despite large parameter count due to MoE
MiniMax
Chinese startup with innovative attention mechanisms. Pioneering hybrid-attention and long-context reasoning.
MiniMax M2.1 (December 25, 2025): From MiniMax: "Significantly enhanced multi-language programming, built for real-world complex tasks."
- Sparse MoE architecture with 10B active parameters
- 204,800 token context window
- First open-source model with Advanced Interleaved Thinking (separates reasoning from response)
- Multi-language programming: Rust, Java, Go, C++, Kotlin, Objective-C, TypeScript, JavaScript
- SWE-multilingual: 72.5%
- VIBE aggregate: 88.6 (VIBE-Web: 91.5, VIBE-Android: 89.7)
- Outperforms Claude Sonnet 4.5, approaches Claude Opus 4.5 in multilingual scenarios
- ~10% the price of Claude Sonnet
- 90 tokens/sec on RTX 5090 with vLLM and FP8 weights
- Runs on 4×A100 GPUs
- Modified-MIT license, weights on HuggingFace
- Inference: SGLang or vLLM (temp=1.0, top_p=0.95, top_k=40)
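Those recommended sampling settings map directly onto a vLLM call. A minimal sketch; the repository id below is a placeholder, so check MiniMax's HuggingFace page for the exact name.
# Serving MiniMax M2.1 with the vendor-recommended sampling settings
from vllm import LLM, SamplingParams

llm = LLM(model="MiniMaxAI/MiniMax-M2.1", tensor_parallel_size=4)   # placeholder repo id; 4 GPUs per the note above
params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=2048)
outputs = llm.generate(["Write a Rust function that parses a CSV line."], params)
print(outputs[0].outputs[0].text)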
MiniMax-M1 (June 2025):
- World's first large-scale hybrid-attention reasoning model
- 1M token context (8x DeepSeek R1)
- 80K token reasoning output
- CISPO: RL algorithm 2x faster than DAPO
MiniMax-01:
- 456B total parameters (45.9B active)
- 4M token context (20-32x other models)
Google Gemma
Open version of Gemini technology:
- Gemma 2 (2B, 9B, 27B)
- Strong for size class
- Good efficiency
- Research-friendly license
Microsoft Phi
Quality-focused small models:
- Phi-4 (14B): "Competitive with much larger models"
- Trained primarily on synthetic data
- Excellent for size
- MIT license
HuggingFace SmolLM
Fully transparent small models:
- SmolLM2 (135M, 360M, 1.7B)
- Complete training transparency
- Research-friendly
- Apache 2.0 license
Cohere Command R
Enterprise-focused:
- Command R+ (104B)
- Strong RAG capabilities
- Enterprise features
- Commercial focus
GPT-OSS (OpenAI)
OpenAI's first open-weight release since GPT-2 (2019). A landmark moment for open-source AI.
GPT-OSS-120B (August 2025): From OpenAI: "State-of-the-art open-weight language models that deliver strong real-world performance at low cost."
- 117B total parameters, 5.1B active (MoE)
- Fits on a single 80GB GPU (H100 or AMD MI300X)
- Trained using large-scale distillation and reinforcement learning
- Three reasoning effort levels: low, medium, high (trade-off latency vs performance)
- Full chain-of-thought access for debugging
- Native agentic capabilities: function calling, web browsing, Python execution, structured outputs
- Outperforms o3-mini, matches/exceeds o4-mini on MMLU (90%), GPQA (~80%), AIME 2024/2025
- Apache 2.0 license (fully permissive)
- Available on HuggingFace, GitHub, LM Studio, Ollama
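A minimal local test with vLLM, assuming a recent build with gpt-oss support and a single 80GB GPU; openai/gpt-oss-120b is the HuggingFace repository id.
# Loading GPT-OSS-120B locally (single 80GB GPU per OpenAI's guidance)
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b")
out = llm.generate(["Explain MoE routing in two sentences."], SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)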
GPT-OSS-20B:
- Smaller variant for edge/efficiency use cases
- Same Apache 2.0 license
From Artificial Analysis: "The most intelligent American open weights model."
Grok (xAI)
Elon Musk's xAI open-source releases:
Grok 2.5 (August 2025):
- ~270B parameters
- Trained on text-based reasoning tasks
- Model weights (~500 GB across 42 files) + tokenizer
- Custom license: Grok 2 Community License Agreement
- Allows commercial and non-commercial use
- Restriction: Cannot use to train other AI models
- Revocable license (less permissive than Apache 2.0)
- Grok 3 expected to follow in ~6 months
Grok-1 (March 2024):
- 314B parameter MoE
- Apache 2.0 license (fully permissive)
- Historical significance: Early frontier open-source release
Falcon 3 & H1 (TII)
UAE's Technology Innovation Institute. Focus on efficient models that run on consumer hardware.
Falcon 3 (December 2024):
- Model sizes: 1B, 3B, 7B, 10B (Base and Instruct variants)
- Trained on 14 trillion tokens (2.5x predecessor)
- 32K context (8K for 1B)
- #1 on HuggingFace leaderboard at launch (for size class)
- Beats Llama-3.1-8B, Qwen2.5-7B, Mistral NeMo-12B, Gemma2-9B
- TII Falcon License (Apache 2.0-based, permissive)
- Runs on laptops and light infrastructure
Falcon H1 (2025): From TII: "A family of hybrid-head language models redefining efficiency and performance."
- Model sizes: 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, 34B (+ Instruct variants)
- Hybrid architecture: Transformer attention + State Space Model (Mamba-2)
- 262K context window
- 18 languages: English, Chinese, Japanese, Korean, German, French, Spanish, etc.
- Falcon-H1-34B matches/outperforms 70B models (Qwen3-32B, Qwen2.5-72B, Llama3.3-70B)
- Falcon-H1-1.5B-Deep rivals current 7B-10B models
- Falcon-H1-0.5B performs like 2024's 7B models
- Native support in llama.cpp, axolotl, llama-factory, unsloth
- 55+ million downloads to date
- Permissive open-source license
Comprehensive Benchmark Comparison
General Capabilities
| Model | MMLU | HumanEval | GSM8K | MATH | HellaSwag |
|---|---|---|---|---|---|
| Llama 3.3 70B | 86.0 | 88.4 | 93.0 | 77.0 | 88.6 |
| Qwen2.5-72B | 86.1 | 86.6 | 91.6 | 83.1 | 88.5 |
| DeepSeek-V3 | 88.5 | 82.6 | 89.3 | 90.2 | 88.9 |
| Mistral Large 2 | 84.0 | 92.1 | 91.0 | 76.5 | 87.1 |
| DeepSeek-R1 | 90.8 | 86.7 | 97.3 | 97.3 | - |
| GPT-4o (reference) | 88.7 | 90.2 | 95.3 | 76.6 | 95.3 |
| Claude 3.5 Sonnet | 88.7 | 92.0 | 96.4 | 78.3 | 89.0 |
Coding Benchmarks
| Model | HumanEval | MBPP | LiveCodeBench | SWE-bench |
|---|---|---|---|---|
| DeepSeek-Coder-V2 | 90.2 | 80.4 | 43.4 | 22.0 |
| Qwen2.5-Coder-32B | 92.7 | 83.2 | 42.1 | 23.5 |
| Codestral | 81.1 | 78.2 | 36.2 | 18.3 |
| Llama 3.3 70B | 88.4 | 75.6 | 33.8 | 16.4 |
Reasoning Benchmarks
| Model | AIME 2024 | MATH-500 | GPQA | ARC-C |
|---|---|---|---|---|
| DeepSeek-R1 | 79.8% | 97.3% | 71.5 | 92.3 |
| Qwen3-235B | 54.7% | 86.7% | 59.1 | 89.4 |
| Llama 3.3 70B | 33.3% | 77.0% | 50.7 | 88.1 |
| o1-preview (reference) | 74.4% | 96.4% | 73.3 | - |
Chatbot Arena Rankings (Human Preference)
From LMSYS Chatbot Arena / LMArena (Late 2025). Rankings use 6M+ human votes with Elo-like scoring:
| Rank | Model | Arena Score | Notes |
|---|---|---|---|
| 1 | Claude Opus 4.5 | ~1450 | Proprietary frontier |
| 2 | GPT-5.2 | ~1440 | Proprietary frontier |
| 3 | Gemini 3 Pro | ~1430 | Proprietary frontier |
| 4 | Llama 4 Maverick | 1417 | Open-source |
| 5 | DeepSeek-V3.2 | ~1400 | Open-source |
| 6 | Mistral Large 3 | ~1395 | Open-source, top OS coding |
| 7 | Kimi K2 Thinking | ~1390 | Open-source |
| 8 | Qwen3-235B | ~1385 | Open-source |
| 9 | GLM-4.7 | ~1380 | Open-source |
Key insight: The gap between proprietary and open-source models has narrowed from 17.5 to just 0.3 percentage points on MMLU. Open-source models now achieve 85-90% of frontier performance.
Hardware Requirements
Memory Calculation
LLM memory requirements follow this formula:
Memory (GB) ≈ Parameters (B) × Bytes per Parameter
FP32: 4 bytes/param → 7B model = 28 GB
FP16: 2 bytes/param → 7B model = 14 GB
INT8: 1 byte/param → 7B model = 7 GB
INT4: 0.5 bytes/param → 7B model = 3.5 GB
Add ~20% overhead for KV cache and runtime.
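As a quick sanity check, the formula is easy to turn into a helper; the 20% overhead figure is the rule of thumb above, not a measured constant.
def estimate_memory_gb(params_billion: float, bytes_per_param: float, overhead: float = 0.20) -> float:
    # Weights-only footprint plus ~20% for KV cache and runtime buffers
    return params_billion * bytes_per_param * (1 + overhead)

print(estimate_memory_gb(70, 0.5))   # 70B at INT4 -> ~42 GB, in line with the table below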
Detailed Memory Requirements
| Model Size | FP16 | INT8 | INT4 (AWQ/GPTQ) | GGUF Q4_K_M |
|---|---|---|---|---|
| 1.5B | 3 GB | 1.5 GB | 1 GB | 1.1 GB |
| 7B | 14 GB | 7 GB | 4 GB | 4.4 GB |
| 8B | 16 GB | 8 GB | 4.5 GB | 5 GB |
| 13B | 26 GB | 13 GB | 7 GB | 8 GB |
| 32B | 64 GB | 32 GB | 16 GB | 19 GB |
| 70B | 140 GB | 70 GB | 35 GB | 40 GB |
| 405B | 810 GB | 405 GB | 203 GB | 230 GB |
GPU Recommendations by Use Case
Consumer Hardware
| GPU | VRAM | Can Run | Typical Use |
|---|---|---|---|
| RTX 3060 | 12 GB | 7B INT4, 3B FP16 | Development, small models |
| RTX 3090 | 24 GB | 13B INT4, 7B FP16 | Development, medium models |
| RTX 4090 | 24 GB | 32B INT4, 13B INT8 | Production-capable |
| 2x RTX 4090 | 48 GB | 70B INT4 | Production workloads |
Professional/Enterprise
| GPU | VRAM | Can Run | Typical Use |
|---|---|---|---|
| A10G | 24 GB | 32B INT4, 13B FP16 | Cloud inference |
| L40S | 48 GB | 70B INT4, 32B FP16 | Production inference |
| A100 40GB | 40 GB | 70B INT4, 32B FP16 | Training and inference |
| A100 80GB | 80 GB | 70B FP16 | Training and inference |
| H100 80GB | 80 GB | 70B FP16, optimized | High-throughput |
| 8x H100 | 640 GB | 405B FP16 | Frontier models |
From research: "With a 24 GB GPU (e.g., RTX 3090/4090), you can comfortably run 4-bit quantized versions of models up to ~40B parameters."
From research: "With Pro/Enterprise GPU (e.g., 48-80GB VRAM), you can comfortably run 70B models like Llama 3.1 and Qwen2 72B."
Apple Silicon
| Chip | Unified Memory | Can Run | Notes |
|---|---|---|---|
| M1/M2 (8GB) | 8 GB | 3B-7B Q4 | Basic use |
| M1/M2 Pro (16GB) | 16 GB | 13B Q4, 7B Q8 | Good development |
| M1/M2 Max (32GB) | 32 GB | 32B Q4, 13B Q8 | Serious development |
| M1/M2 Ultra (64GB) | 64 GB | 70B Q4 | Production-capable |
| M3 Max (128GB) | 128 GB | 70B Q8, 405B Q2 | High-end |
Deployment Options
Local Inference Engines
Ollama
Simplest local deployment:
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Run models
ollama pull llama3.3:70b
ollama run llama3.3:70b "Explain transformers"
# API access
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b",
"prompt": "Explain transformers"
}'
Pros: Simplest setup, automatic model management, good defaults. Cons: Less optimization control, Mac/Linux focus
llama.cpp
Maximum performance and control:
# Build (CMake is the supported build system; enable CUDA for NVIDIA GPUs)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Run
./build/bin/llama-cli \
-m models/llama-3.3-70b-instruct-q4_k_m.gguf \
-p "Explain transformers" \
-n 500 \
--temp 0.7 \
-ngl 99 # offload all layers to GPU
Pros: Best CPU performance, extensive quantization options, cross-platform. Cons: Manual model management, more complex setup
vLLM
Production serving with high throughput:
from vllm import LLM, SamplingParams
# Single GPU
llm = LLM(
model="meta-llama/Llama-3.3-70B-Instruct",
quantization="awq",
gpu_memory_utilization=0.9,
)
# Multi-GPU
llm = LLM(
model="meta-llama/Llama-3.3-70B-Instruct",
tensor_parallel_size=2, # 2 GPUs
quantization="awq",
)
# Generate
outputs = llm.generate(
["Explain transformers"],
SamplingParams(temperature=0.7, max_tokens=500)
)
Pros: Highest throughput, PagedAttention, production-ready. Cons: GPU-only, requires more resources
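vLLM also exposes an OpenAI-compatible HTTP server, which is how most production deployments consume it. A minimal sketch; the server flags mirror the Python arguments above.
# Start the server separately with:
#   vllm serve meta-llama/Llama-3.3-70B-Instruct --quantization awq --tensor-parallel-size 2
# then point any OpenAI-compatible client at it:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers"}],
    max_tokens=500,
)
print(resp.choices[0].message.content)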
TensorRT-LLM
Maximum NVIDIA performance:
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
outputs = llm.generate(
["Explain transformers"],
SamplingParams(max_new_tokens=500)
)
Pros: Best NVIDIA performance, optimized kernels. Cons: NVIDIA-only, complex setup
Comparison of Inference Engines
| Engine | Best For | Throughput | Ease of Use |
|---|---|---|---|
| Ollama | Local development | Medium | Very Easy |
| llama.cpp | CPU, Apple Silicon | Medium | Medium |
| vLLM | Production serving | High | Medium |
| TensorRT-LLM | Max NVIDIA perf | Highest | Complex |
| HF Transformers | Flexibility | Low | Easy |
Cloud Deployment
Inference Providers
| Provider | Specialty | Pricing Model |
|---|---|---|
| Together AI | Wide model selection | Per-token |
| Fireworks AI | Speed, function calling | Per-token |
| Anyscale | Scalability | Per-token |
| Replicate | Ease of use | Per-second |
| Modal | Serverless | Per-second |
| Groq | Speed (LPU) | Per-token |
Self-Hosted Cloud
AWS:
# SageMaker endpoint
Instance: ml.g5.12xlarge (4x A10G)
Model: Llama 3.3 70B AWQ
Throughput: ~50 tokens/sec
Cost: ~$7/hour
GCP:
# Vertex AI
Instance: a2-highgpu-1g (A100 40GB)
Model: Qwen2.5-72B GPTQ
Throughput: ~60 tokens/sec
Cost: ~$4/hour
Quantization Formats
| Format | Tool | Precision | Best For |
|---|---|---|---|
| GGUF | llama.cpp | Q2-Q8 | CPU, Apple Silicon, flexibility |
| AWQ | vLLM, TGI | INT4 | NVIDIA GPUs, production |
| GPTQ | vLLM, TGI | INT4 | NVIDIA GPUs, production |
| GGML | Legacy | Q4-Q8 | Deprecated, use GGUF |
| bitsandbytes | HuggingFace | INT8/INT4 | Quick prototyping |
| ExLlamaV2 | Custom | Various | Optimized inference |
Quantization Quality Comparison
Impact on Llama 3.1 70B quality:
| Quantization | MMLU Impact | Speed Gain | Memory Reduction |
|---|---|---|---|
| FP16 (baseline) | 0% | 1x | 0% |
| INT8 | -0.5% | 1.5x | 50% |
| AWQ INT4 | -1.2% | 2x | 75% |
| GPTQ INT4 | -1.5% | 2x | 75% |
| GGUF Q4_K_M | -1.8% | 2.5x | 75% |
| GGUF Q2_K | -5.0% | 3x | 87% |
Licensing Deep Dive
Fully Permissive (Apache 2.0 / MIT)
Apache 2.0:
- Qwen (all models)
- Mistral 3 family (Large 3, 14B, 8B, 3B), plus Mistral 7B and Mixtral
- GPT-OSS (120B and 20B)
- SmolLM
- Gemma (strictly its own Gemma Terms of Use, but broadly permissive)
- Phi (MIT)
What you can do:
- Use commercially without limits
- Modify and create derivatives
- Redistribute freely (retaining the license notice)
- No revenue sharing required
Permissive with Limits
Llama 3 Community License:
- Free for companies with <700M monthly active users
- Must request license above threshold
- Can modify and redistribute
- Must include attribution
DeepSeek License:
- Generally permissive
- Some restrictions on specific use cases
- Review terms for your use case
Research / Non-Commercial
Some models (check individual licenses):
- May restrict commercial use
- May require academic attribution
- May have geographic restrictions
License Selection Guide
| Use Case | Recommended Models |
|---|---|
| Startup/SMB | Qwen (Apache 2.0), Llama (under 700M MAU) |
| Enterprise | Qwen, licensed Llama, Mistral |
| Research | Any (check publication requirements) |
| Products in Asia | Qwen (best Asian language support) |
| EU compliance focus | Mistral (European company) |
Practical Recommendations
By Use Case
General Chat / Customer Service
Recommended: Llama 3.3 70B or Qwen2.5-72B
For general-purpose chat, you need models with strong instruction following and broad knowledge. AWQ quantization reduces memory 4x with minimal quality loss, enabling 70B models on 2 GPUs.
# vLLM serving
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.3-70B-Instruct",
quantization="awq",
tensor_parallel_size=2,
)
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=1024,
top_p=0.9,
)
user_message = "How do I update my shipping address?"  # example user turn
response = llm.generate([user_message], sampling_params)
Why: Best general capabilities, strong instruction following, extensive testing.
Coding Assistant
Recommended: DeepSeek-Coder-V2 or Qwen2.5-Coder-32B
# With fill-in-the-middle
prompt = f"""<fim_prefix>{code_before}<fim_suffix>{code_after}<fim_middle>"""
response = llm.generate(prompt, SamplingParams(
temperature=0.2, # Lower for code
max_tokens=512,
))
Why: Best code understanding, completion, and generation.
Complex Reasoning / Math
Recommended: DeepSeek-R1 or R1-Distill-32B
# Allow extended thinking
response = llm.generate(
"Solve this step by step: [problem]",
SamplingParams(
temperature=0.6,
max_tokens=8192, # Allow long reasoning
)
)
Why: Best-in-class reasoning, shows work, highest accuracy on complex problems.
Multilingual Applications
Recommended: Qwen2.5-72B
Why: 29+ language support, best non-English performance.
Edge / Mobile Deployment
Recommended: Llama 3.2 3B or Phi-4-mini
# llama.cpp on mobile
./llama-cli \
-m llama-3.2-3b-instruct-q4_k_m.gguf \
-ngl 0 \
-t 4 \
-c 2048
# -ngl 0 keeps everything on CPU, -t 4 uses four threads, -c 2048 trims context to save memory
Why: Best quality at small size, optimized for resource constraints.
RAG / Knowledge Base
Recommended: Llama 3.3 70B with good retrieval
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
# Use open embeddings too
embeddings = HuggingFaceEmbeddings(
model_name="BAAI/bge-large-en-v1.5"
)
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
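To complete the loop, retrieved chunks get stuffed into the prompt before generation. A minimal sketch continuing from the retriever above; the prompt wording is illustrative, and llm is a vLLM instance as in the serving example earlier.
from vllm import SamplingParams

question = "What is our refund policy for annual plans?"      # example query
docs = retriever.invoke(question)                              # retriever defined above
context = "\n\n".join(d.page_content for d in docs)
prompt = (
    "Answer using only the context below. If the answer is not there, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
response = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=512))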
Why: Strong instruction following, good at synthesizing retrieved context.
Architecture Recommendations
Single GPU (24GB)
Hardware: RTX 4090 or L40S
Model: Qwen2.5-32B-AWQ or Llama 3.1 8B FP16
Engine: vLLM or llama.cpp
Throughput: ~30-50 tokens/sec
Dual GPU (48GB)
Hardware: 2x RTX 4090 or 2x A10G
Model: Llama 3.3 70B AWQ
Engine: vLLM with tensor_parallel_size=2
Throughput: ~40-60 tokens/sec
Production Cluster
Hardware: 4x A100 80GB or 4x H100
Model: Llama 3.3 70B FP16 or DeepSeek-V3
Engine: vLLM or TensorRT-LLM
Throughput: ~200+ tokens/sec
Fine-Tuning Open Models
When to Fine-Tune
Good candidates:
- Domain-specific terminology/style
- Consistent output format requirements
- Task-specific performance improvement
- Reducing prompt length
Skip fine-tuning when:
- Few-shot prompting works well
- Data is limited (<1000 examples)
- Requirements change frequently
- General knowledge is sufficient
Fine-Tuning Approaches
LoRA (Low-Rank Adaptation)
LoRA freezes the base model and adds small trainable matrices to attention layers. Instead of updating billions of parameters, you train ~0.1% of them. This makes fine-tuning feasible on consumer GPUs while preserving most of the base model's knowledge.
The key parameters: r controls the rank (higher = more parameters = more expressivity), lora_alpha is a scaling factor (typically 2x r), and target_modules specifies which layers to adapt.
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)
# Trainable params: ~0.1% of total
Memory: ~16GB for a 7B model; a 70B model still needs its full 16-bit weights resident (~140GB+), so use multiple GPUs or QLoRA
QLoRA (Quantized LoRA)
QLoRA combines 4-bit quantization with LoRA, enabling 30B-class fine-tuning on a single 24GB GPU and 65-70B-class on a single 48GB GPU. The base model is loaded in 4-bit precision while LoRA adapters train in full precision. The "nf4" quantization type is specifically designed for normally-distributed weights.
This is the practical choice for most teams: fine-tune large models without expensive hardware, then merge adapters with the base model for deployment.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model_name = "meta-llama/Llama-3.3-70B-Instruct"  # example base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)
# Then apply LoRA as in the previous section
Memory: ~8GB for a 7B model, ~48GB for a 70B model
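After training, adapters are usually merged back into the base weights for serving. A minimal sketch with PEFT; the paths and model id are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "./lora-adapter").merge_and_unload()  # fold LoRA into base weights
merged.save_pretrained("./merged-model")  # deployable with vLLM, or convert to GGUF for llama.cpp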
Fine-Tuning Frameworks
| Framework | Best For | Complexity |
|---|---|---|
| Axolotl | Production fine-tuning | Medium |
| LLaMA-Factory | Quick experiments | Low |
| HuggingFace TRL | Research, flexibility | Medium |
| Unsloth | Speed, memory efficiency | Low |
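For teams reaching straight for HuggingFace TRL, a minimal LoRA SFT run looks roughly like the sketch below; the dataset path and hyperparameters are placeholders, and the SFTTrainer/SFTConfig API has shifted across TRL versions, so check your installed release.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # expects a "text" column
trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(output_dir="./sft-out", per_device_train_batch_size=2, num_train_epochs=1),
)
trainer.train()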
Staying Current
Benchmarks to Monitor
| Benchmark | What It Measures | Where to Find |
|---|---|---|
| LMSYS Chatbot Arena | Human preference | lmarena.ai (formerly chat.lmsys.org) |
| Open LLM Leaderboard | Automated benchmarks | HuggingFace |
| HumanEval / MBPP | Code generation | Papers/releases |
| MATH / GSM8K | Mathematical reasoning | Papers/releases |
| MMLU | General knowledge | Papers/releases |
Information Sources
- HuggingFace Model Hub: New model releases
- r/LocalLLaMA: Community testing and discussion
- Twitter/X: Researcher announcements
- Model provider blogs: Official benchmarks
- Papers With Code: Research tracking
Conclusion
Open-source LLMs have reached production maturity. The gap with proprietary models has narrowed from 17.5 to 0.3 percentage points—open-source now achieves 85-90% of frontier performance:
- Llama 4: MoE architecture, massive context (10M for Scout), best ecosystem—but EU-restricted
- Qwen3: Best multilingual (119 languages), hybrid reasoning, fully Apache 2.0, trained on 36T tokens
- DeepSeek V3.2: Reasoning-first for agents, V3.2-Speciale wins gold at IMO/IOI 2025
- Mistral Large 3: Strongest Apache 2.0 model outside China (675B MoE), 256K context, vision capabilities
- GPT-OSS-120B: OpenAI's return to open-source (Apache 2.0), most intelligent American open-weight model
- Kimi K2: Best value—1T params at $0.15/1M input; K2 Thinking beats GPT-5 and Claude 4.5
- GLM-4.7: Highest open-source SWE-bench (73.8%), $3/month, "preserved thinking" for coding
- MiniMax M2.1: Advanced interleaved thinking, 204K context, multilingual coding champion
- Grok 2.5: xAI's 270B open release, with Grok 3 coming soon
- Falcon H1: Hybrid Transformer+Mamba, 262K context, 34B matches 70B performance
For many applications, open-source is now the better choice for cost, privacy, and control.
Recommendation:
- General use: Llama 4 Maverick, Qwen3-235B, Mistral Large 3, or GPT-OSS-120B
- EU users: Qwen3, Mistral Large 3, GPT-OSS-120B, or GLM-4.7 (Llama 4 prohibited)
- Reasoning: DeepSeek V3.2-Speciale, Kimi K2 Thinking, GPT-OSS-120B (high effort), or DeepSeek-R1-0528
- Code: GLM-4.7 (73.8% SWE-bench), MiniMax M2.1, or Qwen3-Coder-480B
- Agentic: Kimi K2 (65.8% SWE-bench, tool use), GPT-OSS-120B, MiniMax M2.1, or GLM-4.7
- Vision: GLM-4.6V (native tool calling), Mistral Large 3, or Llama 4 Maverick
- Long context: MiniMax-01 (4M tokens), Falcon H1-34B (262K), MiniMax-M1 (1M), or Llama 4 Scout (10M)
- Edge/Efficient: Falcon H1 (0.5B-34B), Falcon 3 (1B-10B), GPT-OSS-20B, Llama 3.2 3B, or Qwen3-4B
- Apache 2.0 only: Qwen3, Mistral Large 3, GPT-OSS, or Falcon
Related Articles
LLM Inference Optimization: From Quantization to Speculative Decoding
A comprehensive guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.
Small Language Models: Edge Deployment and Knowledge Distillation
The rise of Small Language Models (SLMs)—from Llama 3.2 to Phi-4 to Qwen 2.5. Understanding knowledge distillation, quantization, and deploying AI at the edge.
Fine-Tuning vs Prompting: When to Use Each
A practical guide to deciding between fine-tuning and prompt engineering for your LLM application, based on real-world experience with both approaches.