
Small Language Models: Edge Deployment and Knowledge Distillation

The rise of Small Language Models (SLMs)—from Llama 3.2 to Phi-4 to Qwen 2.5. Understanding knowledge distillation, quantization, and deploying AI at the edge.


The SLM Revolution

While frontier models grow to hundreds of billions of parameters, a counter-trend has emerged: Small Language Models (SLMs) that run efficiently on edge devices while retaining most capabilities.

From research: "The global SLM market, valued at USD 0.93 billion in 2025, is projected to reach USD 5.45 billion by 2032, with a CAGR of 28.7%."

The business case is compelling: "Enterprises deploying SLMs reported an average 73% cost reduction compared to equivalent LLM implementations while maintaining 90%+ of functionality for targeted use cases."

What Are Small Language Models?

SLMs are typically models with fewer than 10 billion parameters (often under 3 billion) designed for:

  • On-device deployment
  • Low latency requirements
  • Privacy-sensitive applications
  • Cost-effective inference

From research: "A countertrend has emerged to develop SLMs—models with far fewer parameters (on the order of 10⁸–10⁹) that retain useful generative abilities."

Top SLMs in 2025

Llama 3.2 (1B and 3B)

Meta's edge-optimized models:

From Meta: "The Llama 3.2 1B and 3B models support context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge."

Performance: From Meta: "The 3B model outperforms the Gemma 2 2.6B and Phi 3.5-mini models on tasks such as following instructions, summarization, prompt rewriting, and tool-use."

Key features:

  • 128K context length
  • Multilingual text generation
  • Tool calling capabilities
  • Optimized for Qualcomm, MediaTek, and Arm processors

From Meta: "These models empower developers to build personalized, on-device agentic applications with strong privacy where data never leaves the device."

Phi-4 and Phi-4-mini

Microsoft's quality-over-size approach:

From research: "Phi-4-mini-instruct packs 3.8B parameters trained on 5 trillion tokens. The context length goes up to 128K, and it scored 91.1% accuracy on SimpleQA factual questions."

Key innovation: From research: "Phi-4 emphasizes data quality over size. It was trained on synthetic data, filtered public content, and academic resources."

Variants:

  • Phi-4 (14B): Full-size model with strong reasoning
  • Phi-4-mini (3.8B): Optimized for edge
  • Phi-4-mini-reasoning: Focus on reasoning tasks

Qwen 2.5 Series

Alibaba's multilingual powerhouse:

From research: "Qwen2.5 models are pretrained on Alibaba's latest large-scale dataset, encompassing up to 18 trillion tokens. The model supports up to 128K tokens and has multilingual support."

Key features:

  • 29+ language support
  • Specialized coding variants (Qwen2.5-Coder)
  • Mathematical reasoning variants (Qwen2.5-Math)
  • Sizes from 0.5B to 72B

Other Notable SLMs

Model                | Parameters | Context | Best For
SmolLM3-3B           | 3B         | 8K      | General instruction following
Ministral-3B         | 3.4B       | 128K    | Edge + multimodal
Gemma 2 2B           | 2B         | 8K      | Efficient general use
DeepSeek-R1-1.5B     | 1.5B       | 64K     | Reasoning (distilled)

From research: "SmolLM3-3B is a fully open instruct and reasoning model from Hugging Face. At the 3B scale, it outperforms Llama-3.2-3B and Qwen2.5-3B."

Knowledge Distillation

How It Works

Knowledge distillation transfers capabilities from a large "teacher" model to a smaller "student" model.

Why distillation works better than training small models from scratch: A large model learns rich representations during pretraining—it understands nuances, relationships, and patterns that emerge only at scale. When you train a small model directly on the same data, it doesn't have the capacity to learn these nuances. But when you train a small model to mimic a large model's outputs, you're giving it a "shortcut" to capabilities it couldn't discover on its own. The teacher's soft probability distributions contain information about which mistakes are "close" (cat vs. dog) versus "far" (cat vs. car)—information lost in hard labels.

The distillation-RAG tradeoff: For adding knowledge to small models, you have two options: distill the knowledge into weights, or retrieve it at runtime. Distillation bakes knowledge in permanently but is limited by model capacity. RAG keeps knowledge external but adds latency and complexity. For edge deployment, distillation is often preferred because it eliminates the retrieval dependency.

From IBM: "Student-teacher distillation is a model compression technique where a smaller 'student' model learns to mimic the behavior of a larger, more complex 'teacher' model."

From research: "You can often achieve 5-10x compression while retaining 90-95% of the original accuracy."

Distillation Types

1. Response-Based Distillation: Train student to match teacher's outputs.

From research: "This is the most common and easiest-to-implement type that relies on the teacher-model's outputs. The student model is trained to mimic the prediction of the teacher model."

Python
import torch.nn.functional as F

# Simple response-based distillation loss
def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the same temperature
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the soft distributions; T^2 rescales the gradients
    return F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (temperature ** 2)

Understanding temperature in distillation: The temperature parameter controls how "soft" the probability distributions are. At temperature=1, the distributions are sharp (confident predictions dominate). At temperature=2+, distributions become softer, revealing more about the teacher's uncertainty and secondary predictions. Higher temperatures transfer more nuanced information but can also transfer noise. The (temperature ** 2) scaling factor compensates for gradient magnitude changes at different temperatures.

Why KL divergence, not cross-entropy: Cross-entropy treats all wrong answers equally. KL divergence measures how different the student's distribution is from the teacher's, preserving the relative probabilities of all outputs. If the teacher thinks "dog" is 60% likely and "wolf" is 30% likely, the student learns both—not just that "dog" wins.
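To make the temperature effect concrete, here is a minimal sketch; the logit values are made up purely for illustration:

Python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits over four classes (illustrative values only)
teacher_logits = torch.tensor([4.0, 2.5, 1.0, -1.0])

for T in (1.0, 2.0, 4.0):
    probs = F.softmax(teacher_logits / T, dim=-1)
    # Higher T flattens the distribution, revealing secondary predictions
    print(f"T={T}:", [round(p, 3) for p in probs.tolist()])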

2. Feature-Based Distillation: Match intermediate representations.

From research: "Instead of just copying outputs, the student mimics intermediate representations, activations from specific layers. You may add loss terms to align feature maps using L2 or cosine similarity losses."

3. Attention-Based Distillation: Replicate attention patterns.

From research: "Transformer models use attention heads to distribute focus across input tokens. Attention-based KD encourages the student to replicate these attention maps. DistilBERT used this approach effectively."

Advanced Distillation Techniques

Multi-Teacher Distillation: From research: "Multi-Teacher Knowledge Distillation extends this paradigm by aggregating knowledge from multiple teacher models to improve generalization and robustness."

MINILLM: From research: "Researchers replaced the standard KD method's forward KL divergence objective with a reverse KLD objective to prevent the student model from overestimating low-probability regions of the teacher's distribution."
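As a rough sketch of the reverse-KL objective described above (the standard loss shown earlier uses forward KL; here the direction is flipped):

Python
import torch.nn.functional as F

def reverse_kld_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(student || teacher): mode-seeking, so the student is penalized for
    # placing mass in low-probability regions of the teacher's distribution
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    p_student = log_p_student.exp()
    return (p_student * (log_p_student - log_p_teacher)).sum(-1).mean() * (temperature ** 2)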

Generalized Knowledge Distillation (GKD): From research: "GKD trains the student model by incorporating feedback from the teacher model on the student-generated sequences. GKD also allows the flexibility to use alternative loss functions and facilitates integration with reinforcement learning fine-tuning."

DeepSeek R1 Distillation Example

DeepSeek released distilled reasoning models at multiple sizes:

Model                | Base         | Parameters        | AIME 2024
DeepSeek-R1          | DeepSeek-V3  | 671B (37B active) | 79.8%
R1-Distill-Qwen-32B  | Qwen2.5-32B  | 32B               | 72.6%
R1-Distill-Qwen-14B  | Qwen2.5-14B  | 14B               | 69.7%
R1-Distill-Qwen-7B   | Qwen2.5-7B   | 7B                | 55.5%
R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 1.5B              | 28.9%

The distillation process:

  1. Generate reasoning traces from R1
  2. Use traces as training data for smaller models
  3. Smaller models learn similar reasoning patterns
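A rough sketch of steps 1 and 2; the helper function and file format are illustrative assumptions, not DeepSeek's actual pipeline:

Python
import json

# Step 1: collect reasoning traces from the teacher (R1) for a set of prompts.
# `teacher_generate` is a placeholder for whatever inference client you use.
def build_distillation_dataset(prompts, teacher_generate, out_path="traces.jsonl"):
    with open(out_path, "w") as f:
        for prompt in prompts:
            trace = teacher_generate(prompt)  # chain-of-thought + final answer
            # Step 2: store as supervised fine-tuning pairs for the smaller model
            f.write(json.dumps({"prompt": prompt, "completion": trace}) + "\n")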

Model Compression Techniques

Beyond Distillation

From research: "The four key techniques are: model quantization, model pruning, knowledge distillation, and Low-Rank Adaptation (LoRA)."

Quantization

Reduce precision of weights and activations:

Format | Bits | Memory          | Quality
FP16   | 16   | 2 bytes/param   | ~100%
INT8   | 8    | 1 byte/param    | ~99%
INT4   | 4    | 0.5 bytes/param | ~95-97%

From research: "Most of these models support GGUF quantization, which cuts down memory usage and speeds up inference significantly. That's what makes running them on consumer hardware actually practical."

Pruning

Remove unnecessary weights:

  • Magnitude pruning: Remove small weights
  • Structured pruning: Remove entire neurons/heads
  • Movement pruning: Remove based on training dynamics
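Magnitude pruning, for example, can be sketched with PyTorch's built-in pruning utilities; the layer size and pruning ratio below are arbitrary illustrations:

Python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)

# Magnitude (L1, unstructured) pruning: zero out the 30% smallest weights
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent (removes the reparameterization mask)
prune.remove(layer, "weight")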

Combined Approaches

Best results often combine techniques:

  1. Distill to smaller architecture
  2. Quantize to INT4/INT8
  3. Prune redundant connections

Edge Deployment

Why Edge?

Deploying models on edge devices—phones, laptops, IoT devices—fundamentally changes what's possible with AI. Instead of every interaction requiring a round-trip to a datacenter, the model runs locally.

The latency transformation: Cloud API calls typically take 200-2000ms depending on load and network conditions. Local inference on optimized hardware can run in 10-50ms. For interactive applications (typing assistants, real-time translation, voice interfaces), this difference is transformative. You can respond between keystrokes, not after the user stops typing.

The privacy equation: When data never leaves the device, privacy concerns evaporate. Medical notes, financial documents, personal messages—users can get AI assistance without trusting a third party. For enterprise deployment, this eliminates entire categories of compliance concerns (HIPAA, GDPR data transfer rules, etc.).

The economics at scale: At millions of queries, cloud API costs add up ($0.01-$0.10 per query × 1M queries = $10K-$100K/month). A one-time investment in on-device capability eliminates variable costs. The breakeven depends on query volume, but for high-usage applications, edge deployment often wins financially.
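A toy break-even calculation under assumed numbers (the per-query cost and one-time investment below are illustrative, not figures from this article):

Python
def breakeven_queries(cloud_cost_per_query: float, edge_one_time_cost: float) -> float:
    # Number of queries after which on-device inference beats cloud APIs on cost
    return edge_one_time_cost / cloud_cost_per_query

# e.g. $0.02/query vs. a $50,000 one-time edge engineering investment
print(breakeven_queries(0.02, 50_000))  # 2.5M queries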

From research: "Small Language Models deployed on edge devices overcome cloud dependency by reducing latency, bandwidth, and privacy risks."

Benefits:

  • Near-instant response times (milliseconds vs seconds)
  • Works offline / limited connectivity
  • Reduced data transmission costs
  • Enhanced privacy (data stays local)

From research: "The projection that 75% of enterprise data will be processed at the edge by 2025 highlights the critical role SLMs will play."

Deployment Options

1. llama.cpp / GGUF

Most popular for local deployment:

Bash
# Convert model to GGUF
python convert_hf_to_gguf.py ./model --outfile model.gguf

# Quantize to 4-bit
./llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M

# Run inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, world"

2. Ollama

Simplified local model management:

Bash
# Pull and run a model
ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain quantum computing"

# API access
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Hello"}'

3. MLX (Apple Silicon)

Optimized for M-series chips:

Python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)

4. Mobile Deployment

  • iOS: Core ML, MLX
  • Android: TensorFlow Lite, MLC LLM
  • Cross-platform: MLC LLM, llama.cpp

Hardware Requirements

Model Size | RAM Required | Suitable Devices
1B (Q4)    | 1-2 GB       | Smartphones, RPi 5
3B (Q4)    | 2-4 GB       | Laptops, tablets
7B (Q4)    | 4-6 GB       | Gaming PCs, M1+ Macs
13B (Q4)   | 8-12 GB      | Workstations

Use Cases

On-Device Assistants

From Meta: "Build personalized, on-device agentic applications with strong privacy where data never leaves the device."

Examples:

  • Personal note-taking with summarization
  • Local code completion
  • Offline translation
  • Smart home control

Enterprise Edge

From research: "Today, industries already deploy SLMs in scenarios including real-time healthcare diagnostics, robotics, smart homes, and autonomous navigation systems."

  • Healthcare: Analyze medical data locally, HIPAA compliant
  • Manufacturing: Real-time quality control
  • Retail: In-store recommendations without cloud latency

Hybrid Architectures

Combine edge SLMs with cloud LLMs:

Python
# estimate_complexity, local_slm, and cloud_llm are application-specific components
def hybrid_inference(query: str, complexity_threshold: float = 0.7):
    # Estimate query complexity (e.g., length, ambiguity, reasoning required)
    complexity = estimate_complexity(query)

    if complexity < complexity_threshold:
        # Handle locally with SLM
        return local_slm.generate(query)
    else:
        # Route to cloud LLM for complex queries
        return cloud_llm.generate(query)

Performance Optimization

Speculative Decoding with SLMs

Use SLM as draft model for larger target:

Python
# SLM generates candidate tokens quickly
# LLM verifies in parallel
# load_model and speculative_generate are illustrative helpers; `prompt` is your input
draft_model = load_model("llama-3.2-1b")
target_model = load_model("llama-3.1-70b")

# 2-3x speedup on inference
response = speculative_generate(
    draft=draft_model,
    target=target_model,
    prompt=prompt,
    k=4  # tokens to speculate
)

Batching Strategies

Even on edge, batching helps:

Python
# Continuous batching for multiple requests
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
    gpu_memory_utilization=0.9
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)  # prompts: list of input strings

Choosing the Right SLM

Decision Framework

Requirement               | Recommended Model
Minimum size, basic tasks | Llama 3.2 1B, SmolLM 1.7B
General assistant         | Llama 3.2 3B, Phi-4-mini
Coding tasks              | Qwen2.5-Coder-3B
Reasoning tasks           | Phi-4-mini-reasoning, R1-Distill
Multilingual              | Qwen2.5-3B (29 languages)
Multimodal                | Ministral-3B (vision + text)

Benchmarking Your Use Case

Always test on your specific tasks:

Python
import time

def evaluate_slm(model, test_cases):
    # evaluate_output and count_tokens are task-specific helpers you supply
    results = []
    for case in test_cases:
        start = time.time()
        output = model.generate(case["input"])
        latency = time.time() - start

        results.append({
            "correct": evaluate_output(output, case["expected"]),
            "latency": latency,
            "tokens": count_tokens(output)
        })

    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_latency": sum(r["latency"] for r in results) / len(results),
        "tokens_per_second": sum(r["tokens"] for r in results) / sum(r["latency"] for r in results)
    }

Conclusion

Small Language Models enable AI deployment where it wasn't previously possible:

  1. Edge devices with millisecond latency
  2. Privacy-preserving applications
  3. Cost-effective inference at scale
  4. Hybrid architectures combining local and cloud

The techniques—knowledge distillation, quantization, and optimized runtimes—continue to improve, making SLMs increasingly capable.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
