
Small Language Models: Edge Deployment and Knowledge Distillation

The rise of Small Language Models (SLMs)—from Llama 3.2 to Phi-4 to Qwen 2.5. Understanding knowledge distillation, quantization, and deploying AI at the edge.


The SLM Revolution

While frontier models grow to hundreds of billions of parameters, a counter-trend has emerged: Small Language Models (SLMs) that run efficiently on edge devices while retaining most capabilities.

From research: "The global SLM market, valued at USD 0.93 billion in 2025, is projected to reach USD 5.45 billion by 2032, with a CAGR of 28.7%."

The business case is compelling: "Enterprises deploying SLMs reported an average 73% cost reduction compared to equivalent LLM implementations while maintaining 90%+ of functionality for targeted use cases."

What Are Small Language Models?

SLMs are typically models with fewer than 10 billion parameters (often under 3 billion) designed for:

  • On-device deployment
  • Low latency requirements
  • Privacy-sensitive applications
  • Cost-effective inference

From research: "A countertrend has emerged to develop SLMs—models with far fewer parameters (on the order of 10⁸–10⁹) that retain useful generative abilities."

Top SLMs in 2025

Llama 3.2 (1B and 3B)

Meta's edge-optimized models:

From Meta: "The Llama 3.2 1B and 3B models support context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge."

Performance: From Meta: "The 3B model outperforms the Gemma 2 2.6B and Phi 3.5-mini models on tasks such as following instructions, summarization, prompt rewriting, and tool-use."

Key features:

  • 128K context length
  • Multilingual text generation
  • Tool calling capabilities
  • Optimized for Qualcomm, MediaTek, and Arm processors

From Meta: "These models empower developers to build personalized, on-device agentic applications with strong privacy where data never leaves the device."

Phi-4 and Phi-4-mini

Microsoft's quality-over-size approach:

From research: "Phi-4-mini-instruct packs 3.8B parameters trained on 5 trillion tokens. The context length goes up to 128K, and it scored 91.1% accuracy on SimpleQA factual questions."

Key innovation: From research: "Phi-4 emphasizes data quality over size. It was trained on synthetic data, filtered public content, and academic resources."

Variants:

  • Phi-4 (14B): Full-size model with strong reasoning
  • Phi-4-mini (3.8B): Optimized for edge
  • Phi-4-mini-reasoning: Focus on reasoning tasks

Qwen 2.5 Series

Alibaba's multilingual powerhouse:

From research: "Qwen2.5 models are pretrained on Alibaba's latest large-scale dataset, encompassing up to 18 trillion tokens. The model supports up to 128K tokens and has multilingual support."

Key features:

  • 29+ language support
  • Specialized coding variants (Qwen2.5-Coder)
  • Mathematical reasoning variants (Qwen2.5-Math)
  • Sizes from 0.5B to 72B

Other Notable SLMs

Model                | Parameters | Context | Best For
SmolLM3-3B           | 3B         | 8K      | General instruction following
Ministral-3B         | 3.4B       | 128K    | Edge + multimodal
Gemma 2 2B           | 2B         | 8K      | Efficient general use
DeepSeek-R1-1.5B     | 1.5B       | 64K     | Reasoning (distilled)

From research: "SmolLM3-3B is a fully open instruct and reasoning model from Hugging Face. At the 3B scale, it outperforms Llama-3.2-3B and Qwen2.5-3B."

Knowledge Distillation

How It Works

Knowledge distillation transfers capabilities from a large "teacher" model to a smaller "student" model.

Why distillation works better than training small models from scratch: A large model learns rich representations during pretraining—it understands nuances, relationships, and patterns that emerge only at scale. When you train a small model directly on the same data, it doesn't have the capacity to learn these nuances. But when you train a small model to mimic a large model's outputs, you're giving it a "shortcut" to capabilities it couldn't discover on its own. The teacher's soft probability distributions contain information about which mistakes are "close" (cat vs. dog) versus "far" (cat vs. car)—information lost in hard labels.

The distillation-RAG tradeoff: For adding knowledge to small models, you have two options: distill the knowledge into weights, or retrieve it at runtime. Distillation bakes knowledge in permanently but is limited by model capacity. RAG keeps knowledge external but adds latency and complexity. For edge deployment, distillation is often preferred because it eliminates the retrieval dependency.

From IBM: "Student-teacher distillation is a model compression technique where a smaller 'student' model learns to mimic the behavior of a larger, more complex 'teacher' model."

From research: "You can often achieve 5-10x compression while retaining 90-95% of the original accuracy."

Distillation Types

1. Response-Based Distillation: Train student to match teacher's outputs.

From research: "This is the most common and easiest-to-implement type that relies on the teacher-model's outputs. The student model is trained to mimic the prediction of the teacher model."

Python
import torch.nn.functional as F

# Simple response-based distillation loss
def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the same temperature
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the soft distributions; T^2 rescales the gradients
    return F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (temperature ** 2)

Understanding temperature in distillation: The temperature parameter controls how "soft" the probability distributions are. At temperature=1, the distributions are sharp (confident predictions dominate). At temperature=2+, distributions become softer, revealing more about the teacher's uncertainty and secondary predictions. Higher temperatures transfer more nuanced information but can also transfer noise. The (temperature ** 2) scaling factor compensates for gradient magnitude changes at different temperatures.

Why KL divergence, not cross-entropy: Cross-entropy treats all wrong answers equally. KL divergence measures how different the student's distribution is from the teacher's, preserving the relative probabilities of all outputs. If the teacher thinks "dog" is 60% likely and "wolf" is 30% likely, the student learns both—not just that "dog" wins.
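To make the temperature effect concrete, here is a minimal sketch; the logit values are made up purely for illustration:

Python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits over four classes (illustrative values only)
teacher_logits = torch.tensor([4.0, 2.5, 1.0, -1.0])

for T in (1.0, 2.0, 4.0):
    probs = F.softmax(teacher_logits / T, dim=-1)
    # Higher T flattens the distribution, revealing secondary predictions
    print(f"T={T}:", [round(p, 3) for p in probs.tolist()])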

2. Feature-Based Distillation: Match intermediate representations.

From research: "Instead of just copying outputs, the student mimics intermediate representations, activations from specific layers. You may add loss terms to align feature maps using L2 or cosine similarity losses."

3. Attention-Based Distillation: Replicate attention patterns.

From research: "Transformer models use attention heads to distribute focus across input tokens. Attention-based KD encourages the student to replicate these attention maps. DistilBERT used this approach effectively."

Advanced Distillation Techniques

Multi-Teacher Distillation: From research: "Multi-Teacher Knowledge Distillation extends this paradigm by aggregating knowledge from multiple teacher models to improve generalization and robustness."

MINILLM: From research: "Researchers replaced the standard KD method's forward KL divergence objective with a reverse KLD objective to prevent the student model from overestimating low-probability regions of the teacher's distribution."
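As a rough sketch of the reverse-KL objective described above (the standard loss shown earlier uses forward KL; here the direction is flipped):

Python
import torch.nn.functional as F

def reverse_kld_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(student || teacher): mode-seeking, so the student is penalized for
    # placing mass in low-probability regions of the teacher's distribution
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    p_student = log_p_student.exp()
    return (p_student * (log_p_student - log_p_teacher)).sum(-1).mean() * (temperature ** 2)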

Generalized Knowledge Distillation (GKD): From research: "GKD trains the student model by incorporating feedback from the teacher model on the student-generated sequences. GKD also allows the flexibility to use alternative loss functions and facilitates integration with reinforcement learning fine-tuning."

DeepSeek R1 Distillation Example

DeepSeek released distilled reasoning models at multiple sizes:

Model                | Base         | Parameters        | AIME 2024
DeepSeek-R1          | DeepSeek-V3  | 671B (37B active) | 79.8%
R1-Distill-Qwen-32B  | Qwen2.5-32B  | 32B               | 72.6%
R1-Distill-Qwen-14B  | Qwen2.5-14B  | 14B               | 69.7%
R1-Distill-Qwen-7B   | Qwen2.5-7B   | 7B                | 55.5%
R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 1.5B              | 28.9%

The distillation process:

  1. Generate reasoning traces from R1
  2. Use traces as training data for smaller models
  3. Smaller models learn similar reasoning patterns
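A rough sketch of steps 1 and 2; the helper function and file format are illustrative assumptions, not DeepSeek's actual pipeline:

Python
import json

# Step 1: collect reasoning traces from the teacher (R1) for a set of prompts.
# `teacher_generate` is a placeholder for whatever inference client you use.
def build_distillation_dataset(prompts, teacher_generate, out_path="traces.jsonl"):
    with open(out_path, "w") as f:
        for prompt in prompts:
            trace = teacher_generate(prompt)  # chain-of-thought + final answer
            # Step 2: store as supervised fine-tuning pairs for the smaller model
            f.write(json.dumps({"prompt": prompt, "completion": trace}) + "\n")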

Model Compression Techniques

Beyond Distillation

From research: "The four key techniques are: model quantization, model pruning, knowledge distillation, and Low-Rank Adaptation (LoRA)."

Quantization

Reduce precision of weights and activations:

Format | Bits | Memory          | Quality
FP16   | 16   | 2 bytes/param   | ~100%
INT8   | 8    | 1 byte/param    | ~99%
INT4   | 4    | 0.5 bytes/param | ~95-97%

From research: "Most of these models support GGUF quantization, which cuts down memory usage and speeds up inference significantly. That's what makes running them on consumer hardware actually practical."

Pruning

Remove unnecessary weights:

  • Magnitude pruning: Remove small weights
  • Structured pruning: Remove entire neurons/heads
  • Movement pruning: Remove based on training dynamics
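Magnitude pruning, for example, can be sketched with PyTorch's built-in pruning utilities; the layer size and pruning ratio below are arbitrary illustrations:

Python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)

# Magnitude (L1, unstructured) pruning: zero out the 30% smallest weights
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent (removes the reparameterization mask)
prune.remove(layer, "weight")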

Combined Approaches

Best results often combine techniques:

  1. Distill to smaller architecture
  2. Quantize to INT4/INT8
  3. Prune redundant connections

Edge Deployment

Why Edge?

Deploying models on edge devices—phones, laptops, IoT devices—fundamentally changes what's possible with AI. Instead of every interaction requiring a round-trip to a datacenter, the model runs locally.

The latency transformation: Cloud API calls typically take 200-2000ms depending on load and network conditions. Local inference on optimized hardware can run in 10-50ms. For interactive applications (typing assistants, real-time translation, voice interfaces), this difference is transformative. You can respond between keystrokes, not after the user stops typing.

The privacy equation: When data never leaves the device, privacy concerns evaporate. Medical notes, financial documents, personal messages—users can get AI assistance without trusting a third party. For enterprise deployment, this eliminates entire categories of compliance concerns (HIPAA, GDPR data transfer rules, etc.).

The economics at scale: At millions of queries, cloud API costs add up ($0.01-$0.10 per query × 1M queries = $10K-$100K/month). A one-time investment in on-device capability eliminates variable costs. The breakeven depends on query volume, but for high-usage applications, edge deployment often wins financially.
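A toy break-even calculation under assumed numbers (the per-query cost and one-time investment below are illustrative, not figures from this article):

Python
def breakeven_queries(cloud_cost_per_query: float, edge_one_time_cost: float) -> float:
    # Number of queries after which on-device inference beats cloud APIs on cost
    return edge_one_time_cost / cloud_cost_per_query

# e.g. $0.02/query vs. a $50,000 one-time edge engineering investment
print(breakeven_queries(0.02, 50_000))  # 2.5M queries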

From research: "Small Language Models deployed on edge devices overcome cloud dependency by reducing latency, bandwidth, and privacy risks."

Benefits:

  • Near-instant response times (milliseconds vs seconds)
  • Works offline / limited connectivity
  • Reduced data transmission costs
  • Enhanced privacy (data stays local)

From research: "The projection that 75% of enterprise data will be processed at the edge by 2025 highlights the critical role SLMs will play."

Deployment Options

1. llama.cpp / GGUF

Most popular for local deployment:

Bash
# Convert model to GGUF
python convert_hf_to_gguf.py ./model --outfile model.gguf

# Quantize to 4-bit
./llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M

# Run inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, world"

2. Ollama

Simplified local model management:

Bash
# Pull and run a model
ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain quantum computing"

# API access
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Hello"}'

3. MLX (Apple Silicon)

Optimized for M-series chips:

Python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)

4. Mobile Deployment

  • iOS: Core ML, MLX
  • Android: TensorFlow Lite, MLC LLM
  • Cross-platform: MLC LLM, llama.cpp

Hardware Requirements

Model Size | RAM Required | Suitable Devices
1B (Q4)    | 1-2 GB       | Smartphones, RPi 5
3B (Q4)    | 2-4 GB       | Laptops, tablets
7B (Q4)    | 4-6 GB       | Gaming PCs, M1+ Macs
13B (Q4)   | 8-12 GB      | Workstations

Use Cases

On-Device Assistants

From Meta: "Build personalized, on-device agentic applications with strong privacy where data never leaves the device."

Examples:

  • Personal note-taking with summarization
  • Local code completion
  • Offline translation
  • Smart home control

Enterprise Edge

From research: "Today, industries already deploy SLMs in scenarios including real-time healthcare diagnostics, robotics, smart homes, and autonomous navigation systems."

  • Healthcare: Analyze medical data locally, HIPAA compliant
  • Manufacturing: Real-time quality control
  • Retail: In-store recommendations without cloud latency

Hybrid Architectures

Combine edge SLMs with cloud LLMs:

Python
# estimate_complexity, local_slm, and cloud_llm are application-specific components
def hybrid_inference(query: str, complexity_threshold: float = 0.7):
    # Estimate query complexity (e.g., length, ambiguity, reasoning required)
    complexity = estimate_complexity(query)

    if complexity < complexity_threshold:
        # Handle locally with SLM
        return local_slm.generate(query)
    else:
        # Route to cloud LLM for complex queries
        return cloud_llm.generate(query)

Performance Optimization

Speculative Decoding with SLMs

Use SLM as draft model for larger target:

Python
# SLM generates candidate tokens quickly
# LLM verifies in parallel
# load_model and speculative_generate are illustrative helpers; `prompt` is your input
draft_model = load_model("llama-3.2-1b")
target_model = load_model("llama-3.1-70b")

# 2-3x speedup on inference
response = speculative_generate(
    draft=draft_model,
    target=target_model,
    prompt=prompt,
    k=4  # tokens to speculate
)

Batching Strategies

Even on edge, batching helps:

Python
# Continuous batching for multiple requests
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
    gpu_memory_utilization=0.9
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)  # prompts: list of input strings

Choosing the Right SLM

Decision Framework

Requirement               | Recommended Model
Minimum size, basic tasks | Llama 3.2 1B, SmolLM 1.7B
General assistant         | Llama 3.2 3B, Phi-4-mini
Coding tasks              | Qwen2.5-Coder-3B
Reasoning tasks           | Phi-4-mini-reasoning, R1-Distill
Multilingual              | Qwen2.5-3B (29 languages)
Multimodal                | Ministral-3B (vision + text)

Benchmarking Your Use Case

Always test on your specific tasks:

Python
import time

def evaluate_slm(model, test_cases):
    # evaluate_output and count_tokens are task-specific helpers you supply
    results = []
    for case in test_cases:
        start = time.time()
        output = model.generate(case["input"])
        latency = time.time() - start

        results.append({
            "correct": evaluate_output(output, case["expected"]),
            "latency": latency,
            "tokens": count_tokens(output)
        })

    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_latency": sum(r["latency"] for r in results) / len(results),
        "tokens_per_second": sum(r["tokens"] for r in results) / sum(r["latency"] for r in results)
    }

Conclusion

Small Language Models enable AI deployment where it wasn't previously possible:

  1. Edge devices with millisecond latency
  2. Privacy-preserving applications
  3. Cost-effective inference at scale
  4. Hybrid architectures combining local and cloud

The techniques—knowledge distillation, quantization, and optimized runtimes—continue to improve, making SLMs increasingly capable.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
