Task-Specific SLM Distillation: A Complete Implementation Guide
A hands-on guide to distilling Small Language Models for specific tasks—from selecting teacher/student models to data generation, training with LoRA/QLoRA, and deployment. Includes 2025 best practices from DeepSeek R1 and latest research.
Why Task-Specific Distillation?
You have a specific task—customer support, code review, medical Q&A, document summarization—and you need a model that's fast, cheap, and runs anywhere. Large models like GPT-4 or Claude can do the task well, but they're expensive ($10-30 per million tokens) and slow.
Task-specific distillation solves this: take a powerful teacher model, generate training data for your specific task, and train a small student model that captures most of the teacher's capability at a fraction of the cost.
The economics are compelling:
┌─────────────────────────────────────────────────────────────────────────┐
│ DISTILLATION ECONOMICS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING FROM SCRATCH: │
│ ────────────────────── │
│ Cost: $100,000+ │
│ Time: Months │
│ Data: Trillions of tokens │
│ Result: General-purpose model │
│ │
│ TASK-SPECIFIC DISTILLATION: │
│ ─────────────────────────── │
│ Cost: $1,000 - $10,000 │
│ Time: Days to 2 weeks │
│ Data: 50K - 500K examples │
│ Result: Task-specialized model with 85-95% of teacher capability │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PER-QUERY SAVINGS: │
│ │
│ Teacher API (GPT-4 class): $0.03 per query │
│ Distilled 7B (self-hosted): $0.001 per query │
│ Distilled 7B (quantized): $0.0005 per query │
│ │
│ At 1M queries/month: $30,000 → $500 = 60× cost reduction │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Recent results prove this works. DeepSeek distilled their 671B R1 reasoning model into smaller variants:
- R1-Distill-Qwen-32B: Outperforms OpenAI o1-mini on benchmarks
- R1-Distill-Qwen-7B: 92.8% on MATH-500 (vs 97.3% for full R1)
- R1-Distill-Qwen-1.5B: 83.9% on MATH-500—small enough to run on a phone
Even a 1.5B model can capture sophisticated reasoning through proper distillation.
The Distillation Pipeline
Before diving into details, here's the complete workflow:
┌─────────────────────────────────────────────────────────────────────────┐
│ DISTILLATION PIPELINE OVERVIEW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ STEP 1 │ │ STEP 2 │ │ STEP 3 │ │
│ │ Define │───→│ Select │───→│ Generate │ │
│ │ Task │ │ Models │ │ Data │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ STEP 6 │ │ STEP 5 │ │ STEP 4 │ │
│ │ Deploy │←───│ Evaluate │←───│ Train │ │
│ │ │ │ │ │ Student │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TIMELINE: │
│ Step 1: 1 day (define requirements) │
│ Step 2: 1 day (select models) │
│ Step 3: 2-5 days (generate and filter data) │
│ Step 4: 1-3 days (training) │
│ Step 5: 1-2 days (evaluation and iteration) │
│ Step 6: 1 day (deployment) │
│ │
│ TOTAL: 1-2 weeks │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Step 1: Define Your Task Precisely
The more precisely you define your task, the better your distilled model will perform. Vague tasks lead to vague models.
Task Definition Framework
┌─────────────────────────────────────────────────────────────────────────┐
│ TASK DEFINITION CHECKLIST │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. INPUT SPECIFICATION │
│ ──────────────────── │
│ □ What does the input look like? │
│ □ What's the typical input length? │
│ □ What languages/formats are involved? │
│ □ Are there multiple input types (text, code, structured)? │
│ │
│ Example: "Customer support tickets, 50-500 words, English, │
│ may include order IDs and product names" │
│ │
│ 2. OUTPUT SPECIFICATION │
│ ───────────────────── │
│ □ What should the output look like? │
│ □ What's the expected output length? │
│ □ Does it need specific formatting? │
│ □ Should it include reasoning/explanation? │
│ │
│ Example: "Helpful response, 100-300 words, empathetic tone, │
│ includes specific next steps when applicable" │
│ │
│ 3. SUCCESS CRITERIA │
│ ───────────────── │
│ □ How will you measure success? │
│ □ What accuracy/quality level is acceptable? │
│ □ What's the acceptable failure rate? │
│ │
│ Example: "90% of responses rated helpful by humans, │
│ <5% require human escalation" │
│ │
│ 4. DEPLOYMENT CONSTRAINTS │
│ ──────────────────────── │
│ □ What hardware will this run on? │
│ □ What's the latency requirement? │
│ □ What's the throughput requirement? │
│ □ Privacy requirements (on-device, cloud, hybrid)? │
│ │
│ Example: "Must run on A10 GPU, <500ms latency, │
│ 100 requests/second, data stays in VPC" │
│ │
│ 5. EDGE CASES │
│ ────────── │
│ □ What happens with unusual inputs? │
│ □ What should the model refuse to do? │
│ □ How should it handle uncertainty? │
│ │
│ Example: "Escalate abusive messages to human, │
│ say 'I'm not sure' rather than guess for technical Qs" │
│ │
└─────────────────────────────────────────────────────────────────────────┘
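If it helps, the checklist can be captured as a small spec that lives alongside your code and later drives data generation and evaluation. Below is a minimal sketch; every field name is illustrative rather than a required schema, with values borrowed from the customer-support examples above.

# task_spec.py -- hypothetical task spec mirroring the checklist above
TASK_SPEC = {
    "name": "customer_support_response",
    "input": {"description": "Support tickets, 50-500 words, English, may include order IDs"},
    "output": {"description": "Empathetic response, 100-300 words, with concrete next steps"},
    "success_criteria": {
        "human_helpful_rate": 0.90,    # >=90% of responses rated helpful
        "max_escalation_rate": 0.05,   # <5% require human escalation
    },
    "deployment": {"gpu": "A10", "p95_latency_ms": 500, "throughput_rps": 100},
    "edge_cases": [
        "escalate abusive messages to a human",
        "say 'I'm not sure' rather than guess on technical questions",
    ],
}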
Task Complexity Assessment
Different tasks require different amounts of distillation effort:
┌─────────────────────────────────────────────────────────────────────────┐
│ TASK COMPLEXITY LEVELS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LEVEL 1: SIMPLE (Easy to distill) │
│ ────────────────────────────────── │
│ • Classification (sentiment, intent, topic) │
│ • Extraction (named entities, key phrases) │
│ • Simple Q&A with clear answers │
│ • Rewriting/paraphrasing │
│ │
│ Data needed: 5K-20K examples │
│ Expected transfer: 95%+ of teacher │
│ Student size: 1B-3B sufficient │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LEVEL 2: MODERATE (Standard distillation) │
│ ────────────────────────────────────────── │
│ • Customer support responses │
│ • Document summarization │
│ • Content generation (emails, descriptions) │
│ • Simple code generation │
│ │
│ Data needed: 20K-100K examples │
│ Expected transfer: 85-95% of teacher │
│ Student size: 3B-7B recommended │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LEVEL 3: COMPLEX (Requires careful distillation) │
│ ───────────────────────────────────────────────── │
│ • Multi-step reasoning │
│ • Complex code generation │
│ • Mathematical problem solving │
│ • Multi-turn dialogue with context │
│ │
│ Data needed: 100K-500K examples │
│ Expected transfer: 75-90% of teacher │
│ Student size: 7B-32B recommended │
│ Technique: Chain-of-thought distillation essential │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LEVEL 4: VERY COMPLEX (Challenging distillation) │
│ ───────────────────────────────────────────────── │
│ • Advanced mathematical proofs │
│ • Complex multi-file code understanding │
│ • Open-ended creative tasks │
│ • Tasks requiring broad world knowledge │
│ │
│ Data needed: 500K+ examples │
│ Expected transfer: 60-85% of teacher │
│ Student size: 14B-70B recommended │
│ May need hybrid approach (distillation + RAG) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Step 2: Select Your Models
Choosing a Teacher Model
The teacher model generates your training data. Choose wisely—its capabilities and limitations will transfer to your student.
┌─────────────────────────────────────────────────────────────────────────┐
│ TEACHER MODEL SELECTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ⚠️ CRITICAL: LEGAL CONSIDERATIONS │
│ ──────────────────────────────────── │
│ │
│ PROHIBITED for commercial distillation: │
│ ✗ GPT-4, GPT-4o (OpenAI ToS prohibits training competing models) │
│ ✗ Claude (Anthropic ToS similar restrictions) │
│ ✗ Gemini (Google ToS restrictions) │
│ │
│ ALLOWED for commercial distillation: │
│ ✓ Llama 3/3.1/3.2 (Meta's license allows this) │
│ ✓ Qwen 2.5 series (Apache 2.0 license) │
│ ✓ DeepSeek models (MIT license) │
│ ✓ Mistral models (Apache 2.0) │
│ ✓ Most HuggingFace open-source models │
│ │
│ Always check the specific license before starting! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RECOMMENDED TEACHERS BY TASK (2025): │
│ │
│ GENERAL PURPOSE: │
│ ┌────────────────────┬───────────┬─────────────────────────────────┐ │
│ │ Model │ Size │ Best For │ │
│ ├────────────────────┼───────────┼─────────────────────────────────┤ │
│ │ Llama 3.1 405B │ 405B │ Best overall open-source │ │
│ │ Llama 3.1 70B │ 70B │ Great balance, easier to run │ │
│ │ Qwen 2.5 72B │ 72B │ Excellent multilingual │ │
│ │ Mistral Large │ 123B │ Strong reasoning │ │
│ └────────────────────┴───────────┴─────────────────────────────────┘ │
│ │
│ REASONING TASKS: │
│ ┌────────────────────┬───────────┬─────────────────────────────────┐ │
│ │ Model │ Size │ Best For │ │
│ ├────────────────────┼───────────┼─────────────────────────────────┤ │
│ │ DeepSeek R1 │ 671B │ State-of-the-art reasoning │ │
│ │ QwQ-32B-Preview │ 32B │ Strong CoT, smaller │ │
│ │ Llama 3.1 70B │ 70B │ Good reasoning baseline │ │
│ └────────────────────┴───────────┴─────────────────────────────────┘ │
│ │
│ CODE TASKS: │
│ ┌────────────────────┬───────────┬─────────────────────────────────┐ │
│ │ Model │ Size │ Best For │ │
│ ├────────────────────┼───────────┼─────────────────────────────────┤ │
│ │ DeepSeek Coder V2 │ 236B │ Best open code model │ │
│ │ Qwen 2.5 Coder 32B │ 32B │ Strong code + instruction │ │
│ │ CodeLlama 70B │ 70B │ Mature, well-tested │ │
│ └────────────────────┴───────────┴─────────────────────────────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 💡 TIP: You don't always need the largest teacher! │
│ │
│ Research shows that 70B teachers often produce students that are │
│ nearly as good as those from 400B+ teachers, at much lower cost. │
│ Start with a 70B teacher; only upgrade if quality isn't sufficient. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Choosing a Student Model
The student model is what you'll deploy. Match it to your hardware constraints.
┌─────────────────────────────────────────────────────────────────────────┐
│ STUDENT MODEL SELECTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ DEPLOYMENT HARDWARE → RECOMMENDED STUDENT SIZE │
│ │
│ ┌──────────────────────────┬────────────────┬───────────────────────┐ │
│ │ Hardware │ Student Size │ Quantization │ │
│ ├──────────────────────────┼────────────────┼───────────────────────┤ │
│ │ Mobile/Edge (4-8GB) │ 0.5B - 1.5B │ INT4 required │ │
│ │ Consumer GPU (12-24GB) │ 1.5B - 7B │ INT4/INT8 │ │
│ │ Workstation (48GB) │ 7B - 14B │ FP16 or INT8 │ │
│ │ Server A100 (80GB) │ 14B - 32B │ FP16 │ │
│ │ Multi-GPU │ 32B - 70B │ FP16 │ │
│ └──────────────────────────┴────────────────┴───────────────────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RECOMMENDED BASE MODELS FOR STUDENTS (2025): │
│ │
│ ULTRA-SMALL (Edge/Mobile): │
│ • Qwen 2.5 0.5B/1.5B - Best capability at tiny size │
│ • SmolLM 1.7B - Designed for efficiency │
│ • Phi-3.5 Mini 3.8B - Punches way above its weight │
│ │
│ SMALL (Consumer GPU): │
│ • Qwen 2.5 7B - Excellent base for distillation │
│ • Llama 3.2 3B - Great on-device capabilities │
│ • Mistral 7B v0.3 - Strong instruction following │
│ │
│ MEDIUM (Server): │
│ • Llama 3.1 8B - Solid all-around performer │
│ • Qwen 2.5 14B - Best for complex tasks │
│ • Qwen 2.5 32B - When you need more capacity │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SELECTING STUDENT SIZE: │
│ │
│ Rule of thumb: Teacher should be 5-20× larger than student │
│ │
│ • 405B teacher → 7B-32B student │
│ • 70B teacher → 3B-14B student │
│ • 32B teacher → 1.5B-7B student │
│ │
│ Larger students = better capability transfer, higher serving cost │
│ Smaller students = lower cost, may miss some nuances │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Step 3: Generate Training Data
This is the most critical step. Your student can only be as good as the data it learns from.
Distillation Approaches
There are several ways to transfer knowledge from teacher to student:
┌─────────────────────────────────────────────────────────────────────────┐
│ DISTILLATION APPROACHES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ APPROACH 1: RESPONSE-BASED (Black-box) │
│ ─────────────────────────────────────── │
│ Student learns from teacher's final outputs only. │
│ │
│ How it works: │
│ 1. Collect/generate prompts for your task │
│ 2. Get teacher's responses │
│ 3. Train student on (prompt, response) pairs │
│ │
│ Pros: │
│ ✓ Works with any teacher (API or local) │
│ ✓ Simple to implement │
│ ✓ Sufficient for many tasks │
│ │
│ Cons: │
│ ✗ Doesn't capture how teacher reasons │
│ ✗ Less effective for complex reasoning │
│ │
│ Best for: Classification, simple Q&A, summarization, rewriting │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ APPROACH 2: CHAIN-OF-THOUGHT DISTILLATION │
│ ────────────────────────────────────────── │
│ Student learns teacher's reasoning process, not just answers. │
│ │
│ How it works: │
│ 1. Prompt teacher with CoT: "Think step by step..." │
│ 2. Capture full reasoning trace + final answer │
│ 3. Train student to generate reasoning + answer │
│ │
│ Pros: │
│ ✓ Captures HOW to think, not just WHAT to say │
│ ✓ Much better for reasoning tasks │
│ ✓ Student learns to self-check │
│ │
│ Cons: │
│ ✗ Longer outputs = more training data/compute │
│ ✗ Need to verify reasoning traces are correct │
│ │
│ Best for: Math, coding, logic, complex Q&A, decision-making │
│ │
│ Research: 770M T5 with CoT distillation outperformed 540B PaLM! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ APPROACH 3: LOGIT-BASED (White-box) │
│ ──────────────────────────────────── │
│ Student learns from teacher's probability distributions. │
│ │
│ How it works: │
│ 1. Run teacher forward pass, get logits for each token │
│ 2. Apply temperature softening (T=2-4) │
│ 3. Train student to match softened distribution (KL divergence) │
│ │
│ Pros: │
│ ✓ Maximum knowledge transfer │
│ ✓ Captures "dark knowledge" (which wrong answers are close) │
│ ✓ Can improve upon pure response distillation │
│ │
│ Cons: │
│ ✗ Need full access to teacher weights │
│ ✗ More compute-intensive │
│ ✗ More complex implementation │
│ │
│ Best for: When you have local teacher and need maximum transfer │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ APPROACH 4: MULTI-TASK DISTILLATION │
│ ──────────────────────────────────── │
│ Train student on multiple related outputs simultaneously. │
│ │
│ How it works: │
│ For each input, train student to predict: │
│ • Label/classification (if applicable) │
│ • Rationale/explanation │
│ • Final response │
│ │
│ Loss = L_label + α × L_rationale + β × L_response │
│ │
│ Best for: Tasks requiring explainability │
│ │
└─────────────────────────────────────────────────────────────────────────┘
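The Approach 4 loss above is simple to express in code. Here is a minimal PyTorch sketch of Loss = L_label + α × L_rationale + β × L_response; the tensor shapes and the assumption that the label, rationale, and response each get their own logits and targets are illustrative.

# multitask_loss.py -- sketch of the Approach 4 combined loss
import torch.nn.functional as F

def multitask_distillation_loss(label_logits, label_targets,
                                rationale_logits, rationale_targets,
                                response_logits, response_targets,
                                alpha=0.5, beta=1.0):
    # Classification head: (batch, num_classes) logits vs class indices
    l_label = F.cross_entropy(label_logits, label_targets)
    # Rationale and response: (batch, seq_len, vocab) token-level cross-entropy,
    # with padding positions marked -100 so they are ignored
    l_rationale = F.cross_entropy(rationale_logits.reshape(-1, rationale_logits.size(-1)),
                                  rationale_targets.reshape(-1), ignore_index=-100)
    l_response = F.cross_entropy(response_logits.reshape(-1, response_logits.size(-1)),
                                 response_targets.reshape(-1), ignore_index=-100)
    return l_label + alpha * l_rationale + beta * l_response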
Data Generation Methods
Now let's get into how to actually generate training data:
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA GENERATION METHODS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ METHOD 1: SELF-INSTRUCT (Alpaca Style) │
│ ─────────────────────────────────────── │
│ Have teacher generate both instructions AND responses. │
│ │
│ Process: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. Create 100-200 high-quality seed examples for your task │ │
│ │ [These should be perfect examples of your task] │ │
│ │ ↓ │ │
│ │ 2. Prompt teacher to generate NEW instructions │ │
│ │ "Given these examples, generate 10 similar instructions..." │ │
│ │ ↓ │ │
│ │ 3. Prompt teacher to generate responses for new instructions │ │
│ │ "Answer this instruction: [instruction]" │ │
│ │ ↓ │ │
│ │ 4. Filter for quality (remove low-quality, duplicates) │ │
│ │ ↓ │ │
│ │ 5. Add good examples to seed pool, repeat from step 2 │ │
│ │ ↓ │ │
│ │ 6. Continue until you have 50K-100K examples │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Stanford Alpaca: 52K examples for <$500 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ METHOD 2: EVOL-INSTRUCT (WizardLM Style) │
│ ───────────────────────────────────────── │
│ Evolutionarily increase instruction complexity. │
│ │
│ Evolution operations: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ IN-DEPTH EVOLUTION (make harder): │ │
│ │ • Add constraints: "...without using loops" │ │
│ │ • Deepen: "explain in more detail with examples" │ │
│ │ • Increase reasoning: "now consider edge cases" │ │
│ │ • Complicate input: add more context/variables │ │
│ │ • Add requirements: "must handle errors gracefully" │ │
│ │ │ │
│ │ IN-BREADTH EVOLUTION (make diverse): │ │
│ │ • Concretize: "give a specific real-world example" │ │
│ │ • Mutate: change domain while keeping structure │ │
│ │ • Generate new: create related but different problems │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Process: │
│ 1. Start with simple task-specific instructions │
│ 2. Apply random evolution operation via teacher │
│ 3. Generate response for evolved instruction │
│ 4. Repeat 3-5 evolution generations │
│ 5. Result: gradient from easy → hard examples │
│ │
│ Why it works: Creates curriculum, teaches model to handle complexity │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ METHOD 3: CHAIN-OF-THOUGHT EXTRACTION │
│ ────────────────────────────────────── │
│ Extract reasoning traces for multi-step problems. │
│ │
│ Prompt template: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Solve this problem step by step. Show your reasoning at each │ │
│ │ step before giving the final answer. │ │
│ │ │ │
│ │ Problem: [YOUR PROBLEM] │ │
│ │ │ │
│ │ Let me think through this carefully: │ │
│ │ Step 1: ... │ │
│ │ Step 2: ... │ │
│ │ ... │ │
│ │ Therefore, the answer is: ... │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ For math/code: VERIFY the answer before including in training │
│ │
│ DeepSeek R1: Used 800K reasoning traces for their distilled models │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ METHOD 4: TASK-SPECIFIC GENERATION │
│ ─────────────────────────────────── │
│ Generate data that exactly matches your production use case. │
│ │
│ Process: │
│ 1. Collect real inputs from your system (anonymized) │
│ 2. Generate teacher responses for these real inputs │
│ 3. Have teacher generate similar synthetic inputs │
│ 4. Balance real and synthetic data │
│ │
│ Example (Customer Support): │
│ • Collect 1,000 real support tickets (anonymized) │
│ • Get teacher to respond to each │
│ • Ask teacher to generate 10 similar tickets per real ticket │
│ • Get teacher responses for synthetic tickets │
│ • Result: 11,000 task-specific examples │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Generating Data: Practical Implementation
The data generation process is where most of the work happens. You'll spend days generating and refining data, but the code itself is straightforward. The key insight is that you're essentially having a conversation with the teacher model at scale—asking it to solve problems, answer questions, or complete tasks in your domain.
The generation loop is simple: load your prompts, call the teacher, save the responses. What makes it effective is the quality of your prompts and the filtering you apply afterward. Most teams iterate several times—generate a batch, evaluate quality, adjust prompts, generate more.
For API-based teachers, you'll want to handle rate limits and failures gracefully. For local teachers, batch processing improves throughput significantly. Either way, save intermediate results frequently—you don't want to lose hours of generation to a crash.
# generate_distillation_data.py
from openai import OpenAI # Works with any OpenAI-compatible API (vLLM, TGI, etc.)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none") # Local teacher
def generate_response(prompt, system_prompt="You are a helpful assistant."):
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-70B-Instruct",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
],
temperature=0.7,
max_tokens=2048,
)
return response.choices[0].message.content
# Generate with Chain-of-Thought for reasoning tasks
def generate_cot_response(problem):
return generate_response(
f"Solve step by step, showing your reasoning:\n\n{problem}",
system_prompt="You are an expert problem solver. Think carefully and show your work."
)
The code above is intentionally minimal. The real complexity is in crafting good prompts and building the loop around it—iterating through your seed examples, handling errors, tracking progress, and saving results in the right format. Most teams wrap this in a simple script that processes a JSONL file of prompts and outputs a JSONL file of (prompt, response) pairs.
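Here is what such a wrapper can look like—a sketch that reads prompts from a JSONL file, calls generate_response from the snippet above, retries transient failures with backoff, and appends each result immediately so a crash never loses finished work. The file layout (a "prompt" field per line) and retry settings are assumptions.

# generation_loop.py -- illustrative wrapper around generate_response()
import json
import time

def run_generation(prompts_path="prompts.jsonl", output_path="generated.jsonl", max_retries=3):
    # Resume support: skip prompts that already have a saved response
    done = set()
    try:
        with open(output_path) as f:
            done = {json.loads(line)["prompt"] for line in f}
    except FileNotFoundError:
        pass
    with open(prompts_path) as fin, open(output_path, "a") as fout:
        for line in fin:
            prompt = json.loads(line)["prompt"]
            if prompt in done:
                continue
            for attempt in range(max_retries):
                try:
                    response = generate_response(prompt)  # defined above
                    fout.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
                    fout.flush()  # save incrementally -- don't lose hours to a crash
                    break
                except Exception as exc:
                    time.sleep(2 ** attempt)  # simple backoff for rate limits/outages
                    if attempt == max_retries - 1:
                        print(f"Skipping prompt after {max_retries} failed attempts: {exc}")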
For Chain-of-Thought distillation, the key is the system prompt and instruction. You're explicitly asking the teacher to externalize its reasoning. This reasoning trace is what makes the student model actually learn to think, rather than just pattern-match on answers. The temperature of 0.7 provides some diversity while keeping responses coherent—you'll want multiple solutions to the same problem to give the student variety.
Data Quality Pipeline
Raw generated data isn't ready for training. You need to filter aggressively:
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA QUALITY PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ RAW GENERATED DATA (e.g., 200K examples) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ FILTER 1: BASIC QUALITY │ │
│ │ ────────────────────────── │ │
│ │ Remove: │ │
│ │ • Very short responses (<50 tokens) │ │
│ │ • Very long responses (>2K tokens, unless expected) │ │
│ │ • Responses with encoding errors │ │
│ │ • Empty or "I can't help" responses │ │
│ │ • Responses in wrong language │ │
│ │ │ │
│ │ Expected loss: ~10-20% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ FILTER 2: DEDUPLICATION │ │
│ │ ─────────────────────── │ │
│ │ Remove: │ │
│ │ • Exact duplicates │ │
│ │ • Near-duplicates (>0.9 cosine similarity) │ │
│ │ • Semantically identical instructions with same response │ │
│ │ │ │
│ │ Expected loss: ~20-40% (Self-Instruct generates many dupes) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ FILTER 3: QUALITY SCORING │ │
│ │ ───────────────────────── │ │
│ │ Use LLM-as-judge to rate quality (1-5): │ │
│ │ │ │
│ │ Prompt: "Rate this instruction-response pair on: │ │
│ │ - Accuracy (is the response correct?) │ │
│ │ - Helpfulness (does it address the instruction?) │ │
│ │ - Coherence (is it well-written?) │ │
│ │ Overall quality: 1-5" │ │
│ │ │ │
│ │ Keep only 4+ rated examples │ │
│ │ │ │
│ │ Expected loss: ~20-30% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ FILTER 4: VERIFICATION (for verifiable tasks) │ │
│ │ ───────────────────────────────────────────── │ │
│ │ For code: Execute and check against test cases │ │
│ │ For math: Verify final answer is correct │ │
│ │ For facts: Cross-check against reliable sources │ │
│ │ │ │
│ │ This is CRITICAL - don't train on incorrect examples! │ │
│ │ │ │
│ │ Expected loss: ~10-30% (depends on teacher quality) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ FILTER 5: DIVERSITY BALANCING │ │
│ │ ───────────────────────────── │ │
│ │ Ensure coverage: │ │
│ │ • Balance across sub-tasks/categories │ │
│ │ • Balance difficulty levels (easy/medium/hard) │ │
│ │ • Balance response lengths │ │
│ │ • Ensure edge cases are represented │ │
│ │ │ │
│ │ May need to generate more data for underrepresented areas │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ CLEAN TRAINING DATA (e.g., 50K-80K examples) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXPECTED YIELD: 25-50% of raw data survives filtering │
│ If you need 50K clean examples, generate 100K-200K raw examples │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Filtering Data: Practical Implementation
Filtering is where you separate signal from noise. The harsh reality is that teacher models make mistakes, generate repetitive content, and sometimes produce responses that are technically correct but unhelpful for learning. Your filtering pipeline is the quality gate that determines your student's ceiling.
The most impactful filter is often the simplest: length. Responses that are too short usually lack substance; responses that are too long often contain padding or repetition. The sweet spot depends on your task, but having clear bounds catches obvious problems cheaply.
Deduplication is critical because Self-Instruct and similar methods naturally produce near-duplicates. The teacher might rephrase the same instruction slightly, or generate nearly identical responses to similar prompts. These duplicates waste training compute and can cause the student to overfit to particular phrasings.
LLM-as-judge quality scoring is the most powerful filter but also the most expensive. You're essentially asking another capable model (can be the same teacher, or a different one) to evaluate whether each example is high-quality. This catches subtle issues that rule-based filters miss: factual errors, unhelpful responses, poor reasoning, or responses that technically answer the question but wouldn't help a student learn.
# filter_data.py
import json
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# 1. Basic filters
def basic_filter(example):
response = example["messages"][-1]["content"]
if len(response) < 100 or len(response) > 8000: # Length bounds
return False
if response.count("\n\n\n") > 2: # Excessive whitespace
return False
return True
# 2. Deduplication
model = SentenceTransformer("all-MiniLM-L6-v2")
def deduplicate(examples, threshold=0.9):
embeddings = model.encode([e["messages"][-1]["content"] for e in examples])
keep = []
for i, emb in enumerate(embeddings):
if not keep or max(cosine_similarity([emb], [embeddings[j] for j in keep])[0]) < threshold:
keep.append(i)
return [examples[i] for i in keep]
# 3. LLM-as-judge (use sparingly - expensive)
def score_quality(example, client):
prompt = f"""Rate this instruction-response pair (1-5):
Instruction: {example["messages"][0]["content"]}
Response: {example["messages"][-1]["content"]}
Score (just the number):"""
score = client.chat.completions.create(model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
return int(score.choices[0].message.content.strip())
The embedding-based deduplication deserves special attention. By encoding responses into dense vectors, you can efficiently find semantically similar content even when the exact words differ. The threshold of 0.9 is conservative—responses need to be very similar to be considered duplicates. Lower thresholds (0.8) catch more duplicates but risk removing legitimately different responses to similar questions.
For LLM-as-judge scoring, consider using a cheaper model like GPT-4o-mini for the initial pass, reserving expensive models for edge cases or final verification. The marginal quality improvement from using GPT-4 over GPT-4o-mini for scoring rarely justifies the 10x cost increase.
The order of filters matters for cost efficiency: apply cheap filters (length, basic rules) first to reduce the volume before expensive filters (deduplication, LLM scoring). A pipeline that filters 200K examples might use basic filters to get to 160K, deduplication to get to 90K, and then only run LLM scoring on those 90K rather than all 200K.
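Put together in that order, the pipeline is just a few lines built on the functions above—cheap rule-based checks first, deduplication second, LLM scoring only on what survives. The minimum score of 4 mirrors the quality pipeline; treat it as a starting point.

# Compose the filters in cost order (cheapest first)
def run_quality_pipeline(examples, client, min_score=4):
    examples = [e for e in examples if basic_filter(e)]            # rule-based checks
    examples = deduplicate(examples, threshold=0.9)                # embedding dedup
    return [e for e in examples if score_quality(e, client) >= min_score]  # LLM-as-judge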
Data Format
Format your data to match the student model's expected input:
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA FORMATTING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CHAT FORMAT (Most common): │
│ ────────────────────────── │
│ { │
│ "messages": [ │
│ {"role": "system", "content": "You are a helpful assistant..."}, │
│ {"role": "user", "content": "How do I reset my password?"}, │
│ {"role": "assistant", "content": "To reset your password..."} │
│ ] │
│ } │
│ │
│ For multi-turn: │
│ { │
│ "messages": [ │
│ {"role": "system", "content": "..."}, │
│ {"role": "user", "content": "Turn 1 question"}, │
│ {"role": "assistant", "content": "Turn 1 answer"}, │
│ {"role": "user", "content": "Turn 2 follow-up"}, │
│ {"role": "assistant", "content": "Turn 2 answer"} │
│ ] │
│ } │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ INSTRUCTION FORMAT (Alpaca style): │
│ ─────────────────────────────────── │
│ { │
│ "instruction": "Summarize the following text", │
│ "input": "Long text to summarize...", │
│ "output": "This text discusses..." │
│ } │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CHAIN-OF-THOUGHT FORMAT: │
│ ───────────────────────── │
│ { │
│ "messages": [ │
│ {"role": "user", "content": "What is 23 × 47?"}, │
│ {"role": "assistant", "content": │
│ "Let me solve this step by step:\n\n" + │
│ "Step 1: Break down 23 × 47\n" + │
│ "23 × 47 = 23 × (50 - 3)\n" + │
│ " = 23 × 50 - 23 × 3\n\n" + │
│ "Step 2: Calculate each part\n" + │
│ "23 × 50 = 1150\n" + │
│ "23 × 3 = 69\n\n" + │
│ "Step 3: Subtract\n" + │
│ "1150 - 69 = 1081\n\n" + │
│ "Therefore, 23 × 47 = 1081" │
│ } │
│ ] │
│ } │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ IMPORTANT: Match the format to your student model's expected format! │
│ • Llama 3: Use Llama 3 chat template │
│ • Qwen: Use Qwen chat template (ChatML-style) │
│ • Mistral: Use Mistral chat template │
│ │
│ Wrong format = poor performance (model gets confused) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
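The safest way to get the template right is to let the tokenizer apply it for you. The sketch below uses transformers' apply_chat_template to render one example in the student's native format (the Instruct checkpoint is used here because its tokenizer ships a chat template); it's useful for spot-checking your data before training.

# check_format.py -- render one example with the student's chat template
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

example = {
    "messages": [
        {"role": "system", "content": "You are a helpful support agent."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "To reset your password, open Settings..."},
    ]
}

# Renders the ChatML-style special tokens for Qwen; a mismatch here means poor training results
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
print(text)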
Data Quantity Guidelines
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA QUANTITY GUIDELINES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TASK TYPE MINIMUM RECOMMENDED IDEAL │
│ ───────────────────────────────────────────────────────────────────── │
│ Classification 1,000 5,000 20,000 │
│ Simple Q&A 5,000 20,000 50,000 │
│ Customer support 10,000 50,000 100,000 │
│ Summarization 10,000 50,000 100,000 │
│ Code generation 20,000 100,000 500,000 │
│ Reasoning/Math 50,000 200,000 800,000 │
│ Complex dialogue 20,000 100,000 300,000 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ REFERENCE POINTS: │
│ │
│ Stanford Alpaca: 52,000 examples → strong instruction following │
│ WizardLM: 250,000 examples → complex instruction following │
│ DeepSeek R1: 800,000 examples → state-of-the-art reasoning │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ QUALITY VS QUANTITY: │
│ │
│ 10,000 high-quality examples > 100,000 low-quality examples │
│ │
│ Microsoft Phi models showed that carefully curated data can make │
│ small models surprisingly capable. Focus on quality first! │
│ │
│ PRACTICAL APPROACH: │
│ 1. Start with 10K-20K high-quality examples │
│ 2. Train and evaluate │
│ 3. Identify failure modes │
│ 4. Generate targeted data for failures │
│ 5. Retrain and repeat │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Step 4: Train the Student Model
Training Method Selection
┌─────────────────────────────────────────────────────────────────────────┐
│ TRAINING METHOD SELECTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FULL FINE-TUNING: │
│ ───────────────── │
│ Update all model parameters. │
│ │
│ Memory (7B model): ~100GB │
│ Hardware: Multi-GPU (8× A100) │
│ Quality: 100% (baseline) │
│ │
│ When to use: Maximum quality needed, have compute budget │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LoRA (Low-Rank Adaptation): │
│ ─────────────────────────── │
│ Add small trainable adapters, freeze base weights. │
│ │
│ Memory (7B model): ~16GB │
│ Hardware: Single A100 or 2× A10 │
│ Quality: 95-99% of full fine-tuning │
│ │
│ When to use: Default choice for most distillation │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ QLoRA (Quantized LoRA): │
│ ─────────────────────── │
│ LoRA + 4-bit quantized base model. │
│ │
│ Memory (7B model): ~6GB │
│ Hardware: RTX 4090, RTX 3090, or A10 │
│ Quality: 93-98% of full fine-tuning │
│ │
│ When to use: Consumer GPU, limited memory │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RECOMMENDATION FOR DISTILLATION: │
│ │
│ Start with LoRA or QLoRA. Only use full fine-tuning if you have │
│ compute budget AND LoRA isn't hitting your quality targets. │
│ │
│ Benefits of LoRA for distillation: │
│ • Naturally prevents catastrophic forgetting │
│ • Can keep base capabilities while adding task-specific skills │
│ • Much faster iteration (cheaper to experiment) │
│ • Smaller adapter files for deployment │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Training Configuration
Here's a recommended configuration for task-specific distillation:
┌─────────────────────────────────────────────────────────────────────────┐
│ TRAINING CONFIGURATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LORA HYPERPARAMETERS: │
│ ───────────────────── │
│ │
│ r (rank): 16-64 │
│ ├── Start with 32 for most tasks │
│ ├── Increase to 64 for complex reasoning │
│ └── 16 may be sufficient for simple tasks │
│ │
│ lora_alpha: 2 × r (e.g., 64 if r=32) │
│ ├── This gives scaling factor of 2.0 │
│ └── Higher alpha = stronger adaptation │
│ │
│ lora_dropout: 0.05-0.1 │
│ ├── 0.1 for smaller datasets (<20K) │
│ ├── 0.05 for larger datasets (>50K) │
│ └── Helps prevent overfitting │
│ │
│ target_modules: All linear layers │
│ ["q_proj", "k_proj", "v_proj", "o_proj", │
│ "gate_proj", "up_proj", "down_proj"] │
│ └── More modules = more capacity, slightly more memory │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRAINING HYPERPARAMETERS: │
│ ───────────────────────── │
│ │
│ learning_rate: 2e-4 (for LoRA) │
│ ├── Higher than full fine-tuning (which uses 1e-5) │
│ └── Adjust if training is unstable │
│ │
│ per_device_batch_size: 4-8 │
│ ├── As large as fits in memory │
│ └── Use gradient accumulation for larger effective batch │
│ │
│ gradient_accumulation_steps: 4-16 │
│ └── Effective batch = per_device × accumulation × num_gpus │
│ │
│ num_train_epochs: 2-5 │
│ ├── More epochs for smaller datasets │
│ ├── Fewer epochs for larger datasets │
│ └── Monitor for overfitting │
│ │
│ warmup_ratio: 0.03-0.1 │
│ └── ~3% of total steps is usually good │
│ │
│ lr_scheduler: "cosine" or "linear" │
│ │
│ max_seq_length: 2048-4096 │
│ └── Match your expected use case │
│ │
│ weight_decay: 0.01 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ QLORA SPECIFICS: │
│ ──────────────── │
│ │
│ load_in_4bit: True │
│ bnb_4bit_quant_type: "nf4" (NormalFloat4) │
│ bnb_4bit_compute_dtype: torch.bfloat16 │
│ bnb_4bit_use_double_quant: True │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Example Training Script
# task_specific_distillation.py
# Complete training script for task-specific SLM distillation
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
# ─────────────────────────────────────────────────────────────────────────
# 1. CONFIGURATION
# ─────────────────────────────────────────────────────────────────────────
MODEL_NAME = "Qwen/Qwen2.5-7B" # Student model
DATASET_PATH = "your_distilled_data.jsonl" # Your generated data
OUTPUT_DIR = "./distilled_customer_support_7b"
# QLoRA configuration (for consumer GPU)
USE_QLORA = True
# LoRA configuration
LORA_CONFIG = {
"r": 32,
"lora_alpha": 64,
"lora_dropout": 0.05,
"target_modules": [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
"bias": "none",
"task_type": "CAUSAL_LM",
}
# Training configuration
TRAINING_CONFIG = {
"num_train_epochs": 3,
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 8, # Effective batch = 32
"learning_rate": 2e-4,
"warmup_ratio": 0.03,
"logging_steps": 10,
"save_strategy": "epoch",
"eval_strategy": "epoch",
"bf16": True,
"max_seq_length": 2048,
}
# ─────────────────────────────────────────────────────────────────────────
# 2. LOAD MODEL
# ─────────────────────────────────────────────────────────────────────────
if USE_QLORA:
# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# ─────────────────────────────────────────────────────────────────────────
# 3. APPLY LORA
# ─────────────────────────────────────────────────────────────────────────
lora_config = LoraConfig(**LORA_CONFIG)
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
# Expected output: trainable params: ~80M (~1% of 7B)
# ─────────────────────────────────────────────────────────────────────────
# 4. LOAD AND PREPARE DATA
# ─────────────────────────────────────────────────────────────────────────
# Load your distilled dataset
# Expected format: {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
# Split for evaluation
dataset = dataset.train_test_split(test_size=0.05)
# ─────────────────────────────────────────────────────────────────────────
# 5. TRAIN
# ─────────────────────────────────────────────────────────────────────────
training_args = SFTConfig(
output_dir=OUTPUT_DIR,
**TRAINING_CONFIG,
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
args=training_args,
)
# Start training
trainer.train()
# Save the final model
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Training complete! Model saved to {OUTPUT_DIR}")
# ─────────────────────────────────────────────────────────────────────────
# 6. MERGE LORA (Optional - for deployment)
# ─────────────────────────────────────────────────────────────────────────
# If you want a single merged model for deployment:
# from peft import PeftModel
#
# base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained("./merged_model")
Loss Functions for Distillation
For more advanced distillation (especially logit-based), you may want custom loss functions:
┌─────────────────────────────────────────────────────────────────────────┐
│ DISTILLATION LOSS FUNCTIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STANDARD (Response-based): │
│ ────────────────────────── │
│ Just train on teacher responses as ground truth. │
│ Loss = CrossEntropy(student_output, teacher_response) │
│ │
│ This is what SFTTrainer does by default. Sufficient for most cases. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LOGIT-BASED (KL Divergence): │
│ ───────────────────────────── │
│ Train student to match teacher's probability distribution. │
│ │
│ L_distill = KL(softmax(student_logits/T), softmax(teacher_logits/T)) │
│ L_task = CrossEntropy(student_output, ground_truth) │
│ L_total = α × L_distill + (1-α) × L_task │
│ │
│ Where: │
│ • T = temperature (2.0-4.0), softens distributions │
│ • α = distillation weight (0.5 is common) │
│ │
│ Why temperature helps: │
│ - At T=1, most probability mass is on top prediction │
│ - Higher T spreads probability, revealing relationships │
│ - Student learns which wrong answers are "close" │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ REVERSE KL (MiniLLM approach): │
│ ─────────────────────────────── │
│ Use reverse KL instead of forward KL. │
│ │
│   Forward KL: Student covers all of teacher distribution (mode-covering)   │
│   Reverse KL: Student focuses on high-prob regions (mode-seeking)          │
│ │
│ For generation, reverse KL works better - student doesn't waste │
│ capacity trying to model low-probability teacher outputs. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
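If you have white-box access and want the logit-based objective, the combined loss is only a few lines of PyTorch. This is a minimal sketch assuming teacher and student share the same tokenizer and vocabulary; T and α follow the values in the box above.

# kd_loss.py -- sketch of temperature-softened KL + task cross-entropy
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    vocab = student_logits.size(-1)
    # Soften both distributions with temperature T
    log_p_student = F.log_softmax(student_logits / T, dim=-1).reshape(-1, vocab)
    p_teacher = F.softmax(teacher_logits / T, dim=-1).reshape(-1, vocab)
    # KL between softened distributions; the T^2 factor keeps gradient scale comparable
    l_distill = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
    # Standard cross-entropy on the hard targets (padding marked with -100)
    l_task = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1),
                             ignore_index=-100)
    return alpha * l_distill + (1 - alpha) * l_task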
Training Monitoring
What to watch during training:
┌─────────────────────────────────────────────────────────────────────────┐
│ TRAINING MONITORING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ HEALTHY TRAINING SIGNS: │
│ ──────────────────────── │
│ ✓ Training loss decreasing steadily │
│ ✓ Validation loss decreasing (parallel to training) │
│ ✓ No loss spikes │
│ ✓ Gradient norm stable │
│ │
│ WARNING SIGNS: │
│ ────────────── │
│ ⚠ Validation loss increasing while training loss decreases │
│ → Overfitting! Stop training or add regularization │
│ │
│ ⚠ Loss spikes │
│ → Learning rate too high, reduce by 2-5× │
│ │
│ ⚠ Loss not decreasing │
│ → Learning rate too low, data format issue, or already converged │
│ │
│ ⚠ NaN losses │
│ → Numeric instability, try lower LR or gradient clipping │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TYPICAL LOSS PROGRESSION (50K examples, 3 epochs): │
│ │
│ Epoch 1: Loss 2.5 → 1.5 (big improvement) │
│ Epoch 2: Loss 1.5 → 1.2 (continued improvement) │
│ Epoch 3: Loss 1.2 → 1.1 (diminishing returns) │
│ │
│ If loss plateaus early, you may need: │
│ • More diverse training data │
│ • Higher LoRA rank │
│ • Lower learning rate (training too fast, missing nuances) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
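One low-effort way to act on the overfitting warning is transformers' EarlyStoppingCallback, which halts training once the eval metric stops improving. The sketch below shows only the pieces added on top of the Step 4 training script (it reuses OUTPUT_DIR, TRAINING_CONFIG, model, tokenizer, and dataset from there); the patience value is an assumption to tune.

# Early stopping on validation loss, added to the Step 4 script
from transformers import EarlyStoppingCallback

training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    **TRAINING_CONFIG,                  # eval/save strategies are already "epoch"
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # stop after 2 non-improving evals
)
trainer.train()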
Step 5: Evaluate the Distilled Model
Evaluation Framework
┌─────────────────────────────────────────────────────────────────────────┐
│ EVALUATION FRAMEWORK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. TASK-SPECIFIC METRICS │
│ ─────────────────────── │
│ Measure performance on YOUR specific task: │
│ │
│ Classification: Accuracy, F1, precision, recall │
│ Generation: BLEU, ROUGE, human preference │
│ Q&A: Exact match, F1 over answers │
│ Code: Pass@k (execution success rate) │
│ Math: Accuracy on correct final answer │
│ │
│ 2. TRANSFER RATIO │
│ ────────────── │
│ How much of teacher's capability did student capture? │
│ │
│ Transfer Ratio = Student Score / Teacher Score │
│ │
│ Target by task complexity: │
│ • Simple tasks: 95%+ transfer │
│ • Moderate tasks: 85-95% transfer │
│ • Complex reasoning: 75-90% transfer │
│ │
│ 3. EFFICIENCY METRICS │
│ ────────────────── │
│ The whole point is efficiency - measure it! │
│ │
│ • Latency: Time per request (ms) │
│ • Throughput: Requests per second │
│ • Memory: GPU memory used │
│ • Cost: $ per 1M tokens │
│ │
│ 4. GENERALIZATION │
│ ────────────── │
│ Does it work on inputs it hasn't seen? │
│ │
│ • Hold-out test set (not used in training) │
│ • Out-of-distribution inputs │
│ • Edge cases │
│ │
│ 5. FAILURE ANALYSIS │
│ ──────────────── │
│ Where does the model fail? │
│ │
│ • Categorize errors │
│ • Identify patterns │
│ • Use insights to generate targeted training data │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Evaluation Process
┌─────────────────────────────────────────────────────────────────────────┐
│ EVALUATION PROCESS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: PREPARE EVALUATION SETS │
│ ─────────────────────────────── │
│ • Hold-out set: 500-2,000 examples not used in training │
│ • Diverse set: Cover all sub-tasks and difficulty levels │
│ • Edge cases: Unusual inputs, adversarial examples │
│ • Real data: If available, actual production inputs │
│ │
│ STEP 2: RUN COMPARISONS │
│ ─────────────────────── │
│ For each evaluation example, get outputs from: │
│ • Teacher model (baseline) │
│ • Student model (your distilled model) │
│ • Base model (student before distillation) │
│ │
│ STEP 3: AUTOMATED SCORING │
│ ───────────────────────── │
│ • For verifiable tasks: Check correctness automatically │
│ • For generation: Use automated metrics (BLEU, ROUGE) │
│ • For quality: Use LLM-as-judge with strong model (GPT-4, Claude) │
│ │
│ LLM-as-judge prompt: │
│ "Given this instruction and two responses, which is better? │
│ Consider accuracy, helpfulness, and coherence. │
│ Instruction: [instruction] │
│ Response A: [teacher response] │
│ Response B: [student response] │
│ Winner: A or B or Tie" │
│ │
│ STEP 4: HUMAN EVALUATION (Sample) │
│ ───────────────────────────────── │
│ Have humans evaluate a sample (100-500 examples): │
│ • Rate quality 1-5 │
│ • Flag errors │
│ • Note patterns │
│ │
│ STEP 5: COMPUTE METRICS │
│ ─────────────────────── │
│ • Win rate vs teacher │
│ • Win rate vs base model │
│ • Task accuracy/quality │
│ • Error rate by category │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Evaluating Your Model: Practical Implementation
Evaluation is where you discover whether your distillation actually worked. The temptation is to look at a few examples and declare success, but rigorous evaluation is what separates a model you can deploy from one that will embarrass you in production.
The most honest evaluation is head-to-head comparison: show the same prompt to both teacher and student, then judge which response is better. This directly measures what you care about—did the student learn from the teacher? Win rate gives you a single number to track across iterations.
For tasks with verifiable answers (math, code, factual questions), automated evaluation is gold. You can run thousands of evaluations without human effort, and the results are objective. For open-ended tasks (writing, advice, analysis), you'll need LLM-as-judge or human evaluation. LLM-as-judge is scalable but introduces its own biases; human evaluation is expensive but trustworthy.
The baseline comparison is often overlooked but important: compare your distilled model not just to the teacher, but also to the base student model before distillation. This tells you whether distillation actually added value, or whether the base model could have done the job with the right prompting.
# evaluate.py
from openai import OpenAI
def compare_responses(prompt, response_a, response_b, judge_client):
"""LLM-as-judge comparison. Returns 'A', 'B', or 'tie'."""
judge_prompt = f"""Compare these two responses to the same prompt.
Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}
Which response is better? Consider accuracy, helpfulness, and clarity.
Answer with just: A, B, or tie"""
result = judge_client.chat.completions.create(
model="gpt-4o", # Strong judge for reliable comparisons
messages=[{"role": "user", "content": judge_prompt}],
temperature=0,
)
return result.choices[0].message.content.strip().lower()
def compute_win_rate(test_examples, student_client, teacher_client, judge_client):
"""Compare student vs teacher on test set."""
wins, losses, ties = 0, 0, 0
for ex in test_examples:
prompt = ex["messages"][0]["content"]
student_resp = student_client.chat.completions.create(...).choices[0].message.content
teacher_resp = teacher_client.chat.completions.create(...).choices[0].message.content
result = compare_responses(prompt, student_resp, teacher_resp, judge_client)
if "a" in result: wins += 1 # Student was A
elif "b" in result: losses += 1
else: ties += 1
return {"win_rate": wins / len(test_examples), "wins": wins, "losses": losses, "ties": ties}
A few non-obvious evaluation practices that matter:
Position bias: LLM judges tend to prefer the first response they see. Randomize whether student or teacher is "A" or "B" for each comparison, then aggregate results. Without this, your win rates will be systematically biased.
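A small wrapper around compare_responses handles this: flip a coin for which model sits in slot A, then map the judge's verdict back. A sketch, assuming the compare_responses function defined above:

import random

def compare_unbiased(prompt, student_resp, teacher_resp, judge_client):
    # Randomize which response appears as 'A' to counter position bias
    student_first = random.random() < 0.5
    a, b = (student_resp, teacher_resp) if student_first else (teacher_resp, student_resp)
    verdict = compare_responses(prompt, a, b, judge_client)  # 'a', 'b', or 'tie'
    if "tie" in verdict:
        return "tie"
    student_slot = "a" if student_first else "b"
    return "student" if student_slot in verdict else "teacher"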
Test set contamination: Your test set must be completely separate from training data. It's easy to accidentally leak examples, especially when iterating on data generation. Keep test data in a separate directory and never touch it during development.
Stratified evaluation: Don't just compute overall win rate—break it down by category, difficulty, response length. You might discover your student excels at simple questions but fails on complex ones, or handles one topic well but struggles with another. These insights guide your next iteration.
Failure analysis over metrics: An 85% win rate is good, but what about the 15% of losses? Manually review every loss. Often you'll find patterns—specific question types, edge cases, or capability gaps. These patterns become your roadmap for improvement: generate more training data in those areas, or accept the limitation.
Iteration Based on Evaluation
┌─────────────────────────────────────────────────────────────────────────┐
│ ITERATION STRATEGY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ IF: Student wins <70% vs teacher │
│ ─────────────────────────────── │
│ Significant gap. Try: │
│ 1. More training data (2-5× current amount) │
│ 2. Higher quality data (stricter filtering) │
│ 3. Larger student model │
│ 4. Higher LoRA rank │
│ 5. Chain-of-thought distillation (if not already using) │
│ │
│ IF: Student wins 70-85% vs teacher │
│ ─────────────────────────────────── │
│ Reasonable but improvable. Try: │
│ 1. Analyze failure cases → generate targeted data │
│ 2. Include harder examples (Evol-Instruct) │
│ 3. Slightly longer training │
│ 4. Hyperparameter tuning │
│ │
│ IF: Student wins 85-95% vs teacher │
│ ─────────────────────────────────── │
│ Good result! For most tasks, this is success. │
│ Further improvements may have diminishing returns. │
│ Consider: │
│ 1. Is this good enough for your use case? │
│ 2. Focus on deployment optimization │
│ 3. Minor targeted improvements for specific failure modes │
│ │
│ IF: Student wins >95% vs teacher │
│ ──────────────────────────────── │
│ Excellent! Either: │
│ • Your task is well-suited for distillation │
│ • You might be able to use an even smaller student │
│ • Check that evaluation isn't overfitting to training distribution │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Step 6: Deploy the Distilled Model
Deployment Preparation
┌─────────────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT PREPARATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ OPTION 1: MERGE LORA → QUANTIZE → DEPLOY │
│ ────────────────────────────────────────── │
│ Best for: Simple deployment, maximum compatibility │
│ │
│ Steps: │
│ 1. Merge LoRA weights into base model │
│ 2. Quantize merged model (GPTQ, AWQ, or GGUF) │
│ 3. Deploy quantized model │
│ │
│ Result: Single model file, works everywhere │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTION 2: KEEP LORA SEPARATE │
│ ───────────────────────────── │
│ Best for: Multi-task serving, A/B testing │
│ │
│ Steps: │
│ 1. Keep base model and LoRA adapter separate │
│ 2. Load adapter at inference time │
│ 3. Can swap adapters for different tasks │
│ │
│ Result: One base model, multiple adapters │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ QUANTIZATION OPTIONS: │
│ │
│ Format Precision Size (7B) Best For │
│ ──────────────────────────────────────────────────────────────────────│
│ FP16 16-bit 14 GB Maximum quality │
│ INT8 8-bit 7 GB Good balance │
│ INT4 (GPTQ) 4-bit 3.5 GB GPU serving │
│ INT4 (AWQ) 4-bit 3.5 GB GPU serving, often better │
│ GGUF Q4_K_M 4-bit 4 GB CPU/llama.cpp │
│ GGUF Q5_K_M 5-bit 5 GB CPU, better quality │
│ │
│ Quality retention: │
│ FP16: 100% | INT8: ~99% | INT4: 95-98% │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Preparing for Deployment: Practical Implementation
The path from trained model to production involves three steps: merge LoRA weights into the base model, quantize to reduce memory requirements, and export to your serving format. Each step is straightforward, but getting the details right matters.
Merging LoRA is conceptually simple—you're combining the small adapter weights with the frozen base weights to create a single model. The merged model behaves identically to base + adapter at inference time, but eliminates the adapter overhead and works with any inference framework. After merging, you can discard the adapter files.
Quantization is where you trade precision for efficiency. A 7B model in FP16 requires 14GB; the same model in INT4 requires 3.5GB and runs 2-3x faster. The quality loss is minimal for most tasks—INT4 typically retains 95-98% of FP16 quality. The key is choosing the right quantization format for your deployment target: AWQ or GPTQ for GPU serving (vLLM, TGI), GGUF for CPU or edge deployment (llama.cpp, Ollama).
The choice between quantization formats has practical implications. AWQ generally produces better quality than GPTQ at the same bit width, but GPTQ has broader tooling support. GGUF is the most flexible—it runs on CPUs, Apple Silicon, and GPUs—but throughput is lower than GPU-native formats. Match the format to your deployment environment.
# deploy.py - Merge, quantize, and prepare for serving
# 1. Merge LoRA into base model
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "./distilled_model_lora")
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B").save_pretrained("./merged_model")
# 2. Quantize with AutoAWQ (for GPU serving)
# pip install autoawq
from awq import AutoAWQForCausalLM
tokenizer = AutoTokenizer.from_pretrained("./merged_model")
model = AutoAWQForCausalLM.from_pretrained("./merged_model")
model.quantize(tokenizer, quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"})
model.save_quantized("./merged_model_awq")
tokenizer.save_pretrained("./merged_model_awq")
# 3. Convert to GGUF (for llama.cpp / Ollama) -- shell commands, run from a llama.cpp checkout:
#    python convert_hf_to_gguf.py ./merged_model --outfile model.gguf --outtype f16
#    ./llama-quantize model.gguf model-q4_k_m.gguf q4_k_m
# 4. Serve with vLLM (GPU) -- shell command:
#    python -m vllm.entrypoints.openai.api_server \
#        --model ./merged_model_awq \
#        --quantization awq \
#        --max-model-len 4096 \
#        --port 8000
The conversion to GGUF deserves extra attention. The convert_hf_to_gguf.py script (from llama.cpp) converts HuggingFace models to GGUF format, and llama-quantize applies quantization. The q4_k_m quantization level is a good default—it's 4-bit with a medium k-quant variant that balances size and quality. For higher quality, use q5_k_m or q6_k; for smaller size, use q4_k_s or q3_k_m.
For Ollama deployment, create a Modelfile pointing to your GGUF and run ollama create mymodel -f Modelfile. This gives you a simple API and easy model management, though throughput is lower than vLLM.
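As a concrete sketch of that step (the paths, the model name "support-slm", and the system prompt are placeholder assumptions, not values from the pipeline above), a short script can write the Modelfile and register the GGUF with Ollama:
# make_ollama_model.py - minimal sketch: write a Modelfile for the quantized
# GGUF and register it with Ollama. Paths, model name, and system prompt are
# illustrative placeholders.
import subprocess
from pathlib import Path

MODELFILE = '''FROM ./model-q4_k_m.gguf
PARAMETER temperature 0.3
SYSTEM """
You are a concise, helpful customer support assistant.
"""
'''

Path("Modelfile").write_text(MODELFILE)

# Equivalent to running: ollama create support-slm -f Modelfile
subprocess.run(["ollama", "create", "support-slm", "-f", "Modelfile"], check=True)

# Quick check from a shell afterwards: ollama run support-slm "How do I return an item?"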
One often-missed step: always test inference after each conversion step. Run the same prompts through the merged model, the quantized model, and the final deployed model. Verify that responses are substantively identical. Quantization and format conversion occasionally introduce subtle bugs—better to catch them before production.
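A minimal version of that check might look like the sketch below. It assumes the ./merged_model and ./merged_model_awq directories produced earlier (loading the AWQ model through transformers requires autoawq to be installed), and the prompts are illustrative:
# smoke_test.py - minimal sketch: run the same prompts through the merged FP16
# model and the AWQ-quantized model and compare the outputs side by side.
# Paths and prompts are illustrative, not from the original pipeline.
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPTS = [
    "A customer asks how to return a damaged item. Draft a reply.",
    "Summarize a typical refund policy in two sentences.",
]

def generate(model_path: str, prompts: list[str]) -> list[str]:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto")
    outputs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)
        # Strip the prompt tokens so only the generated continuation is decoded.
        outputs.append(tokenizer.decode(ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
    return outputs

for path in ["./merged_model", "./merged_model_awq"]:
    print(f"=== {path} ===")
    for prompt, answer in zip(PROMPTS, generate(path, PROMPTS)):
        print(f"[{prompt[:40]}...] -> {answer[:200]}")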
Serving Options
┌─────────────────────────────────────────────────────────────────────────┐
│ SERVING OPTIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ vLLM (GPU - High throughput): │
│ ────────────────────────────── │
│ Best for: Production GPU serving, high throughput │
│ Features: PagedAttention, continuous batching, tensor parallelism │
│ Latency: Low | Throughput: Very high | Setup: Moderate │
│ │
│ python -m vllm.entrypoints.openai.api_server \ │
│ --model ./merged_model \ │
│ --quantization awq \ │
│ --max-model-len 4096 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TGI (GPU - Production ready): │
│ ────────────────────────────── │
│ Best for: Production deployment, Hugging Face ecosystem │
│ Features: Optimized kernels, streaming, OpenAI-compatible API │
│ Latency: Low | Throughput: High | Setup: Easy (Docker) │
│ │
│ docker run --gpus all \ │
│ -v ./merged_model:/model \ │
│ ghcr.io/huggingface/text-generation-inference \ │
│ --model-id /model │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ llama.cpp (CPU/Edge): │
│ ────────────────────── │
│ Best for: CPU inference, edge devices, local deployment │
│ Features: GGUF format, runs anywhere, no GPU needed │
│ Latency: Higher | Throughput: Lower | Setup: Very easy │
│ │
│ ./llama-server -m model.gguf --port 8080 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Ollama (Local - User friendly): │
│ ───────────────────────────────── │
│ Best for: Local development, demos, simple deployment │
│ Features: One-command setup, model management │
│ Latency: Moderate | Throughput: Moderate | Setup: Very easy │
│ │
│ ollama create mymodel -f Modelfile │
│ ollama serve │
│ │
└─────────────────────────────────────────────────────────────────────────┘
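vLLM, TGI, and Ollama all expose OpenAI-compatible endpoints, so the client side barely changes between them. A minimal sketch, assuming the vLLM server started earlier on localhost:8000 (the model name must match what the server was launched with; the api_key value and prompt are placeholders):
# client.py - minimal sketch of calling the served model through the
# OpenAI-compatible API exposed by vLLM (Ollama and TGI work similarly).
# Base URL, API key, model name, and prompt are assumptions for a local setup.
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="./merged_model_awq",  # vLLM reports the --model path as the model name by default
    messages=[
        {"role": "system", "content": "You are a customer support assistant."},
        {"role": "user", "content": "How do I return a damaged item?"},
    ],
    max_tokens=200,
    temperature=0.3,
)
print(response.choices[0].message.content)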
Production Considerations
┌─────────────────────────────────────────────────────────────────────────┐
│ PRODUCTION CONSIDERATIONS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MONITORING: │
│ ─────────── │
│ • Response latency (P50, P95, P99) │
│ • Throughput (requests/second) │
│ • Error rate │
│ • Model-specific: response length, refusal rate │
│ │
│ QUALITY ASSURANCE: │
│ ────────────────── │
│ • Sample and review outputs regularly │
│ • User feedback collection │
│ • Automated quality checks (if possible) │
│ • A/B testing against alternatives │
│ │
│ FALLBACK STRATEGY: │
│ ────────────────── │
│ • When should requests fall back to teacher model? │
│ • Low confidence? Specific topics? User request? │
│ • Monitor fallback rate (should be <5-10%) │
│ │
│ CONTINUOUS IMPROVEMENT: │
│ ───────────────────────── │
│ • Collect failure cases from production │
│ • Generate targeted training data │
│ • Periodic retraining │
│ • Version management │
│ │
│ COST TRACKING: │
│ ────────────── │
│ • Compute cost per request │
│ • Compare to teacher API cost │
│ • Track ROI of distillation investment │
│ │
└─────────────────────────────────────────────────────────────────────────┘
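Of these, the fallback strategy is worth making explicit in code early. The sketch below is one possible policy, not a prescribed one: it approximates confidence with the student's average token log-probability (vLLM's OpenAI-compatible server can return logprobs) and escalates to a teacher API below a threshold. The threshold, model names, and teacher endpoint are assumptions to tune for your own setup.
# fallback.py - minimal sketch of a confidence-based fallback policy:
# answer with the distilled student, escalate to a teacher API when the
# student's average token log-probability falls below a threshold.
# Threshold, model names, and teacher endpoint are illustrative assumptions.
from openai import OpenAI

student = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
teacher = OpenAI()  # hosted API; reads OPENAI_API_KEY from the environment

CONFIDENCE_THRESHOLD = -1.0  # avg log-prob per token; tune on a held-out set

def answer(messages: list[dict]) -> tuple[str, bool]:
    """Return (reply, used_fallback)."""
    resp = student.chat.completions.create(
        model="./merged_model_awq",
        messages=messages,
        max_tokens=300,
        logprobs=True,
    )
    choice = resp.choices[0]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    avg_logprob = sum(token_logprobs) / max(len(token_logprobs), 1)

    if avg_logprob >= CONFIDENCE_THRESHOLD:
        return choice.message.content, False

    # Low confidence: escalate to the teacher; log the case for later retraining.
    resp = teacher.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model name
        messages=messages,
        max_tokens=300,
    )
    return resp.choices[0].message.content, True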
Real-World Example: Customer Support Distillation
Let's walk through a concrete example end-to-end:
┌─────────────────────────────────────────────────────────────────────────┐
│ EXAMPLE: CUSTOMER SUPPORT BOT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TASK: Respond to customer support tickets for an e-commerce company │
│ │
│ REQUIREMENTS: │
│ • Handle order inquiries, returns, product questions │
│ • Empathetic, helpful tone │
│ • Response time <500ms │
│ • Run on single A10 GPU │
│ • 90%+ quality vs current GPT-4 solution │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STEP 1: MODEL SELECTION │
│ Teacher: Llama 3.1 70B (open source, strong instruction following) │
│ Student: Qwen 2.5 7B (good balance of capability and efficiency) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STEP 2: DATA GENERATION │
│ │
│ Sources: │
│ • 5,000 real support tickets (anonymized) │
│ • Generate teacher responses for each │
│ • Use Self-Instruct to generate 50,000 synthetic tickets │
│ • Apply Evol-Instruct to create harder variations │
│ │
│ Filtering: │
│ • Remove low-quality (<4 rating by LLM judge) │
│ • Remove duplicates │
│ • Balance categories (orders, returns, products) │
│ │
│ Final dataset: 45,000 high-quality examples │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STEP 3: TRAINING │
│ │
│ Configuration: │
│ • QLoRA (4-bit) for memory efficiency │
│ • r=32, alpha=64 │
│ • Learning rate: 2e-4 │
│ • 3 epochs │
│ • Batch size: 32 (effective) │
│ │
│ Hardware: Single A100 80GB │
│ Time: ~8 hours │
│ Cost: ~$50 (cloud GPU) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STEP 4: EVALUATION │
│ │
│ Results on 500 held-out tickets: │
│ • Win rate vs teacher: 87% │
│ • Win rate vs base Qwen 7B: 94% │
│ • Human quality rating: 4.2/5 (vs teacher 4.5/5) │
│ • Latency: 180ms (vs teacher 2.5s) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STEP 5: DEPLOYMENT │
│ │
│ • Merge LoRA into base model │
│ • Quantize to AWQ 4-bit │
│ • Deploy with vLLM on A10 GPU │
│ • Throughput: 50 requests/second │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COST COMPARISON: │
│ │
│ Before (GPT-4): │
│ • $0.03 per ticket × 100K tickets/month = $3,000/month │
│ │
│ After (Distilled 7B): │
│ • A10 GPU: $500/month │
│ • Handles all 100K tickets │
│ │
│ Savings: $2,500/month = $30,000/year │
│ ROI: Distillation cost (~$500) paid back in <1 week │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary
Task-specific SLM distillation is a powerful technique for deploying AI capabilities efficiently:
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY TAKEAWAYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DEFINE YOUR TASK PRECISELY │
│ The clearer your task definition, the better your results. │
│ │
│ 2. CHOOSE MODELS WISELY │
│ • Teacher: Open-source (Llama, Qwen, DeepSeek) for legal safety │
│ • Student: Match to your deployment hardware │
│ │
│ 3. DATA QUALITY > QUANTITY │
│ 10K excellent examples beat 100K poor ones. │
│ Filter aggressively. │
│ │
│ 4. USE CHAIN-OF-THOUGHT FOR REASONING │
│ For complex tasks, distill the thinking process, not just answers. │
│ │
│ 5. LORA/QLORA IS USUALLY ENOUGH │
│ Full fine-tuning rarely needed for distillation. │
│ LoRA is faster, cheaper, and prevents forgetting. │
│ │
│ 6. ITERATE BASED ON EVALUATION │
│ Analyze failures, generate targeted data, retrain. │
│ This cycle is where the magic happens. │
│ │
│ 7. QUANTIZE FOR DEPLOYMENT │
│ INT4 quantization gives 95%+ quality at 4× memory reduction. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TYPICAL RESULTS: │
│ • 85-95% of teacher capability │
│ • 10-100× cost reduction │
│ • 3-10× latency improvement │
│ • 1-2 weeks to implement │
│ • $1K-$10K total cost │
│ │
└─────────────────────────────────────────────────────────────────────────┘