How much performance loss should I expect from distillation?

The loss depends heavily on task complexity: simple tasks (classification) distill better than complex tasks (open-ended reasoning). Typical ranges:

Same architecture, 10× smaller: 5-15% performance loss
Same architecture, 100× smaller: 15-30% performance loss
Cross-architecture: highly variable

Do I need access to the teacher's internals?

No. For LLMs, response-level distillation (just using the teacher's outputs) works well and only requires API access. Logit-level distillation (accessing probability distributions) can help but isn't required.

How much synthetic data do I need?

More is generally better, but quality matters more than quantity. Rules of thumb:

Basic instruction following: 50K-100K examples
Complex reasoning: 500K-1M examples
Domain-specific: depends on domain complexity

Can the student ever exceed the teacher?

In limited ways, yes:

Multi-teacher distillation can combine strengths
Self-distillation with filtering can improve over time
The student may generalize better on specific distributions

But typically, the teacher sets an upper bound on capability.

Is distillation legal for commercial use?

This depends on the teacher model's terms of service. Many API providers (OpenAI, Anthropic) have terms that restrict using their outputs to train competing models. Open-weight teachers (Llama, Mistral) are often more permissive, but their licenses differ and some attach conditions to training on model outputs. Check the specific terms for your teacher model.

Knowledge Distillation for LLMs: Compressing Intelligence | Enrico Piovano

What is Knowledge Distillation?

Knowledge distillation is the process of transferring learned capabilities from a large, powerful "teacher" model to a smaller, more efficient "student" model. The student learns to mimic the teacher's behavior, often achieving surprisingly close performance at a fraction of the computational cost.

Think of it like an experienced expert teaching an apprentice. The apprentice doesn't need to rediscover everything from scratch—they can learn from watching the expert work, understanding their reasoning, and imitating their approach. The apprentice may never match the expert perfectly, but they can become very capable much faster than learning independently.

┌─────────────────────────────────────────────────────────────────────────┐
│                    KNOWLEDGE DISTILLATION                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TEACHER MODEL                           STUDENT MODEL                   │
│  (Large, expensive)                      (Small, efficient)              │
│                                                                          │
│  ┌─────────────────┐                    ┌─────────────────┐             │
│  │                 │                    │                 │             │
│  │   GPT-4 / 70B   │   ──Distill──→    │   7B / 3B / 1B  │             │
│  │                 │                    │                 │             │
│  └─────────────────┘                    └─────────────────┘             │
│                                                                          │
│  Properties:                            Properties:                      │
│  • High accuracy                        • Lower accuracy (but close!)   │
│  • Slow inference                       • Fast inference                 │
│  • Expensive to run                     • Cheap to run                   │
│  • High memory                          • Low memory                     │
│  • Cloud-only                           • Edge-deployable                │
│                                                                          │
│  Example performance:                                                    │
│  Teacher (70B): 85% accuracy            Student (7B): 80% accuracy      │
│  Teacher cost: $0.01/query              Student cost: $0.0005/query     │
│                                                                          │
│  5-point accuracy loss for 20× cost reduction                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why Distillation Matters

The best LLMs are powerful but impractical for many use cases:

Cost: GPT-4 or Claude costs $10-30 per million tokens. At scale, this adds up to millions of dollars.

Latency: Large models are slow. A 70B model might take 2-5 seconds to generate a response. Some applications need <100ms.

Privacy: Sending data to cloud APIs may violate privacy requirements. On-device models keep data local.

Offline: Edge devices, embedded systems, and air-gapped environments can't access cloud APIs.

Reliability: Depending on external APIs creates failure modes. Local models are always available.

Distillation bridges this gap: get most of the capability at a fraction of the cost.

The Theory: Why Distillation Works

Hard Labels vs. Soft Labels

Traditional supervised learning uses "hard labels"—the single correct answer:

Input: "What is the capital of France?"
Hard label: "Paris" (probability = 1)

But a trained teacher model produces "soft labels"—a probability distribution over all possible answers:

Input: "What is the capital of France?"
Teacher distribution:
  "Paris": 0.92
  "Lyon": 0.03
  "Marseille": 0.02
  "France": 0.01
  "French": 0.005
  ...

The key insight: Soft labels contain more information than hard labels. The teacher's distribution reveals:

Confidence: How sure is the model? (Paris at 0.92 is high confidence)
Alternatives: What else could be plausible? (Lyon and Marseille are French cities)
Relationships: What's semantically related? (France, French)
Uncertainty: Where is the model unsure?

This extra information is called "dark knowledge"—knowledge implicit in the teacher's behavior that isn't captured by just the correct answer.

Why Dark Knowledge Helps

Consider teaching a student model to classify images of dogs and cats:

Hard labels say: "This is a cat." Soft labels say: "This is 95% cat, 4% dog, 1% other."

For a clear cat picture, both are similar. But for an ambiguous picture (maybe a fluffy cat that looks dog-like), soft labels convey uncertainty that helps the student learn the decision boundary better.

Another example with text:

Question: "Is a tomato a fruit or vegetable?"

Hard label: "Fruit" (botanically correct)

Soft labels from teacher:

"Fruit": 0.55
"Vegetable": 0.35
"Both": 0.08
"It depends": 0.02

The soft labels capture the genuine ambiguity—tomatoes are botanically fruits but culinarily vegetables. A student learning from soft labels understands this nuance better than one learning from hard labels.
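One rough way to see how much extra signal the soft labels carry is to compare the entropy of the two targets. The sketch below uses the tomato distribution from the example above; the helper name `entropy_bits` is illustrative, and entropy is only a proxy for "dark knowledge", not a measure the article itself defines.

```python
import math

def entropy_bits(distribution):
    """Shannon entropy in bits of a label distribution (dict: label -> probability)."""
    return sum(p * math.log2(1 / p) for p in distribution.values() if p > 0)

# Hard label: all probability mass on a single answer.
hard_label = {"Fruit": 1.0}
# Soft label: the teacher distribution from the tomato example.
soft_label = {"Fruit": 0.55, "Vegetable": 0.35, "Both": 0.08, "It depends": 0.02}

print(f"hard label entropy: {entropy_bits(hard_label):.2f} bits")
print(f"soft label entropy: {entropy_bits(soft_label):.2f} bits")
```

The hard label has zero entropy, so every training example says the same kind of thing ("this one class, nothing else"). The soft label spreads mass over alternatives, and that spread is what the student can learn from.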

Temperature: Controlling Softness

The "softness" of the probability distribution is controlled by a temperature parameter:

┌─────────────────────────────────────────────────────────────────────────┐
│                    TEMPERATURE IN DISTILLATION                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Softmax with temperature:                                               │
│                                                                          │
│  P(i) = exp(z_i / T) / Σ exp(z_j / T)                                   │
│                                                                          │
│  Where T = temperature, z = logits (pre-softmax values)                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Example logits: [2.0, 1.0, 0.5]                                        │
│                                                                          │
│  T = 1.0 (standard):  [0.63, 0.23, 0.14]  ← Sharp, peaked              │
│  T = 2.0 (softer):    [0.48, 0.29, 0.23]  ← More spread out            │
│  T = 5.0 (very soft): [0.39, 0.32, 0.29]  ← Nearly uniform             │
│  T = 0.5 (sharper):   [0.84, 0.11, 0.04]  ← Very peaked                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  For distillation:                                                       │
│  • T = 1: Uses standard probabilities                                   │
│  • T = 2-5: Typical for distillation (reveals more dark knowledge)     │
│  • T > 10: Too soft, loses discrimination                               │
│                                                                          │
│  Higher temperature reveals more information about relationships        │
│  between classes, helping the student learn better.                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
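The temperature-scaled softmax is short enough to compute directly. The following is a minimal pure-Python sketch of the formula P(i) = exp(z_i / T) / Σ exp(z_j / T), applied to the example logits; the function name is illustrative, and the max-subtraction step is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum avoids overflow in exp() and leaves the output unchanged.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for t in (0.5, 1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T = {t}: {[round(p, 2) for p in probs]}")
```

Running this for a few temperatures shows the pattern: the ordering of the classes never changes, only how much probability mass moves from the top class to the others.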

The Distillation Loss

The student is trained using a combination of losses:

┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTILLATION LOSS FUNCTION                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  L_total = α × L_distill + (1-α) × L_task                               │
│                                                                          │
│  Where:                                                                  │
│                                                                          │
│  L_distill = KL(P_teacher(T) || P_student(T))                           │
│            = Cross-entropy(teacher, student) minus teacher entropy       │
│            = "Match the teacher's soft predictions"                      │
│                                                                          │
│  L_task = CrossEntropy(y_true, P_student(T=1))                          │
│         = Standard supervised loss on ground truth labels                │
│         = "Get the right answer"                                         │
│                                                                          │
│  α = Weight balancing the two objectives (typically 0.5-0.9)            │
│  T = Temperature (same for teacher and student during distillation)    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Why both losses?                                                        │
│  • L_distill: Transfers dark knowledge from teacher                     │
│  • L_task: Ensures student still solves the actual task correctly       │
│                                                                          │
│  Without L_task, student might match teacher's errors perfectly!        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
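A minimal sketch of the combined loss for a single example follows, in pure Python so the arithmetic is visible. Function names and the `scale_by_t_squared` flag are illustrative. The T² factor is not part of the formula above; it is a common convention from the original distillation paper that keeps gradient magnitudes comparable across temperatures, and it is included here as an optional assumption.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      alpha=0.7, temperature=3.0, scale_by_t_squared=True):
    """L_total = alpha * L_distill + (1 - alpha) * L_task for one example."""
    # L_distill: match the teacher's softened distribution (both at temperature T).
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    l_distill = kl_divergence(p_teacher, p_student_soft)
    if scale_by_t_squared:
        l_distill *= temperature ** 2  # assumption: optional T^2 rescaling
    # L_task: ordinary cross-entropy against the ground-truth class at T = 1.
    p_student = softmax(student_logits, 1.0)
    l_task = -math.log(p_student[true_index] + 1e-12)
    return alpha * l_distill + (1 - alpha) * l_task

teacher_logits = [2.0, 1.0, 0.5]
student_logits = [1.5, 1.2, 0.3]
print(distillation_loss(student_logits, teacher_logits, true_index=0))
```

In a real training loop the same computation would run on batched tensors in an autodiff framework; the structure of the two terms stays the same.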

Types of Knowledge Distillation

1. Logit Distillation (Output-Level)

The classic approach: match the teacher's output probability distribution.

┌─────────────────────────────────────────────────────────────────────────┐
│                    LOGIT DISTILLATION                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│                         Input: "The cat sat on the..."                  │
│                                      │                                   │
│               ┌──────────────────────┼──────────────────────┐           │
│               ▼                                             ▼            │
│       ┌───────────────┐                           ┌───────────────┐     │
│       │    TEACHER    │                           │    STUDENT    │     │
│       │     (70B)     │                           │     (7B)      │     │
│       └───────┬───────┘                           └───────┬───────┘     │
│               │                                           │              │
│               ▼                                           ▼              │
│       P_teacher:                                   P_student:            │
│       mat: 0.45                                    mat: 0.38            │
│       floor: 0.25                                  floor: 0.28          │
│       couch: 0.15                                  couch: 0.18          │
│       ...                                          ...                   │
│               │                                           │              │
│               └───────────────────┬───────────────────────┘              │
│                                   │                                      │
│                                   ▼                                      │
│                        KL Divergence Loss                                │
│                        (minimize difference)                             │
│                                                                          │
│  Advantages:                                                             │
│  • Simple to implement                                                  │
│  • Works across architectures that share a vocabulary                   │
│  • Needs the teacher's output logits, but not hidden states             │
│                                                                          │
│  Limitations:                                                            │
│  • Only transfers final output behavior                                 │
│  • Misses intermediate reasoning                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
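To make the KL step concrete, here is a small sketch that computes the divergence for the next-token distributions in the diagram. Folding the elided probability mass ("...") into a single `<other>` bucket is an assumption made so that both distributions sum to 1; with a real model the sum would run over the full shared vocabulary, and the bucketed value is only a lower bound on that.

```python
import math

def with_remainder(listed, key="<other>"):
    """Add one bucket holding the probability mass not listed explicitly."""
    out = dict(listed)
    out[key] = 1.0 - sum(listed.values())
    return out

def kl_divergence(p, q):
    """KL(p || q); p and q map token -> probability over the same tokens."""
    return sum(p[tok] * math.log(p[tok] / q[tok]) for tok in p if p[tok] > 0)

def mean_token_kl(teacher_steps, student_steps):
    """Average the per-position KL over a sequence of next-token distributions."""
    pairs = list(zip(teacher_steps, student_steps))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

teacher = with_remainder({"mat": 0.45, "floor": 0.25, "couch": 0.15})
student = with_remainder({"mat": 0.38, "floor": 0.28, "couch": 0.18})

print(f"KL(teacher || student) = {kl_divergence(teacher, student):.4f}")
```

For language models the loss is normally averaged over every token position in the sequence, which is what `mean_token_kl` sketches.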

2. Response Distillation (Sequence-Level)

For language models, instead of matching token-level distributions, train the student to generate the same responses as the teacher:

┌─────────────────────────────────────────────────────────────────────────┐
│                    RESPONSE DISTILLATION                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. Collect prompts: ["Explain photosynthesis", "Write a poem", ...]   │
│                                                                          │
│  2. Generate teacher responses:                                          │
│     ┌─────────────────────────────────────────────────────────────┐    │
│     │  Prompt: "Explain photosynthesis briefly"                    │    │
│     │                                                              │    │
│     │  Teacher Response: "Photosynthesis is the process by which  │    │
│     │  plants convert sunlight, water, and CO2 into glucose and   │    │
│     │  oxygen. It occurs in chloroplasts and is essential for     │    │
│     │  life on Earth."                                             │    │
│     └─────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  3. Train student on (prompt, teacher_response) pairs:                  │
│     Standard language modeling loss on teacher's text                    │
│                                                                          │
│  Advantages:                                                             │
│  • Natural for language models                                          │
│  • Captures full response quality, not just token probabilities        │
│  • Easy to scale with synthetic data generation                         │
│                                                                          │
│  This is how most LLM distillation works in practice!                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
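Step 3 can be sketched as turning each (prompt, teacher_response) pair into token IDs plus labels. The toy whitespace tokenizer below is purely illustrative. Masking the prompt tokens so the loss covers only the teacher's response is a common choice rather than something the description above requires, and the value -100 mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # label value that a loss function can be told to skip

class ToyTokenizer:
    """Hypothetical whitespace tokenizer that assigns IDs on first sight."""
    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]

def build_example(prompt, teacher_response, tokenize):
    """Concatenate prompt and response; compute loss only on response tokens."""
    prompt_ids = tokenize(prompt)
    response_ids = tokenize(teacher_response)
    return {
        "input_ids": prompt_ids + response_ids,
        # Prompt positions are masked out; response positions are the targets.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }

tokenizer = ToyTokenizer()
example = build_example(
    "Explain photosynthesis briefly",
    "Photosynthesis is the process by which plants convert sunlight into glucose.",
    tokenizer,
)
print(example["input_ids"])
print(example["labels"])
```

A real pipeline would use the student's own tokenizer and chat template, but the shape of the training example is the same.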

3. Feature Distillation (Intermediate-Level)

Match the teacher's internal representations, not just outputs:

┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE DISTILLATION                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TEACHER (24 layers)              STUDENT (12 layers)                   │
│  ┌───────────────┐                ┌───────────────┐                     │
│  │   Layer 24    │ ──────────────→│   Layer 12    │  Match layer 12    │
│  │   Layer 23    │                │               │                     │
│  │   Layer 22    │                │               │                     │
│  │   Layer 21    │                │               │                     │
│  │   Layer 20    │                │               │                     │
│  │   Layer 19    │                │               │                     │
│  │   Layer 18    │ ──────────────→│   Layer 9     │  Match layer 9     │
│  │   Layer 17    │                │               │                     │
│  │   ...         │                │   ...         │                     │
│  │   Layer 12    │ ──────────────→│   Layer 6     │  Match layer 6     │
│  │   ...         │                │   ...         │                     │
│  │   Layer 6     │ ──────────────→│   Layer 3     │  Match layer 3     │
│  │   ...         │                │   ...         │                     │
│  │   Layer 1     │                │   Layer 1     │                     │
│  └───────────────┘                └───────────────┘                     │
│                                                                          │
│  Loss = Σ MSE(transform(teacher_layer_i), student_layer_j)              │
│                                                                          │
│  Advantages:                                                             │
│  • Transfers internal representations, not just behavior                │
│  • Can improve performance over output-only distillation               │
│                                                                          │
│  Challenges:                                                             │
│  • Need access to teacher internals (not possible with APIs)           │
│  • Dimension mismatch requires projection layers                        │
│  • Choosing which layers to match is non-trivial                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
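The layer matching and the loss can be sketched in a few lines. This is a simplified illustration, assuming a uniform depth mapping (student layer j pairs with the teacher layer at the same relative depth) and a fixed projection matrix; in practice the projection is learned jointly with the student, and all names here are hypothetical.

```python
def map_layers(num_teacher_layers, num_student_layers, student_layers):
    """Pair each chosen student layer with the teacher layer at the same relative depth."""
    ratio = num_teacher_layers / num_student_layers
    return {s: round(s * ratio) for s in student_layers}

def project(vector, weights):
    """Linear projection; weights has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def feature_loss(teacher_hidden, student_hidden, layer_map, weights):
    """Sum of MSE(project(teacher_layer_i), student_layer_j) over matched pairs."""
    return sum(
        mse(project(teacher_hidden[t], weights), student_hidden[s])
        for s, t in layer_map.items()
    )

layer_map = map_layers(24, 12, [3, 6, 9, 12])
print(layer_map)  # student layer -> teacher layer

# Toy hidden states: teacher width 4, student width 2, one vector per layer.
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
teacher_hidden = {t: [0.1 * t, 0.2, 0.3, 0.4] for t in layer_map.values()}
student_hidden = {s: [0.2 * s, 0.1] for s in layer_map}
print(feature_loss(teacher_hidden, student_hidden, layer_map, weights))
```

The projection handles the dimension mismatch mentioned under Challenges: the teacher's wider hidden state is mapped into the student's width before the two are compared.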

4. Chain-of-Thought Distillation

One of the most important techniques for LLMs: distill not just answers but reasoning processes.

┌─────────────────────────────────────────────────────────────────────────┐
│                    CHAIN-OF-THOUGHT DISTILLATION                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STANDARD DISTILLATION:                                                  │
│  ────────────────────────                                                │
│  Input: "If John has 3 apples and gives away 2, how many does he have?" │
│  Teacher output: "1"                                                     │
│  Student learns: Map question → answer                                  │
│                                                                          │
│  CHAIN-OF-THOUGHT DISTILLATION:                                         │
│  ────────────────────────────────                                        │
│  Input: "If John has 3 apples and gives away 2, how many does he have?" │
│  Teacher output: "Let me work through this step by step.                │
│                   John starts with 3 apples.                            │
│                   He gives away 2 apples.                               │
│                   3 - 2 = 1                                             │
│                   John has 1 apple left."                               │
│  Student learns: Map question → reasoning → answer                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Why this matters:                                                       │
│                                                                          │
│  • Small models struggle with reasoning                                 │
│  • But they CAN follow reasoning steps if shown how                     │
│  • CoT distillation teaches the reasoning process, not just the answer │
│  • Student learns to generate intermediate steps                        │
│  • Dramatically improves performance on math, logic, multi-step tasks  │
│                                                                          │
│  Research finding: "Students trained on CoT data from teachers          │
│  significantly outperform students trained on answer-only data,         │
│  especially on complex reasoning tasks."                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
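Building a chain-of-thought training set might look like the sketch below: the student's target is the teacher's full reasoning, not just the final number. Dropping rationales whose final answer disagrees with a known reference is an added assumption (it only works when reference answers exist), and the "last integer in the text" extraction rule is a deliberately naive stand-in for a real answer parser.

```python
import re

def extract_final_answer(rationale):
    """Return the last integer mentioned in the rationale, or None if there is none."""
    numbers = re.findall(r"-?\d+", rationale)
    return int(numbers[-1]) if numbers else None

def build_cot_examples(samples):
    """Keep (question, rationale) pairs whose final answer matches the reference."""
    kept = []
    for sample in samples:
        if extract_final_answer(sample["rationale"]) == sample["reference"]:
            # The training target is the whole reasoning chain.
            kept.append({"prompt": sample["question"], "target": sample["rationale"]})
    return kept

question = "If John has 3 apples and gives away 2, how many does he have?"
samples = [
    {
        "question": question,
        "rationale": (
            "Let me work through this step by step. John starts with 3 apples. "
            "He gives away 2 apples. 3 - 2 = 1. John has 1 apple left."
        ),
        "reference": 1,
    },
    {
        "question": question,
        "rationale": "John starts with 3 apples and gives away 2, so he has 5.",
        "reference": 1,
    },
]

for example in build_cot_examples(samples):
    print(example["target"])
```

Only the first rationale survives the filter, so the student never trains on the faulty chain that ends in 5.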

Synthetic Data Generation for Distillation

The Data Challenge

Distillation requires training data—pairs of inputs and teacher outputs. Where does this data come from?

Option 1: Use existing datasets

Pros: High quality, diverse
Cons: Limited quantity, may not cover your use case

Option 2: Generate synthetic data from the teacher

Pros: Unlimited quantity, tailored to your needs
Cons: Risk of amplifying teacher errors, potential quality issues

Synthetic data generation has become the dominant approach for LLM distillation.

The Synthetic Data Pipeline

┌─────────────────────────────────────────────────────────────────────────┐
│                    SYNTHETIC DATA GENERATION PIPELINE                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STEP 1: SEED PROMPTS                                                    │
│  ───────────────────                                                     │
│  Start with diverse seed prompts or topics:                              │
│  • From existing datasets (ShareGPT, Alpaca)                            │
│  • Generated programmatically (templates + variations)                  │
│  • Sampled from real user queries                                       │
│                                                                          │
│           ↓                                                              │
│                                                                          │
│  STEP 2: PROMPT AUGMENTATION                                            │
│  ───────────────────────────                                             │
│  Expand seeds into many variations:                                      │
│  • "Explain X" → "Explain X to a beginner"                              │
│                → "Explain X in detail"                                  │
│                → "Explain X with examples"                              │
│                → "Explain X in one sentence"                            │
│                                                                          │
│           ↓                                                              │
│                                                                          │
│  STEP 3: TEACHER GENERATION                                              │
│  ──────────────────────────                                              │
│  Generate responses with the teacher model:                              │
│  • Use high temperature for diversity                                   │
│  • Generate multiple responses per prompt                               │
│  • Include chain-of-thought when appropriate                            │
│                                                                          │
│           ↓                                                              │
│                                                                          │
│  STEP 4: QUALITY FILTERING                                               │
│  ─────────────────────────                                               │
│  Remove low-quality generations:                                         │
│  • Filter by length (too short/long)                                    │
│  • Filter by perplexity (nonsensical text)                              │
│  • Filter by self-consistency (regenerate and compare)                  │
│  • Filter by reference answer (if available)                            │
│  • Human evaluation on samples                                          │
│                                                                          │
│           ↓                                                              │
│                                                                          │
│  STEP 5: TRAINING DATA                                                   │
│  ─────────────────────                                                   │
│  Final dataset: [(prompt_1, response_1), (prompt_2, response_2), ...]   │
│  Typically 100K - 10M examples                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
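The five steps can be wired together in a short sketch. Everything here is illustrative: `teacher` is any callable that maps a prompt to a response (a stand-in for a real model or API call), the augmentation templates come from Step 2 above, and the only filter applied is a word-count check.

```python
def augment(seed):
    """Step 2: expand one seed prompt into several variations."""
    templates = ["{} to a beginner", "{} in detail", "{} with examples", "{} in one sentence"]
    return [template.format(seed) for template in templates]

def passes_filters(response, min_words=3, max_words=200):
    """Step 4 (partial): drop responses that are too short or too long."""
    return min_words <= len(response.split()) <= max_words

def build_dataset(seeds, teacher, samples_per_prompt=2):
    """Steps 1-5: seeds -> augmented prompts -> teacher responses -> filtered pairs."""
    dataset = []
    for seed in seeds:
        for prompt in augment(seed):
            for _ in range(samples_per_prompt):
                response = teacher(prompt)  # Step 3: query the teacher
                if passes_filters(response):
                    dataset.append((prompt, response))
    return dataset

def fake_teacher(prompt):
    """Hypothetical placeholder; a real teacher would sample with temperature > 0."""
    return f"Here is a placeholder answer for: {prompt}"

dataset = build_dataset(["Explain photosynthesis"], fake_teacher)
print(len(dataset), "examples")
print(dataset[0])
```

Swapping `fake_teacher` for a real model call and adding the remaining filters from Step 4 turns this skeleton into a working pipeline.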

Self-Instruct and Evol-Instruct

Two influential techniques for generating diverse instruction data:

Self-Instruct (used by Stanford Alpaca):

Start with small seed set of instructions
Use LLM to generate new instructions
Use LLM to generate responses
Filter for quality
Repeat

This bootstrapping approach can generate tens of thousands of instruction-response pairs from a small seed set; Stanford Alpaca produced 52K examples starting from 175 seed tasks.
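The Self-Instruct loop can be sketched as follows. The three callables (`propose`, `respond`, `keep`) are hypothetical placeholders for LLM calls and a quality filter, and the exact-duplicate check stands in for the similarity-based deduplication a real implementation would use.

```python
def self_instruct(seed_instructions, propose, respond, keep, rounds=2):
    """Grow an instruction pool by repeatedly proposing, answering, and filtering."""
    pool = list(seed_instructions)
    pairs = []
    for _ in range(rounds):
        for instruction in propose(pool):       # LLM generates new instructions
            if instruction in pool:
                continue                        # skip exact duplicates
            response = respond(instruction)     # LLM generates a response
            if keep(instruction, response):     # filter for quality
                pool.append(instruction)        # accepted items seed later rounds
                pairs.append((instruction, response))
    return pairs

# Toy stand-ins so the loop runs without a model.
def toy_propose(pool):
    return [f"{item} (variant {len(pool)})" for item in pool[-2:]]

def toy_respond(instruction):
    return f"Response to: {instruction}"

def toy_keep(instruction, response):
    return len(response.split()) >= 3

seeds = ["Summarize a paragraph", "Translate a sentence"]
for instruction, response in self_instruct(seeds, toy_propose, toy_respond, toy_keep):
    print(instruction, "->", response)
```

Because accepted instructions are added back to the pool, each round can build on the previous one, which is where the bootstrapping effect comes from.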

Evol-Instruct (WizardLM):

Start with simple instructions
"Evolve" instructions to be more complex:
- Add constraints
- Increase reasoning depth
- Complicate the scenario
- Require multiple steps
Generate responses to evolved instructions
Train on progressively harder data

Evol-Instruct produces harder training data, which tends to yield more capable students.
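A rough sketch of the evolution step is shown below. The prompt templates are hypothetical wording written for this example, not the prompts used by WizardLM, and `rewrite` is a placeholder for the LLM call that performs the rewriting.

```python
# Hypothetical templates, one per evolution strategy listed above.
EVOLUTION_TEMPLATES = {
    "add_constraints": "Rewrite the instruction so it includes one extra constraint.\n\nInstruction: {instruction}",
    "increase_reasoning": "Rewrite the instruction so it requires deeper reasoning.\n\nInstruction: {instruction}",
    "complicate_scenario": "Rewrite the instruction with a more complicated scenario.\n\nInstruction: {instruction}",
    "multiple_steps": "Rewrite the instruction so it requires multiple steps.\n\nInstruction: {instruction}",
}

def build_evolution_prompt(instruction, strategy):
    """Fill the template for the chosen strategy with the current instruction."""
    return EVOLUTION_TEMPLATES[strategy].format(instruction=instruction)

def evolve(instruction, rewrite, strategies):
    """Apply strategies in order, feeding each rewritten instruction into the next."""
    history = [instruction]
    for strategy in strategies:
        history.append(rewrite(build_evolution_prompt(history[-1], strategy)))
    return history

def toy_rewrite(prompt):
    """Stand-in for an LLM: tags the instruction instead of rewriting it."""
    return prompt.split("Instruction: ", 1)[1] + " [evolved]"

for step in evolve("Sort a list of numbers", toy_rewrite, ["add_constraints", "multiple_steps"]):
    print(step)
```

Keeping the full history makes it possible to train on progressively harder versions of the same instruction, as described above.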

Quality Control for Synthetic Data

Synthetic data can contain errors, inconsistencies, and hallucinations. Quality control is essential:

┌─────────────────────────────────────────────────────────────────────────┐
│                    QUALITY CONTROL STRATEGIES                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. SELF-CONSISTENCY FILTERING                                          │
│     ───────────────────────────                                          │
│     Generate multiple responses to the same prompt.                      │
│     Keep only prompts where responses are consistent.                    │
│                                                                          │
│     If teacher says "Paris" 3/3 times → high confidence, keep           │
│     If teacher says "Paris", "Lyon", "France" → low confidence, discard │
│                                                                          │
│  2. PERPLEXITY FILTERING                                                │
│     ─────────────────────                                                │
│     Use a separate model to score response perplexity.                   │
│     Very high perplexity = likely nonsense, discard.                    │
│                                                                          │
│  3. LENGTH AND FORMAT CHECKS                                            │
│     ─────────────────────────                                            │
│     Remove responses that are:                                           │
│     • Too short (likely incomplete)                                     │
│     • Too long (likely repetitive)                                      │
│     • Missing expected format (for structured tasks)                    │
│                                                                          │
│  4. REWARD MODEL FILTERING                                              │
│     ─────────────────────────                                            │
│     Use a reward model (from RLHF) to score responses.                  │
│     Keep only high-scoring responses.                                    │
│                                                                          │
│  5. EXECUTION-BASED FILTERING (for code)                                │
│     ──────────────────────────────────────                               │
│     Run the generated code.                                              │
│     Keep only code that executes successfully / passes tests.           │
│                                                                          │
│  6. HUMAN EVALUATION                                                     │
│     ─────────────────                                                    │
│     Sample and manually review a subset.                                 │
│     Identify systematic issues to filter programmatically.              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
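Two of these strategies, self-consistency filtering and length checks, are simple enough to sketch directly. The thresholds and function names are illustrative assumptions, and comparing normalized strings is a crude proxy for "consistent": free-form answers would need semantic comparison instead.

```python
from collections import Counter

def self_consistent(responses, min_agreement=1.0):
    """True if the most common answer makes up at least min_agreement of the samples."""
    normalized = [response.strip().lower() for response in responses]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized) >= min_agreement

def length_ok(response, min_words=1, max_words=500):
    """Reject responses that are empty, too short, or suspiciously long."""
    return min_words <= len(response.split()) <= max_words

def filter_prompts(samples_by_prompt):
    """Keep one response per prompt when the teacher's samples agree and pass length checks."""
    kept = []
    for prompt, responses in samples_by_prompt.items():
        if self_consistent(responses) and length_ok(responses[0]):
            kept.append((prompt, responses[0]))
    return kept

samples = {
    "What is the capital of France?": ["Paris", "Paris", "paris"],
    "Name a large French city.": ["Paris", "Lyon", "France"],
}
print(filter_prompts(samples))
```

The first prompt is kept because all three samples agree; the second is discarded because the teacher gave three different answers.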

Practical Implementation

Architecture Considerations

Teacher and student architecture matching:

Same architecture family (transformer) works best
Different sizes within family (70B → 7B) is common
Cross-architecture (transformer → other) is harder

Student size selection:

Larger student = better quality, higher cost
Common ratios: Teacher/Student = 10×, 20×, 100×
Empirical: Try multiple sizes, evaluate on your task

Training Configuration

┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTILLATION TRAINING CONFIG                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  HYPERPARAMETERS                                                         │
│  ───────────────                                                         │
│                                                                          │
│  # Loss weights                                                          │
│  alpha = 0.7              # Weight for distillation loss                │
│  temperature = 3.0        # Softmax temperature                          │
│                                                                          │
│  # Training                                                              │
│  learning_rate = 2e-5     # Lower than pre-training                     │
│  batch_size = 32          # Per GPU                                      │
│  epochs = 3-5             # Usually few epochs suffice                  │
│  warmup_ratio = 0.1       # Learning rate warmup                        │
│                                                                          │
│  # Data                                                                  │
│  max_seq_length = 2048    # Match teacher's effective context           │
│  num_examples = 50K-5M    # More is usually better                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GUIDELINES                                                              │
│  ──────────                                                              │
│                                                                          │
│  Temperature:                                                            │
│  • T=1: Standard (no extra softening)                                   │
│  • T=2-3: Good for classification tasks                                 │
│  • T=3-5: Good for language modeling                                    │
│  • T>5: Usually too soft, loses signal                                  │
│                                                                          │
│  Alpha (distillation weight):                                            │
│  • α=0.5: Balanced between teacher and task                             │
│  • α=0.7-0.9: Focus more on matching teacher                            │
│  • α=1.0: Pure distillation (no ground truth loss)                      │
│                                                                          │
│  Learning rate:                                                          │
│  • Start lower than pre-training (student already has knowledge)        │
│  • Too high: Catastrophic forgetting                                    │
│  • Too low: Slow convergence                                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
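
To make the temperature guideline concrete, here is a minimal sketch in plain Python (no training framework assumed) of how dividing logits by T flattens a distribution. The helper name `softmax_with_temperature` and the example logits are illustrative assumptions, not part of any library.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax over logits divided by `temperature` (numerically stable)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for three candidate tokens.
teacher_logits = [4.0, 2.0, 0.5]

for t in (1.0, 3.0, 10.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(f"T={t:>4}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Higher T moves probability mass from the top token toward the others, which is the "softening" the guidelines refer to. Past some point the distribution is nearly uniform and carries little signal.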

Simple Implementation Example

Python

import torch
import torch.nn.functional as F


class DistillationTrainer:
    """
    A simple distillation trainer for language models.

    This implements logit-level (soft-label) distillation: the student
    is trained to match the teacher's softened token distributions.
    It needs access to the teacher's logits and assumes both models
    share the same tokenizer and vocabulary.
    """

    def __init__(
        self,
        student_model,
        teacher_model,
        tokenizer,
        temperature: float = 3.0,
        alpha: float = 0.7
    ):
        self.student = student_model
        self.teacher = teacher_model
        self.tokenizer = tokenizer
        self.temperature = temperature
        self.alpha = alpha

        # Freeze teacher and put it in eval mode (disables dropout)
        self.teacher.eval()
        for param in self.teacher.parameters():
            param.requires_grad = False

    def distillation_loss(
        self,
        student_logits,  # [batch, seq_len, vocab]
        teacher_logits,  # [batch, seq_len, vocab]
        labels,          # [batch, seq_len], -100 where the loss is ignored
        attention_mask   # [batch, seq_len], 1 for real tokens, 0 for padding
    ):
        """
        Compute combined distillation and task loss.

        The distillation loss uses KL divergence between
        softened teacher and student distributions, averaged
        over non-padding tokens.
        """
        # Soften distributions with temperature
        student_soft = F.log_softmax(student_logits / self.temperature, dim=-1)
        teacher_soft = F.softmax(teacher_logits / self.temperature, dim=-1)

        # Per-token KL divergence (distillation loss), summed over the vocabulary
        token_kl = F.kl_div(
            student_soft,
            teacher_soft,
            reduction='none'
        ).sum(dim=-1)  # [batch, seq_len]

        # Average over real tokens only so padding does not contribute.
        # Multiply by T^2 to scale gradients appropriately
        mask = attention_mask.to(token_kl.dtype)
        kl_loss = (token_kl * mask).sum() / mask.sum().clamp(min=1)
        kl_loss = kl_loss * (self.temperature ** 2)

        # Standard cross-entropy loss (task loss).
        # Assumes labels are aligned with input_ids (Hugging Face convention):
        # logits at position t predict the token at t+1, so shift by one.
        shift_logits = student_logits[:, :-1, :].contiguous()
        shift_labels = labels[:, 1:].contiguous()
        task_loss = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            ignore_index=-100  # Ignore padding and masked positions
        )

        # Combined loss
        total_loss = self.alpha * kl_loss + (1 - self.alpha) * task_loss

        return total_loss, kl_loss, task_loss

    def train_step(self, batch):
        """Single training step."""
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']

        # Get teacher predictions (no grad)
        with torch.no_grad():
            teacher_outputs = self.teacher(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            teacher_logits = teacher_outputs.logits

        # Get student predictions
        student_outputs = self.student(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        student_logits = student_outputs.logits

        # Compute loss
        loss, kl_loss, task_loss = self.distillation_loss(
            student_logits,
            teacher_logits,
            labels,
            attention_mask
        )

        return loss, {'kl_loss': kl_loss.item(), 'task_loss': task_loss.item()}

Response-Level Distillation (More Common)

For most LLM applications, response-level distillation is simpler and often nearly as effective, and it needs only the teacher's generated text rather than its logits:

Python

import torch


def generate_distillation_data(
    teacher_model,
    prompts: list[str],
    tokenizer,
    num_responses_per_prompt: int = 1,
    max_new_tokens: int = 512,
    temperature: float = 0.7
) -> list[dict]:
    """
    Generate synthetic training data from teacher.

    This is the first step in response-level distillation.
    """
    data = []

    for prompt in prompts:
        for _ in range(num_responses_per_prompt):
            # Generate teacher response
            # Assumes a Hugging Face-style model that exposes `.device`
            inputs = tokenizer(prompt, return_tensors='pt').to(teacher_model.device)

            with torch.no_grad():
                outputs = teacher_model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    temperature=temperature,
                    do_sample=True
                )

            response = tokenizer.decode(
                outputs[0][inputs['input_ids'].shape[1]:],
                skip_special_tokens=True
            )

            data.append({
                'prompt': prompt,
                'response': response
            })

    return data

def train_student_on_responses(
    student_model,
    distillation_data: list[dict],
    tokenizer,
    num_epochs: int = 3
):
    """
    Train student on teacher-generated responses.

    This is standard supervised fine-tuning on synthetic data.
    """
    # Format as instruction-tuning data
    training_texts = [
        f"### Instruction:\n{d['prompt']}\n\n### Response:\n{d['response']}"
        for d in distillation_data
    ]

    # Standard SFT training loop
    # (Implementation depends on your training framework)
    # ...
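
Before the formatting step, it usually helps to drop degenerate records. The following is a minimal sketch of that cleanup under stated assumptions: the helper name `prepare_sft_texts` and the `min_response_chars` threshold are hypothetical and should be tuned per task. It reuses the same prompt template as above.

```python
def prepare_sft_texts(
    distillation_data: list[dict],
    min_response_chars: int = 20,  # assumed threshold, not a recommended value
) -> list[str]:
    """Drop very short and duplicate records, then format the rest for SFT."""
    seen: set[tuple[str, str]] = set()
    texts: list[str] = []
    for record in distillation_data:
        prompt = record["prompt"].strip()
        response = record["response"].strip()
        if len(response) < min_response_chars:
            continue  # likely an empty, truncated, or degenerate generation
        key = (prompt, response)
        if key in seen:
            continue  # an exact duplicate adds no new training signal
        seen.add(key)
        texts.append(f"### Instruction:\n{prompt}\n\n### Response:\n{response}")
    return texts
```

Length and exact-duplicate checks are only the cheapest filters. Task-specific checks (format validation, answer verification) would slot into the same loop.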

When to Distill vs. Other Approaches

Decision Framework

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    WHEN TO USE DISTILLATION                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  USE DISTILLATION WHEN:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  ✓ You have access to a strong teacher model                            │
│  ✓ You need lower inference cost than the teacher                       │
│  ✓ You need faster inference latency                                    │
│  ✓ You need to run on-device or edge                                    │
│  ✓ You need to avoid API dependencies                                   │
│  ✓ Some quality degradation is acceptable                               │
│  ✓ You have compute for training (cheaper than running teacher)         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DON'T USE DISTILLATION WHEN:                                           │
│  ─────────────────────────────                                           │
│                                                                          │
│  ✗ You need teacher-level quality (no degradation acceptable)           │
│  ✗ You have very limited training compute                               │
│  ✗ Your task requires the full capability of the teacher                │
│  ✗ You don't have access to a suitable teacher                          │
│  ✗ API costs are acceptable for your scale                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ALTERNATIVES TO CONSIDER:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  • Quantization: Reduce precision without training                      │
│    - Faster to implement                                                │
│    - No training data needed                                            │
│    - But smaller speedup than distillation                              │
│                                                                          │
│  • Pruning: Remove unnecessary weights                                  │
│    - Can combine with distillation                                      │
│    - Structured pruning enables speedup                                 │
│                                                                          │
│  • Prompt optimization: Get more from existing model                    │
│    - No training needed                                                 │
│    - But still need large model at inference                            │
│                                                                          │
│  • Smaller teacher: Use smallest model that works                       │
│    - Avoids distillation entirely                                       │
│    - But may not exist for your capability level                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
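
The cost items in this checklist reduce to a break-even calculation: a one-time distillation cost against a per-request saving. The sketch below shows that arithmetic. All dollar figures are hypothetical placeholders, and `break_even_requests` is an illustrative helper name.

```python
def break_even_requests(
    one_time_cost: float,
    teacher_cost_per_request: float,
    student_cost_per_request: float,
) -> float:
    """Number of requests after which the distilled student is cheaper overall."""
    saving = teacher_cost_per_request - student_cost_per_request
    if saving <= 0:
        return float("inf")  # the student never pays for itself
    return one_time_cost / saving


# Hypothetical figures in dollars: data generation plus training, then per-request costs.
one_time = 500.0 + 2_000.0
requests_needed = break_even_requests(
    one_time,
    teacher_cost_per_request=0.01,
    student_cost_per_request=0.001,
)
print(f"Break-even after about {requests_needed:,.0f} requests")
```

If expected traffic sits well below the break-even point, the "API costs are acceptable for your scale" case applies and distillation is hard to justify on cost alone.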

Distillation vs. Direct Training

Should you distill from a teacher or train the small model directly?

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTILLATION VS DIRECT TRAINING                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  DISTILLATION                       DIRECT TRAINING                      │
│  ─────────────                      ───────────────                      │
│                                                                          │
│  Data: Teacher-generated            Data: Ground truth labels            │
│  Quality: Teacher-bounded           Quality: Data-bounded                │
│  Cost: Depends on teacher API       Cost: Fixed data collection          │
│  Risk: Inherits teacher biases      Risk: Needs more data for quality   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHEN DISTILLATION WINS:                                                │
│                                                                          │
│  • Limited labeled data (teacher can generate unlimited)                │
│  • Teacher is much better than small model could be with direct data   │
│  • Soft labels provide useful signal beyond hard labels                 │
│  • CoT distillation teaches reasoning that's hard to label              │
│                                                                          │
│  WHEN DIRECT TRAINING WINS:                                             │
│                                                                          │
│  • Abundant high-quality labeled data                                   │
│  • Teacher makes systematic errors you don't want to inherit            │
│  • Task is simple enough that small model can learn directly            │
│  • Ground truth labels are more reliable than teacher                   │
│                                                                          │
│  BEST PRACTICE:                                                          │
│  Combine both! Use ground truth where available + teacher generations  │
│  for augmentation.                                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
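
One simple way to combine both sources is to prefer ground truth wherever it exists and fall back to teacher generations elsewhere. This is a minimal sketch of that policy. The record layout and the `source` field are assumptions chosen for illustration.

```python
def build_mixed_dataset(
    ground_truth: dict[str, str],
    teacher_generated: dict[str, str],
) -> list[dict]:
    """Prefer human labels; fill the remaining prompts with teacher responses."""
    records = []
    for prompt in sorted(set(ground_truth) | set(teacher_generated)):
        if prompt in ground_truth:
            records.append(
                {"prompt": prompt, "response": ground_truth[prompt], "source": "ground_truth"}
            )
        else:
            records.append(
                {"prompt": prompt, "response": teacher_generated[prompt], "source": "teacher"}
            )
    return records
```

Keeping the `source` tag makes it possible to reweight or audit the two pools separately later.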

Advanced Techniques

Multi-Teacher Distillation

Use multiple teachers to get diverse knowledge:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    MULTI-TEACHER DISTILLATION                            │
│                                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                  │
│  │  Teacher A   │  │  Teacher B   │  │  Teacher C   │                  │
│  │   (GPT-4)    │  │   (Claude)   │  │   (Llama)    │                  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘                  │
│         │                 │                 │                           │
│         └────────────┬────┴────────────────┘                           │
│                      │                                                   │
│                      ▼                                                   │
│              ┌──────────────┐                                           │
│              │   ENSEMBLE   │   Combine teacher predictions:            │
│              │              │   • Average probabilities                 │
│              └──────┬───────┘   • Weighted by confidence                │
│                     │           • Select best per-example               │
│                     ▼                                                    │
│              ┌──────────────┐                                           │
│              │   STUDENT    │                                           │
│              └──────────────┘                                           │
│                                                                          │
│  Benefits:                                                               │
│  • Reduces individual teacher biases                                    │
│  • Student can sometimes exceed any single teacher                      │
│  • More robust to teacher errors                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
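
Two of the combination strategies in the diagram can be sketched in a few lines. Averaging probabilities only makes sense when all teachers share an output space (the same classes or the same vocabulary). For API-only teachers with different tokenizers, per-example selection is the practical option. The function names and the scoring callable below are illustrative assumptions.

```python
from typing import Callable, Optional


def average_teacher_probs(
    teacher_probs: list[list[float]],
    weights: Optional[list[float]] = None,
) -> list[float]:
    """Weighted average of per-class probabilities from several teachers."""
    if not teacher_probs:
        raise ValueError("need at least one teacher")
    if weights is None:
        weights = [1.0] * len(teacher_probs)
    if len(weights) != len(teacher_probs):
        raise ValueError("one weight per teacher is required")
    total_weight = sum(weights)
    num_classes = len(teacher_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, teacher_probs)) / total_weight
        for c in range(num_classes)
    ]


def select_best_response(
    candidates: dict[str, str],
    score: Callable[[str], float],
) -> tuple[str, str]:
    """Return (teacher_name, response) for the highest-scoring candidate."""
    best_teacher = max(candidates, key=lambda name: score(candidates[name]))
    return best_teacher, candidates[best_teacher]
```

In practice `score` would be a reward model, a verifier, or a task-specific check. Any callable that maps a response to a number works with this sketch.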

Self-Distillation

A model can distill itself! Train a model, then use it to generate data for training a better version:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    SELF-DISTILLATION                                     │
│                                                                          │
│  Round 1: Train base model on original data                             │
│           ↓                                                              │
│  Round 2: Use model to generate responses, filter best, retrain         │
│           ↓                                                              │
│  Round 3: Repeat with improved model                                    │
│           ↓                                                              │
│  Round N: Model progressively improves                                  │
│                                                                          │
│  Key insight: Generate many responses, keep only the best              │
│  (Best-of-N sampling + retraining)                                      │
│                                                                          │
│  This is related to:                                                     │
│  • ReST (Reinforced Self-Training)                                      │
│  • STaR (Self-Taught Reasoner)                                          │
│  • Rejection sampling fine-tuning                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
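
One round of the loop above can be sketched as best-of-N sampling followed by a filter. The model and scorer here are toy stand-ins (an adder with occasional off-by-one errors and an exact arithmetic check). In a real pipeline `generate` would call the current model and `score` would be a verifier or reward model. All names are illustrative.

```python
import random
from typing import Callable


def best_of_n_round(
    prompts: list[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
    min_score: float = 0.5,
) -> list[dict]:
    """Sample n responses per prompt and keep the best one if it clears min_score."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda response: score(prompt, response))
        if score(prompt, best) >= min_score:
            kept.append({"prompt": prompt, "response": best})
    return kept


# Toy stand-ins so the sketch runs without a model.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([0, 0, 1, -1]))  # sometimes off by one


def toy_score(prompt: str, response: str) -> float:
    a, b = map(int, prompt.split("+"))
    return 1.0 if int(response) == a + b else 0.0


training_data = best_of_n_round(["2+3", "10+7"], toy_generate, toy_score, n=8)
print(training_data)
```

The kept records become the training set for the next round. Prompts where no sample clears the threshold are dropped rather than trained on.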

Progressive Distillation

Distill in stages through intermediate-sized models:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    PROGRESSIVE DISTILLATION                              │
│                                                                          │
│  Instead of: 70B ──────────────────────────────────────→ 1B             │
│              (hard to bridge large capability gap)                       │
│                                                                          │
│  Do: 70B ────→ 13B ────→ 7B ────→ 3B ────→ 1B                          │
│      (each step is smaller gap, easier to distill)                      │
│                                                                          │
│  Benefits:                                                               │
│  • Smaller gaps = more effective transfer                               │
│  • Can stop at any intermediate size                                    │
│  • Each stage can be optimized independently                            │
│                                                                          │
│  Costs:                                                                  │
│  • More total training compute                                          │
│  • More complexity in pipeline                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
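
Structurally, the staged pipeline is a loop in which each stage's student becomes the next stage's teacher. The sketch below shows only that control flow. The `distill` callable is a hypothetical placeholder for a full training run, and the toy version just records lineage.

```python
from typing import Any, Callable


def progressive_distill(
    teacher: Any,
    student_sizes: list[float],
    distill: Callable[[Any, float], Any],
) -> list[Any]:
    """Distill through a chain of sizes; each student teaches the next stage."""
    stages = []
    for size in student_sizes:
        student = distill(teacher, size)
        stages.append(student)
        teacher = student  # the new student becomes the next teacher
    return stages


def toy_distill(teacher: dict, size: float) -> dict:
    """Stand-in for a training run: records which model taught which."""
    return {"size": size, "teacher_size": teacher["size"]}


stages = progressive_distill({"size": 70}, [13, 7, 3, 1], toy_distill)
for stage in stages:
    print(f"{stage['teacher_size']}B -> {stage['size']}B")
```

Because every intermediate model is returned, the pipeline can stop at whichever size meets the quality target.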

Real-World Examples

Case Study: Alpaca (Stanford)

The Alpaca project demonstrated that high-quality instruction-following could be distilled from a large teacher:

Teacher: text-davinci-003 (GPT-3.5)
Student: LLaMA 7B
Data: 52K instructions generated via Self-Instruct
Cost: ~$500 for API calls to generate data
Result: 7B model with GPT-3.5-like instruction following

Key insight: With good distillation data, even a 7B model can achieve useful instruction-following capability.

Case Study: Orca (Microsoft)

Orca improved on simple distillation by collecting richer teacher outputs:

Instead of just answers, collected explanations
Asked GPT-4 to "explain step-by-step"
Resulted in much better reasoning in the student
Demonstrated importance of CoT distillation

Case Study: Phi (Microsoft)

The Phi models took a different approach:

Heavy emphasis on textbook-quality data
Synthetic data generation at scale
Curriculum learning (simple → complex)
Result: 3B model competitive with 70B on some benchmarks

Key insight: Data quality can sometimes matter more than model size.

Common Pitfalls and Solutions

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTILLATION PITFALLS                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PITFALL 1: Inheriting Teacher Errors                                   │
│  ─────────────────────────────────────                                   │
│  Problem: Teacher makes mistakes, student learns them                   │
│  Solution:                                                               │
│  • Filter synthetic data for quality                                    │
│  • Mix with ground truth data                                           │
│  • Use multiple teachers                                                │
│  • Verify on held-out set                                               │
│                                                                          │
│  PITFALL 2: Distribution Mismatch                                       │
│  ─────────────────────────────────                                       │
│  Problem: Training data doesn't match deployment distribution           │
│  Solution:                                                               │
│  • Generate data from realistic prompts                                 │
│  • Include edge cases                                                   │
│  • Test on representative eval set                                      │
│                                                                          │
│  PITFALL 3: Catastrophic Forgetting                                     │
│  ──────────────────────────────────                                      │
│  Problem: Student loses general capabilities while specializing         │
│  Solution:                                                               │
│  • Mix general and specialized data                                     │
│  • Lower learning rate                                                  │
│  • Monitor general benchmarks                                           │
│                                                                          │
│  PITFALL 4: Overconfident Student                                       │
│  ───────────────────────────────                                         │
│  Problem: Student produces high-confidence wrong answers                │
│  Solution:                                                               │
│  • Higher temperature during distillation                               │
│  • Train on teacher's uncertainty (soft labels)                         │
│  • Calibration fine-tuning                                              │
│                                                                          │
│  PITFALL 5: Teacher API Costs                                           │
│  ───────────────────────────                                             │
│  Problem: Generating enough data is expensive                           │
│  Solution:                                                               │
│  • Batch requests efficiently                                           │
│  • Cache teacher outputs                                                │
│  • Use open-source teacher if possible                                  │
│  • Start with less data, scale if needed                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
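
For Pitfall 5, caching is the cheapest mitigation to implement. This is a minimal in-memory sketch, assuming the teacher is reachable through some callable taking a prompt and a temperature. `CachedTeacher` and `call_teacher` are hypothetical names, and a real pipeline would persist the cache to disk.

```python
import hashlib
import json
from typing import Callable


class CachedTeacher:
    """Wrap a teacher call so repeated (prompt, temperature) pairs are served from cache."""

    def __init__(self, call_teacher: Callable[[str, float], str]):
        self._call_teacher = call_teacher
        self._cache: dict[str, str] = {}
        self.api_calls = 0  # counts only real (uncached) teacher calls

    def _key(self, prompt: str, temperature: float) -> str:
        payload = json.dumps({"prompt": prompt, "temperature": temperature}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def generate(self, prompt: str, temperature: float = 0.0) -> str:
        key = self._key(prompt, temperature)
        if key not in self._cache:
            self._cache[key] = self._call_teacher(prompt, temperature)
            self.api_calls += 1
        return self._cache[key]


# Stand-in teacher so the sketch runs offline.
teacher = CachedTeacher(lambda prompt, temperature: prompt.upper())
teacher.generate("explain distillation")
teacher.generate("explain distillation")  # served from cache
print(teacher.api_calls)
```

Note that with sampling (temperature above zero) a cache hit returns the earlier sample, so this design suits reruns of the same pipeline rather than collecting several distinct responses per prompt.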

Summary

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    KNOWLEDGE DISTILLATION SUMMARY                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WHAT: Transfer capabilities from large teacher to small student        │
│                                                                          │
│  WHY:                                                                    │
│  • Reduce inference cost (10-100×)                                      │
│  • Enable edge deployment                                               │
│  • Improve latency                                                      │
│  • Maintain privacy (no API calls)                                      │
│                                                                          │
│  HOW:                                                                    │
│  • Generate synthetic data from teacher                                 │
│  • Filter for quality                                                   │
│  • Train student on teacher outputs                                     │
│  • Optionally use soft labels with temperature                          │
│                                                                          │
│  KEY TECHNIQUES:                                                         │
│  • Response-level distillation (most common for LLMs)                   │
│  • Chain-of-thought distillation (for reasoning)                        │
│  • Quality filtering (essential for synthetic data)                     │
│  • Temperature softening (reveals dark knowledge)                       │
│                                                                          │
│  EXPECT:                                                                 │
│  • 70-95% of teacher performance (varies by task and size ratio)        │
│  • 10-100× cost reduction                                               │
│  • 2-10× latency improvement                                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Distillation for Specific Capabilities

Different capabilities require different distillation approaches. Understanding these nuances is crucial for effective knowledge transfer.

Distilling Reasoning Capabilities

Reasoning is one of the hardest capabilities to distill because it involves multi-step logical processes that small models struggle to learn from just the final answer.

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    REASONING DISTILLATION STRATEGIES                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STRATEGY 1: DETAILED TRACE DISTILLATION                                │
│  ────────────────────────────────────────                                │
│  Don't just teach the answer—teach every reasoning step:                │
│                                                                          │
│  Weak prompt: "What is 23 × 47?"                                        │
│  Weak response: "1081"                                                  │
│                                                                          │
│  Strong prompt: "What is 23 × 47? Show your work."                     │
│  Strong response:                                                        │
│  "Let me break this down:                                               │
│   23 × 47 = 23 × (50 - 3)                                              │
│           = 23 × 50 - 23 × 3                                           │
│           = 1150 - 69                                                   │
│           = 1081                                                        │
│                                                                          │
│   Alternatively: 23 × 47 = (20 + 3) × 47                               │
│                         = 20 × 47 + 3 × 47                             │
│                         = 940 + 141                                    │
│                         = 1081"                                         │
│                                                                          │
│  The detailed trace teaches HOW to reason, not just WHAT to say.       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STRATEGY 2: ERROR ANALYSIS DISTILLATION                                │
│  ────────────────────────────────────────                                │
│  Teach the student to recognize and avoid common mistakes:              │
│                                                                          │
│  Include in training data:                                               │
│  "Here's a common mistake: 23 × 47 ≠ 23 × 40 + 7 = 927               │
│   The error: we added 7 instead of 23 × 7.                            │
│   Correct approach: 23 × 40 + 23 × 7 = 920 + 161 = 1081"             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STRATEGY 3: VERIFICATION DISTILLATION                                  │
│  ──────────────────────────────────────                                  │
│  Teach the student to check its own work:                               │
│                                                                          │
│  "Let me verify: 1081 ÷ 23 should equal 47.                           │
│   1081 ÷ 23 = 47 ✓                                                    │
│   The answer is confirmed: 23 × 47 = 1081"                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
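
For arithmetic, traces like those in Strategies 1 and 3 can be generated programmatically, which guarantees the steps are correct. The following is a minimal sketch. The function name and the exact wording of the trace are illustrative assumptions, modelled on the examples in the box.

```python
def multiplication_trace(a: int, b: int) -> dict:
    """Build a prompt/response pair with a step-by-step trace and a self-check."""
    if a <= 0 or b <= 0:
        raise ValueError("this sketch only handles positive integers")
    tens = (b // 10) * 10  # e.g. 47 -> 40
    ones = b % 10          # e.g. 47 -> 7
    part_tens, part_ones = a * tens, a * ones
    product = part_tens + part_ones
    if product != a * b:  # guard: never emit a trace with wrong arithmetic
        raise AssertionError("decomposition does not match the product")
    response = "\n".join([
        "Let me break this down:",
        f"{a} × {b} = {a} × ({tens} + {ones})",
        f"= {a} × {tens} + {a} × {ones}",
        f"= {part_tens} + {part_ones}",
        f"= {product}",
        f"Let me verify: {product} ÷ {a} = {product // a} ✓",
    ])
    return {"prompt": f"What is {a} × {b}? Show your work.", "response": response}


example = multiplication_trace(23, 47)
print(example["prompt"])
print(example["response"])
```

Programmatic traces complement teacher-written ones: they are always correct but less varied in style, so mixing the two is a reasonable default.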

Distilling Code Generation

Code presents unique opportunities for distillation because we have automatic verification (execution):

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    CODE DISTILLATION APPROACHES                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EXECUTION-VERIFIED DISTILLATION:                                        │
│  ─────────────────────────────────                                       │
│  1. Generate many code solutions from teacher                           │
│  2. Execute each against test cases                                     │
│  3. Keep only passing solutions                                         │
│  4. Train student on verified-correct code                              │
│                                                                          │
│  Benefits:                                                               │
│  • Ground truth filtering (fewer teacher errors propagated)            │
│  • Multiple correct solutions teach flexibility                         │
│  • Can generate unlimited training data                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXPLANATION-ENHANCED CODE DISTILLATION:                                │
│  ────────────────────────────────────────                                │
│  Don't just show the code—explain the approach:                         │
│                                                                          │
│  "Problem: Find two numbers that sum to target.                        │
│                                                                          │
│   Approach: Use a hash map for O(n) lookup.                            │
│   - For each number, check if (target - number) exists                 │
│   - If yes, we found our pair                                          │
│   - If no, add current number to hash map                              │
│                                                                          │
│   def two_sum(nums, target):                                           │
│       seen = {}  # Maps number to its index                            │
│       for i, num in enumerate(nums):                                   │
│           complement = target - num                                    │
│           if complement in seen:                                       │
│               return [seen[complement], i]                             │
│           seen[num] = i                                                │
│       return []  # No solution found"                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MULTI-SOLUTION DISTILLATION:                                           │
│  ─────────────────────────────                                           │
│  Teach multiple valid approaches:                                        │
│                                                                          │
│  "Solution 1 (Hash Map - O(n)): [code]                                 │
│   Solution 2 (Two Pointers on Sorted - O(n log n)): [code]            │
│   Solution 3 (Brute Force - O(n²)): [code]                             │
│                                                                          │
│   Trade-offs:                                                           │
│   - Hash map: Fastest, uses extra memory                               │
│   - Two pointers: No extra memory, requires sorting                    │
│   - Brute force: Simple but slow"                                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
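
The execution-verified steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the helper names (`passes_tests`, `filter_verified`) and the test-case format are assumptions made for this example, and the candidate strings stand in for solutions sampled from a teacher.

```python
from typing import Dict, List, Tuple

# (positional arguments, expected return value); a hypothetical test format
TestCase = Tuple[tuple, object]


def passes_tests(source: str, func_name: str, tests: List[TestCase]) -> bool:
    """Return True only if `source` defines `func_name` and it passes every test."""
    namespace: Dict[str, object] = {}
    try:
        # WARNING: exec runs untrusted code in-process. A real pipeline would
        # use a sandboxed subprocess with time and memory limits.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, crashes, and missing functions all count as failures.
        return False


def filter_verified(candidates: List[str], func_name: str, tests: List[TestCase]) -> List[str]:
    """Steps 2-3 above: execute each candidate and keep only the passing ones."""
    return [src for src in candidates if passes_tests(src, func_name, tests)]


HASH_MAP = '''
def two_sum(nums, target):
    seen = {}
    for i, num in enumerate(nums):
        if target - num in seen:
            return [seen[target - num], i]
        seen[num] = i
    return []
'''

BRUTE_FORCE = '''
def two_sum(nums, target):
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []
'''

# Buggy on purpose: j starts at 0, so an element can be paired with itself.
BUGGY = '''
def two_sum(nums, target):
    for i in range(len(nums)):
        for j in range(len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []
'''

TESTS = [(([2, 7, 11, 15], 9), [0, 1]), (([3, 2, 4], 6), [1, 2]), (([1, 2], 7), [])]

verified = filter_verified([HASH_MAP, BRUTE_FORCE, BUGGY], "two_sum", TESTS)
print(f"kept {len(verified)} of 3 candidates")
```

The filter is only as strong as the tests: the buggy candidate passes the first test case and is rejected only by the second, so thin test suites can still let incorrect code into the training set.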

Distilling Safety and Alignment

Safety behaviors are particularly important—and tricky—to distill:

┌─────────────────────────────────────────────────────────────────────────┐
│                    SAFETY DISTILLATION                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE CHALLENGE:                                                          │
│  ──────────────                                                          │
│  Safety isn't just about refusing harmful requests. It involves:       │
│  • Recognizing subtle harm                                              │
│  • Knowing when to help vs. refuse                                      │
│  • Refusing gracefully without being unhelpful                         │
│  • Not over-refusing (avoiding false positives)                        │
│                                                                          │
│  Teacher models encode complex safety judgments. How do we transfer?   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  APPROACH 1: CONSTITUTIONAL DISTILLATION                                │
│  ────────────────────────────────────────                                │
│  Include explicit safety reasoning in responses:                        │
│                                                                          │
│  Prompt: "How do I pick a lock?"                                       │
│                                                                          │
│  Teacher response with reasoning:                                       │
│  "I can help with lock picking for legitimate purposes like           │
│   locksmithing or when you're locked out of your own property.        │
│   Here are the basic principles: [educational content]                 │
│                                                                          │
│   However, I should note that using these techniques on locks         │
│   you don't own is illegal. Please only use this knowledge            │
│   ethically and legally."                                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  APPROACH 2: BALANCED DATASET                                           │
│  ─────────────────────────────                                           │
│  Include all three:                                                      │
│  • Helpful responses to legitimate requests                            │
│  • Appropriate refusals for harmful requests                           │
│  • Edge cases with nuanced handling                                    │
│                                                                          │
│  Without balance, student may over-refuse or under-refuse.            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  APPROACH 3: RED-TEAM DISTILLATION                                      │
│  ──────────────────────────────                                          │
│  Generate adversarial prompts, have teacher respond appropriately:     │
│                                                                          │
│  • Jailbreak attempts → appropriate refusals                           │
│  • Social engineering → recognition and deflection                     │
│  • Ambiguous requests → thoughtful clarification                       │
│                                                                          │
│  Train student on this adversarial data for robustness.               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
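
One way to check whether a distilled student over-refuses or under-refuses is to measure both error rates on a labeled held-out set. The sketch below assumes each evaluation record is a hypothetical `(should_refuse, did_refuse)` pair; how those labels are produced (human review, a classifier, the teacher) is outside its scope.

```python
from typing import Dict, Iterable, Tuple


def refusal_rates(results: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute over- and under-refusal rates from (should_refuse, did_refuse) pairs."""
    benign = harmful = over = under = 0
    for should_refuse, did_refuse in results:
        if should_refuse:
            harmful += 1
            if not did_refuse:
                under += 1  # complied with a request it should have refused
        else:
            benign += 1
            if did_refuse:
                over += 1  # refused a legitimate request (false positive)
    return {
        "over_refusal_rate": over / benign if benign else 0.0,
        "under_refusal_rate": under / harmful if harmful else 0.0,
    }


# Illustrative labels only, not real evaluation data.
results = [
    (False, False), (False, True), (False, False), (False, False),  # benign prompts
    (True, True), (True, False),                                     # harmful prompts
]
rates = refusal_rates(results)
print(rates)
```

Tracking the two rates separately matters because a single "safety accuracy" number can hide a student that refuses nearly everything.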

Measuring Distillation Success

Evaluation Framework

How do you know if distillation worked? Evaluate the student against the teacher on capability and efficiency, then check both against explicit success criteria:

┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTILLATION EVALUATION FRAMEWORK                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CAPABILITY METRICS:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  1. Task Performance                                                     │
│     • Benchmark scores (MMLU, HumanEval, GSM8K, etc.)                  │
│     • Comparison to teacher on same benchmarks                         │
│     • Transfer ratio = student_score / teacher_score                   │
│                                                                          │
│  2. Quality Metrics                                                      │
│     • Win rate vs teacher (human preference)                           │
│     • Response coherence and fluency                                   │
│     • Factual accuracy on test questions                               │
│                                                                          │
│  3. Reasoning Metrics                                                    │
│     • Chain-of-thought quality                                         │
│     • Multi-step problem solving                                       │
│     • Error rate on reasoning benchmarks                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EFFICIENCY METRICS:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  1. Inference Cost                                                       │
│     • Tokens per second                                                 │
│     • Cost per 1M tokens                                               │
│     • Memory footprint                                                  │
│                                                                          │
│  2. Latency                                                              │
│     • Time to first token                                               │
│     • Time for full response                                            │
│     • P95/P99 latencies                                                │
│                                                                          │
│  3. Deployment                                                           │
│     • Minimum hardware requirements                                     │
│     • Batch throughput                                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SUCCESS CRITERIA:                                                       │
│                                                                          │
│  "Good" distillation typically means:                                   │
│  • 85%+ of teacher performance on target tasks                         │
│  • 10×+ cost reduction                                                 │
│  • 3×+ latency improvement                                             │
│  • Acceptable quality degradation for use case                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
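
The transfer ratio and the success criteria above reduce to a few divisions. The following sketch uses a hypothetical `ModelStats` record and made-up numbers; the default thresholds mirror the "good distillation" figures in the box and should be adjusted per use case.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ModelStats:
    benchmark_score: float     # e.g. accuracy in [0, 1] on the target task
    cost_per_1m_tokens: float  # USD
    p95_latency_s: float       # seconds


def evaluate_distillation(
    teacher: ModelStats,
    student: ModelStats,
    min_transfer: float = 0.85,
    min_cost_reduction: float = 10.0,
    min_speedup: float = 3.0,
) -> Dict[str, object]:
    """Compare student to teacher and check the three numeric success criteria."""
    transfer = student.benchmark_score / teacher.benchmark_score
    cost_reduction = teacher.cost_per_1m_tokens / student.cost_per_1m_tokens
    speedup = teacher.p95_latency_s / student.p95_latency_s
    return {
        "transfer_ratio": transfer,
        "cost_reduction": cost_reduction,
        "latency_speedup": speedup,
        "meets_criteria": (
            transfer >= min_transfer
            and cost_reduction >= min_cost_reduction
            and speedup >= min_speedup
        ),
    }


# Illustrative numbers only.
teacher = ModelStats(benchmark_score=0.80, cost_per_1m_tokens=30.0, p95_latency_s=6.0)
student = ModelStats(benchmark_score=0.70, cost_per_1m_tokens=1.0, p95_latency_s=1.5)
report = evaluate_distillation(teacher, student)
print(report)
```

The fourth criterion, acceptable quality degradation for the use case, is a judgment call and is deliberately left out of the numeric check.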

Failure Analysis

When distillation doesn't work well, understanding why is crucial:

┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTILLATION FAILURE ANALYSIS                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SYMPTOM: Student performs well on training distribution but poorly    │
│           on new inputs                                                  │
│  DIAGNOSIS: Overfitting to synthetic data                               │
│  FIX:                                                                    │
│  • More diverse prompts in training                                    │
│  • Include out-of-distribution examples                                │
│  • Regularization (dropout, weight decay)                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SYMPTOM: Student copies teacher's style but not substance             │
│  DIAGNOSIS: Learning surface patterns, not reasoning                    │
│  FIX:                                                                    │
│  • More chain-of-thought data                                          │
│  • Verification-based filtering                                        │
│  • Feature-level distillation                                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SYMPTOM: Student is worse than base model on some tasks               │
│  DIAGNOSIS: Catastrophic forgetting                                     │
│  FIX:                                                                    │
│  • Lower learning rate                                                  │
│  • Mix in general data                                                 │
│  • Fewer training epochs                                               │
│  • LoRA instead of full fine-tuning                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SYMPTOM: Student confident but often wrong                             │
│  DIAGNOSIS: Lost calibration during distillation                       │
│  FIX:                                                                    │
│  • Higher temperature during training                                  │
│  • Include uncertainty in teacher responses                            │
│  • Calibration fine-tuning                                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SYMPTOM: Gap between student and teacher too large                    │
│  DIAGNOSIS: Capacity gap too wide                                       │
│  FIX:                                                                    │
│  • Larger student model                                                │
│  • Progressive distillation (intermediate steps)                       │
│  • Task-specific distillation (narrow scope)                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
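
The "confident but often wrong" symptom can be quantified before attempting a fix. One common measure is expected calibration error (ECE), which bins predictions by confidence and compares average confidence to accuracy in each bin. The sketch below assumes a per-example confidence in [0, 1] is available for the student, which is an assumption of this example rather than something the article specifies.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Illustrative data: an overconfident model versus a well-calibrated one.
overconfident = expected_calibration_error([0.95] * 4, [True, False, False, True])
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(overconfident, calibrated)  # roughly 0.45 and 0.0
```

Comparing the student's ECE with the teacher's on the same prompts shows whether calibration was lost during distillation or was already poor in the teacher.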

The Economics of Distillation

Cost-Benefit Analysis

When does distillation make economic sense?

┌─────────────────────────────────────────────────────────────────────────┐
│                    DISTILLATION ECONOMICS                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  COSTS OF DISTILLATION:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  1. Data Generation                                                      │
│     • Teacher API calls: $500-50,000 (depends on scale)                │
│     • Prompt engineering time                                          │
│     • Quality filtering compute                                         │
│                                                                          │
│  2. Training                                                             │
│     • GPU hours: $100-10,000 (depends on student size)                 │
│     • Iteration cycles for tuning                                      │
│     • Evaluation compute                                                │
│                                                                          │
│  3. Engineering                                                          │
│     • Pipeline development                                              │
│     • Quality assurance                                                 │
│     • Deployment setup                                                  │
│                                                                          │
│  TOTAL: $1,000 - $100,000 (typical range)                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BENEFITS (per-query savings):                                           │
│  ─────────────────────────────                                           │
│                                                                          │
│  Teacher (GPT-4): ~$0.03 per query (1K tokens in, 500 out)             │
│  Student (7B):    ~$0.001 per query (self-hosted) or $0.0005 (batch)   │
│                                                                          │
│  Savings per query: ~$0.029                                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BREAK-EVEN ANALYSIS:                                                    │
│                                                                          │
│  If distillation costs $10,000:                                         │
│  Break-even = $10,000 / $0.029 per query ≈ 345,000 queries             │
│                                                                          │
│  At 10,000 queries/day ≈ 35 days to break even                         │
│  At 100,000 queries/day ≈ 3.5 days to break even                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHEN DISTILLATION PAYS OFF:                                            │
│                                                                          │
│  ✓ High query volume (>100K queries expected)                          │
│  ✓ Long deployment lifetime                                             │
│  ✓ Latency requirements teacher can't meet                             │
│  ✓ Privacy requirements (can't send data to API)                       │
│  ✓ Reliability requirements (can't depend on external API)             │
│                                                                          │
│  WHEN TO USE TEACHER DIRECTLY:                                          │
│                                                                          │
│  ✗ Low query volume                                                     │
│  ✗ Rapidly changing requirements                                        │
│  ✗ Quality degradation unacceptable                                     │
│  ✗ Limited engineering resources                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
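
The break-even arithmetic above is easy to rerun with your own numbers. The function names below are illustrative, and the inputs are the article's example figures ($10,000 distillation cost, about $0.03 per teacher query, about $0.001 per self-hosted student query).

```python
import math


def break_even_queries(
    distillation_cost: float,
    teacher_cost_per_query: float,
    student_cost_per_query: float,
) -> int:
    """Number of queries after which cumulative savings cover the one-time cost."""
    savings = teacher_cost_per_query - student_cost_per_query
    if savings <= 0:
        raise ValueError("student must be cheaper per query than the teacher")
    return math.ceil(distillation_cost / savings)


def break_even_days(queries_needed: int, queries_per_day: int) -> float:
    """Days of traffic needed to reach the break-even query count."""
    return queries_needed / queries_per_day


queries = break_even_queries(10_000, 0.03, 0.001)
print(queries)                            # about 345,000
print(break_even_days(queries, 10_000))   # about 34.5 days
print(break_even_days(queries, 100_000))  # about 3.4 days
```

This ignores ongoing costs such as hosting, monitoring, and retraining, so treat the result as a lower bound on the real break-even point.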

Future Directions

Emerging Trends in Distillation

The field is evolving rapidly. Key directions to watch:

1. Reasoning-Focused Distillation: Models like DeepSeek-R1 and o1 have sophisticated reasoning capabilities. Distilling these "thinking" models requires new techniques that capture the reasoning process, not just the final output.

2. Multi-Modal Distillation: As vision-language models become standard, distillation must handle multiple modalities (images, text, and their interactions). This is more complex than text-only distillation.

3. Online Distillation: Instead of generating data offline, continuously distill from the teacher during deployment, so the student keeps improving on actual user queries.

4. Distillation-Aware Training: Train teacher models specifically to be good teachers, producing outputs that are easier for students to learn from.

5. Tiny Models, Big Capability: Can useful capabilities be distilled into models small enough to run on phones or IoT devices? Early results are promising.

Enrico Piovano, PhD
