
SFT Deep Dive: Instruction Tuning Techniques and Best Practices

A comprehensive guide to Supervised Fine-Tuning (SFT) for LLMs—covering full fine-tuning vs LoRA vs QLoRA vs DoRA, data curation strategies, instruction formats, multi-task learning, and avoiding catastrophic forgetting.


What is Supervised Fine-Tuning (SFT)?

Supervised Fine-Tuning is the process of adapting a pre-trained language model to follow instructions and respond helpfully to user queries. While pre-training teaches a model to predict text, SFT teaches it to be an assistant.

The transformation is profound:

Before SFT (base model):

Code
User: "What is the capital of France?"
Model: "What is the capital of France?
        A) Paris B) London C) Berlin D) Madrid
        The answer is..."

The base model continues text, treating the input as an incomplete document.

After SFT:

Code
User: "What is the capital of France?"
Model: "The capital of France is Paris."

The fine-tuned model understands it should answer, not continue.

Where SFT Fits in the Training Pipeline

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    THE LLM TRAINING PIPELINE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STAGE 1: PRE-TRAINING                                                   │
│  ─────────────────────                                                   │
│  • Objective: Predict next token                                        │
│  • Data: Trillions of tokens (web, books, code)                         │
│  • Duration: Weeks to months                                            │
│  • Cost: $1M-$100M+                                                     │
│  • Result: Base model (text completion)                                 │
│                                                                          │
│           ↓                                                              │
│                                                                          │
│  STAGE 2: SUPERVISED FINE-TUNING (SFT) ← This post                     │
│  ──────────────────────────────────────                                  │
│  • Objective: Learn to follow instructions                              │
│  • Data: 10K-1M instruction-response pairs                              │
│  • Duration: Hours to days                                              │
│  • Cost: $100-$100K                                                     │
│  • Result: Instruction-following model                                  │
│                                                                          │
│           ↓                                                              │
│                                                                          │
│  STAGE 3: PREFERENCE TUNING (RLHF/DPO)                                  │
│  ─────────────────────────────────────                                   │
│  • Objective: Align with human preferences                              │
│  • Data: Preference pairs (chosen vs rejected)                          │
│  • Duration: Days to weeks                                              │
│  • Cost: $1K-$1M                                                        │
│  • Result: Aligned assistant                                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

SFT is the bridge between a raw language model and a useful assistant. It's often the most impactful training stage for practical applications.


Fine-Tuning Methods: Full, LoRA, QLoRA, and Beyond

The first decision in fine-tuning is how to update the model's parameters. Options range from updating all parameters (full fine-tuning) to updating only a small subset (parameter-efficient methods).

Full Fine-Tuning

What it is: Update all model parameters during training, just like pre-training but on your specific data.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FULL FINE-TUNING                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Original Model                      Fine-tuned Model                    │
│  ┌─────────────────┐                ┌─────────────────┐                 │
│  │ ████████████████│                │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│                 │
│  │ ████████████████│   Training    │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│                 │
│  │ ████████████████│ ──────────→   │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│                 │
│  │ ████████████████│                │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│                 │
│  │ ████████████████│                │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│                 │
│  └─────────────────┘                └─────────────────┘                 │
│                                                                          │
│  ████ = Original weights            ▓▓▓▓ = ALL weights updated          │
│                                                                          │
│  Requirements for 7B model:                                              │
│  • GPU Memory: ~100 GB (with optimizer states & activations)            │
│  • Storage: 14 GB per checkpoint                                        │
│  • Time: 4-12 hours on 8× A100s                                         │
│                                                                          │
│  Pros:                              Cons:                                │
│  ✓ Maximum flexibility              ✗ High memory requirements          │
│  ✓ Full model capacity              ✗ Full model copy per task          │
│  ✓ Can learn any behavior           ✗ Risk of catastrophic forgetting   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Memory breakdown for full fine-tuning:

Component                   Size (7B model)
─────────────────────────   ─────────────────
Model parameters (FP16)     14 GB
Gradients (FP16)            14 GB
Optimizer states (Adam)     56 GB
Activations                 10-50 GB (varies)
Total                       ~100+ GB

This is why full fine-tuning of even a 7B model requires high-end hardware or distributed training.
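As a sanity check, the totals above follow from simple bytes-per-parameter arithmetic. Here is a minimal sketch (activations excluded, since they scale with batch size and sequence length):

Python
def full_ft_memory_gb(n_params_billion: float) -> float:
    """Rough GPU memory estimate for full fine-tuning with Adam.

    Assumes FP16 weights and gradients plus two FP32 Adam moments;
    activations are excluded because they vary with batch size.
    """
    weights = n_params_billion * 2    # FP16 weights: 2 bytes/param
    grads = n_params_billion * 2      # FP16 gradients: 2 bytes/param
    optimizer = n_params_billion * 8  # Adam: two FP32 moments, 8 bytes/param
    return weights + grads + optimizer

print(full_ft_memory_gb(7))  # ~84 GB before activations push it past 100 GB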

LoRA (Low-Rank Adaptation)

The key insight: Most of the knowledge in a pre-trained model doesn't need to change for a specific task. Instead of updating all parameters, we can add small "adapter" layers that learn task-specific adjustments.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LoRA: LOW-RANK ADAPTATION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Key Idea: Instead of updating a weight matrix W directly,              │
│  add a low-rank decomposition: W' = W + BA                              │
│                                                                          │
│  Original weight matrix W (frozen):                                      │
│  ┌───────────────────────────────────────┐                              │
│  │                                       │  d × d                        │
│  │                                       │  e.g., 4096 × 4096           │
│  │         W (frozen)                    │  = 16M parameters            │
│  │                                       │                              │
│  │                                       │                              │
│  └───────────────────────────────────────┘                              │
│                                                                          │
│  LoRA adaptation:                                                        │
│  ┌───────┐   ┌───────────────────────────┐                              │
│  │       │   │                           │                              │
│  │   B   │ × │           A               │                              │
│  │       │   │                           │                              │
│  │ d × r │   │         r × d             │                              │
│  └───────┘   └───────────────────────────┘                              │
│                                                                          │
│  Where r = rank (typically 8-64)                                        │
│  Parameters: 2 × d × r = 2 × 4096 × 16 = 131K (vs 16M!)                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Forward pass:                                                           │
│  h = Wx + (α/r)·BAx  (original + scaled adaptation)                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Applied to: Query, Key, Value, and Output projections in attention     │
│  Typical rank: r = 8-64                                                 │
│  Typical alpha: α = 16-32 (scaling factor)                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Why low-rank works: Research shows that the weight updates during fine-tuning have low "intrinsic rank"—they can be well-approximated by low-rank matrices. This means we're not losing much by constraining updates to be low-rank.
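To make this concrete, here is a minimal PyTorch sketch of a LoRA-adapted linear layer, including the α/r scaling covered below. This is illustrative only, not the PEFT library's actual implementation:

Python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: h = Wx + (alpha/r) * B A x, with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W stays frozen
        d_in, d_out = base.in_features, base.out_features
        # A is random-initialized, B is zero-initialized so W' = W at step 0
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r × d
        self.B = nn.Parameter(torch.zeros(d_out, r))        # d × r
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)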

LoRA benefits:

Aspect              Full Fine-Tuning    LoRA
──────────────────  ──────────────────  ───────────────
Trainable params    100%                0.1-1%
Memory (7B model)   ~100 GB             ~16 GB
Storage per task    14 GB               10-100 MB
Training speed      Baseline            Often faster
Quality             Baseline            ~95-99% of full

LoRA configuration explained:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    LoRA HYPERPARAMETERS                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  r (rank):                                                               │
│  ─────────                                                               │
│  • Controls adapter capacity                                            │
│  • Higher r = more parameters = more capacity                           │
│  • Typical values: 8, 16, 32, 64                                        │
│  • Start with r=16, increase if underfitting                            │
│                                                                          │
│  alpha (α):                                                              │
│  ──────────                                                              │
│  • Scaling factor for LoRA output                                       │
│  • Effective scaling: α/r                                               │
│  • Common practice: α = 2r (e.g., r=16, α=32)                          │
│  • Higher α = stronger adaptation                                       │
│                                                                          │
│  target_modules:                                                         │
│  ───────────────                                                         │
│  • Which layers to apply LoRA                                           │
│  • Minimal: ["q_proj", "v_proj"]                                        │
│  • Common: ["q_proj", "k_proj", "v_proj", "o_proj"]                     │
│  • Aggressive: All linear layers                                        │
│  • More modules = more capacity, more memory                            │
│                                                                          │
│  dropout:                                                                │
│  ────────                                                                │
│  • LoRA-specific dropout (regularization)                               │
│  • Typical: 0.05-0.1                                                    │
│  • Helps prevent overfitting on small datasets                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

QLoRA (Quantized LoRA)

The innovation: Combine LoRA with quantization to enable fine-tuning on consumer hardware.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    QLoRA: QUANTIZED LoRA                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Key Ideas:                                                              │
│                                                                          │
│  1. Load base model in 4-bit (NormalFloat4)                             │
│     • Reduces model memory by 4×                                        │
│     • 7B model: 14GB → ~3.5GB                                           │
│                                                                          │
│  2. Add LoRA adapters in higher precision (FP16/BF16)                   │
│     • Adapters are small, precision matters                             │
│     • Only adapters are trained                                         │
│                                                                          │
│  3. Double quantization                                                  │
│     • Quantize the quantization constants too                           │
│     • Additional memory savings                                         │
│                                                                          │
│  4. Paged optimizers                                                     │
│     • Move optimizer states to CPU when GPU is full                     │
│     • Prevents OOM during gradient accumulation                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Memory comparison (7B model):                                          │
│                                                                          │
│  Full FT:     ~100 GB (requires 2× A100 80GB)                           │
│  LoRA FP16:   ~16 GB  (requires 1× A100 80GB or 2× A100 40GB)          │
│  QLoRA 4-bit: ~6 GB   (fits on RTX 4090 or even 3090!)                 │
│                                                                          │
│  Quality: Typically ~98% of full fine-tuning performance               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

QLoRA configuration:

Python
# Example QLoRA setup (transformers + peft + bitsandbytes)
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Quantization: 4-bit NormalFloat with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",  # NormalFloat4
    bnb_4bit_use_double_quant=True,
)

# LoRA adapters trained on top of the frozen 4-bit base
# (r=64, alpha=16 follow the QLoRA paper's settings)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

DoRA (Weight-Decomposed Low-Rank Adaptation)

The insight: LoRA updates magnitude and direction together. DoRA decomposes them, updating direction via LoRA while learning magnitude separately.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DoRA: WEIGHT-DECOMPOSED LoRA                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LoRA: W' = W + BA                                                      │
│        (updates magnitude and direction together)                       │
│                                                                          │
│  DoRA: W' = m · (W + BA) / ||W + BA||                                  │
│             ↑        ↑                                                   │
│             │        └── Direction (normalized, via LoRA)               │
│             └── Magnitude (learnable scalar per output dim)             │
│                                                                          │
│  Key insight: Full fine-tuning primarily changes direction,             │
│  with smaller magnitude changes. DoRA mimics this pattern.              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Benefits over LoRA:                                                     │
│  • Better matches full fine-tuning behavior                             │
│  • Improved performance on many benchmarks                              │
│  • Same memory footprint as LoRA                                        │
│  • Drop-in replacement                                                  │
│                                                                          │
│  Parameters added: 2 × d × r (LoRA) + d (magnitude) ≈ same as LoRA     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
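In practice you rarely implement this by hand: recent versions of Hugging Face's PEFT library expose DoRA as a single flag on the standard LoRA config (assuming a PEFT release with DoRA support):

Python
from peft import LoraConfig

# DoRA as a drop-in LoRA replacement
dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,  # decompose updates into magnitude and direction
    task_type="CAUSAL_LM",
)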

Method Comparison

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FINE-TUNING METHODS COMPARISON                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Method     Memory  Quality  Speed  Multi-task  Hardware                │
│  ──────────────────────────────────────────────────────────────────────│
│  Full FT    100%    100%     1×     Hard        Multi-GPU              │
│  LoRA       15-20%  95-99%   1-2×   Easy        Single GPU             │
│  QLoRA      5-10%   93-98%   0.5×   Easy        Consumer GPU           │
│  DoRA       15-20%  96-99%   1×     Easy        Single GPU             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RECOMMENDATIONS:                                                        │
│                                                                          │
│  Consumer GPU (24GB):  QLoRA for 7B models                              │
│  Workstation (48GB):   LoRA or DoRA for 7B, QLoRA for 13B              │
│  A100 (80GB):          LoRA/DoRA for 13B, QLoRA for 70B                │
│  Multi-GPU cluster:    Full fine-tuning if quality critical            │
│                                                                          │
│  Start with LoRA/QLoRA; move to full FT only if needed.                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Quality: The Most Important Factor

Data quality determines model quality. A smaller model trained on excellent data will outperform a larger model trained on poor data.

What Makes Good SFT Data?

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CHARACTERISTICS OF GOOD SFT DATA                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. CORRECTNESS                                                          │
│     ────────────                                                         │
│     • Responses should be factually accurate                            │
│     • Code should work                                                  │
│     • Reasoning should be valid                                         │
│     • No hallucinations                                                 │
│                                                                          │
│  2. HELPFULNESS                                                          │
│     ────────────                                                         │
│     • Actually addresses the user's need                                │
│     • Appropriate level of detail                                       │
│     • Actionable when appropriate                                       │
│     • Anticipates follow-up questions                                   │
│                                                                          │
│  3. STYLE CONSISTENCY                                                    │
│     ─────────────────                                                    │
│     • Consistent tone across examples                                   │
│     • Consistent formatting                                             │
│     • Matches your desired assistant personality                        │
│                                                                          │
│  4. DIVERSITY                                                            │
│     ─────────                                                            │
│     • Wide range of topics                                              │
│     • Various instruction types (explain, write, analyze, etc.)         │
│     • Different difficulty levels                                       │
│     • Various response lengths                                          │
│                                                                          │
│  5. APPROPRIATE LENGTH                                                   │
│     ────────────────────                                                 │
│     • Not too verbose (wastes training compute)                         │
│     • Not too terse (loses nuance)                                      │
│     • Length should match complexity                                    │
│                                                                          │
│  6. FORMATTING                                                           │
│     ──────────                                                           │
│     • Proper markdown when appropriate                                  │
│     • Code blocks with language tags                                    │
│     • Lists for multiple items                                          │
│     • Structure that aids comprehension                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Sources for SFT

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SFT DATA SOURCES                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. PUBLIC DATASETS                                                      │
│     ─────────────────                                                    │
│     • ShareGPT: Real conversations (varied quality)                     │
│     • OpenAssistant: Crowdsourced dialogues                             │
│     • Dolly: Databricks employee-written                                │
│     • FLAN: Academic task collection                                    │
│     • OpenOrca: Curated reasoning examples                              │
│                                                                          │
│     Pros: Free, diverse, large                                          │
│     Cons: Quality varies, may not match your needs                      │
│                                                                          │
│  2. SYNTHETIC GENERATION                                                 │
│     ────────────────────                                                 │
│     • Generate from stronger models (GPT-4, Claude)                     │
│     • Self-Instruct methodology                                         │
│     • Evol-Instruct for complexity                                      │
│                                                                          │
│     Pros: Unlimited quantity, controllable style                        │
│     Cons: API costs, teacher limitations inherited                      │
│                                                                          │
│  3. HUMAN ANNOTATION                                                     │
│     ─────────────────                                                    │
│     • Expert writers for your domain                                    │
│     • Crowdsourcing with quality control                                │
│     • Internal team annotation                                          │
│                                                                          │
│     Pros: Highest quality, perfect style control                        │
│     Cons: Expensive ($5-50 per example), slow                           │
│                                                                          │
│  4. EXISTING INTERACTIONS                                                │
│     ─────────────────────                                                │
│     • Logs from current system (with consent)                           │
│     • Customer support transcripts                                      │
│     • Historical Q&A data                                               │
│                                                                          │
│     Pros: Real distribution, domain-specific                            │
│     Cons: Privacy concerns, quality filtering needed                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Quality Pipeline

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DATA QUALITY PIPELINE                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RAW DATA                                                                │
│      │                                                                   │
│      ▼                                                                   │
│  ┌────────────────────────────────────────────────────────────────────┐│
│  │ STEP 1: BASIC FILTERING                                            ││
│  │ • Remove very short/very long examples                             ││
│  │ • Remove duplicates                                                ││
│  │ • Remove non-target language                                       ││
│  │ • Remove examples with encoding issues                             ││
│  └───────────────────────────────────────────────────────────────────┘│
│      │                                                                   │
│      ▼                                                                   │
│  ┌────────────────────────────────────────────────────────────────────┐│
│  │ STEP 2: QUALITY SCORING                                            ││
│  │ • LLM-as-judge scoring (GPT-4 rates quality 1-5)                  ││
│  │ • Perplexity filtering (remove nonsense)                          ││
│  │ • Instruction-response alignment (does response match?)           ││
│  │ • Factuality checks (for fact-based responses)                    ││
│  └───────────────────────────────────────────────────────────────────┘│
│      │                                                                   │
│      ▼                                                                   │
│  ┌────────────────────────────────────────────────────────────────────┐│
│  │ STEP 3: CONTENT FILTERING                                          ││
│  │ • Remove toxic/harmful content                                     ││
│  │ • Remove PII (personal information)                                ││
│  │ • Remove copyrighted content                                       ││
│  │ • Remove incorrect/hallucinated facts                              ││
│  └───────────────────────────────────────────────────────────────────┘│
│      │                                                                   │
│      ▼                                                                   │
│  ┌────────────────────────────────────────────────────────────────────┐│
│  │ STEP 4: DIVERSITY BALANCING                                        ││
│  │ • Ensure topic coverage                                            ││
│  │ • Balance instruction types                                        ││
│  │ • Balance difficulty levels                                        ││
│  │ • Deduplicate semantically similar examples                        ││
│  └───────────────────────────────────────────────────────────────────┘│
│      │                                                                   │
│      ▼                                                                   │
│  CLEAN TRAINING DATA                                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
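A minimal sketch of Step 1 (basic filtering); the length thresholds and field names are illustrative assumptions, not fixed rules:

Python
import hashlib

def basic_filter(examples, min_chars=20, max_chars=8000):
    """Length filtering plus exact deduplication.

    Assumes each example is a {"instruction": ..., "response": ...} dict.
    """
    seen, kept = set(), []
    for ex in examples:
        text = ex["instruction"] + "\n" + ex["response"]
        if not (min_chars <= len(text) <= max_chars):
            continue  # too short or too long
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        kept.append(ex)
    return kept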

How Much Data Do You Need?

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DATA QUANTITY GUIDELINES                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Task Type                        Minimum      Good        Ideal        │
│  ────────────────────────────────────────────────────────────────────── │
│  General instruction following    10K         100K        500K+        │
│  Domain-specific assistant        1K          10K         50K          │
│  Single task (classification)     500         2K          10K          │
│  Code generation                  5K          50K         200K+        │
│  Complex reasoning                10K         100K        500K+        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHT: Quality > Quantity                                        │
│                                                                          │
│  1,000 high-quality examples > 100,000 low-quality examples            │
│                                                                          │
│  The Phi models showed that carefully curated data can achieve         │
│  remarkable results with relatively small datasets.                     │
│                                                                          │
│  PRACTICAL APPROACH:                                                     │
│  1. Start with 1,000-5,000 high-quality examples                       │
│  2. Train and evaluate                                                  │
│  3. Identify failure modes                                              │
│  4. Add targeted data for failures                                      │
│  5. Repeat                                                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Instruction Formats

The format of training data significantly impacts model behavior. Different formats encode different assumptions about how conversations should work.

Common Instruction Formats

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    INSTRUCTION FORMATS                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. ALPACA FORMAT                                                        │
│     ──────────────                                                       │
│     ### Instruction:                                                     │
│     {instruction}                                                        │
│                                                                          │
│     ### Input:                                                           │
│     {input}                                                              │
│                                                                          │
│     ### Response:                                                        │
│     {response}                                                           │
│                                                                          │
│     Use when: Simple instruction-following tasks                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2. CHAT/MESSAGES FORMAT                                                │
│     ───────────────────                                                  │
│     <|im_start|>system                                                   │
│     {system_prompt}<|im_end|>                                           │
│     <|im_start|>user                                                     │
│     {user_message}<|im_end|>                                            │
│     <|im_start|>assistant                                                │
│     {assistant_response}<|im_end|>                                      │
│                                                                          │
│     Use when: Multi-turn conversations, role-playing                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  3. LLAMA 3 FORMAT                                                       │
│     ───────────────                                                      │
│     <|begin_of_text|><|start_header_id|>system<|end_header_id|>         │
│                                                                          │
│     {system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>   │
│                                                                          │
│     {user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>│
│                                                                          │
│     {assistant_response}<|eot_id|>                                      │
│                                                                          │
│     Use when: Fine-tuning Llama 3 specifically                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  4. OPENAI MESSAGES FORMAT (for API compatibility)                      │
│     ────────────────────────────────────────────                         │
│     [                                                                    │
│       {"role": "system", "content": "..."},                             │
│       {"role": "user", "content": "..."},                               │
│       {"role": "assistant", "content": "..."}                           │
│     ]                                                                    │
│                                                                          │
│     Use when: Building API-compatible systems                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
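With Hugging Face models you typically don't hand-assemble these templates: the tokenizer's apply_chat_template method renders the messages format into whatever native format the model was trained with. A minimal sketch (the model name is just an example):

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Renders the conversation using the model's own special tokens
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)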

Choosing the Right Format

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    FORMAT SELECTION GUIDE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  RULE 1: Match the base model's format if possible                      │
│  ─────────────────────────────────────────────────                       │
│  • Llama 3 → Use Llama 3 format                                         │
│  • Mistral → Use Mistral format                                         │
│  • This ensures compatibility with model's pre-training                 │
│                                                                          │
│  RULE 2: Be consistent                                                   │
│  ───────────────────                                                     │
│  • Use the same format for all training data                            │
│  • Use the same format at inference time                                │
│  • Inconsistency confuses the model                                     │
│                                                                          │
│  RULE 3: Include system prompts in training if you'll use them         │
│  ─────────────────────────────────────────────────────────────          │
│  • If you plan to use system prompts at inference, train with them     │
│  • Vary the system prompts in training data                             │
│  • This teaches the model to follow system instructions                 │
│                                                                          │
│  RULE 4: Consider special tokens                                        │
│  ───────────────────────────────                                         │
│  • Make sure tokenizer knows all special tokens                         │
│  • Add them to tokenizer if needed                                      │
│  • Be careful with <|endoftext|> and similar                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
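For Rule 4, registering unknown special tokens takes two standard transformers calls; the token strings here are examples:

Python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # example base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register any format-specific tokens the tokenizer doesn't already know
added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)
if added:
    # The embedding matrix must grow to cover the new token ids
    model.resize_token_embeddings(len(tokenizer))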

Multi-Turn Conversations

For conversational assistants, training on multi-turn dialogues is essential:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MULTI-TURN TRAINING DATA                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Single-turn (limited):                                                  │
│  ────────────────────────                                                │
│  User: "What's the capital of France?"                                  │
│  Assistant: "Paris"                                                      │
│                                                                          │
│  Multi-turn (better):                                                    │
│  ─────────────────────                                                   │
│  User: "What's the capital of France?"                                  │
│  Assistant: "The capital of France is Paris."                           │
│  User: "What about Germany?"                                            │
│  Assistant: "The capital of Germany is Berlin."                         │
│  User: "Which one has more people?"                                     │
│  Assistant: "Berlin has a larger population than Paris. Berlin has     │
│              about 3.6 million people in the city proper, while Paris  │
│              has about 2.1 million."                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Benefits of multi-turn:                                                 │
│  • Model learns to use conversation context                             │
│  • Learns to handle follow-up questions                                 │
│  • Learns pronoun resolution ("Which one" = which capital)              │
│  • Learns conversational coherence                                      │
│                                                                          │
│  Best practice: Mix of single-turn and multi-turn in training data     │
│  Ratio: ~30% single-turn, ~70% multi-turn (for conversational use)     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
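In the messages format shown earlier, the multi-turn example above becomes a single training record:

Python
multi_turn_example = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What about Germany?"},
    {"role": "assistant", "content": "The capital of Germany is Berlin."},
    {"role": "user", "content": "Which one has more people?"},
    {"role": "assistant", "content": "Berlin has a larger population than Paris: "
                                     "about 3.6 million people in the city proper "
                                     "vs. about 2.1 million."},
]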

Multi-Task vs. Single-Task Fine-Tuning

Should you fine-tune one model for everything, or separate models for each task?

Single-Task Fine-Tuning

Train a model specifically for one task (e.g., code generation, customer support, medical Q&A).

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SINGLE-TASK FINE-TUNING                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Base Model ────────→ Code Model                                        │
│              Training on code data only                                 │
│                                                                          │
│  Pros:                                                                   │
│  • Maximum performance on target task                                   │
│  • Simpler training (one objective)                                     │
│  • Smaller training dataset needed                                      │
│                                                                          │
│  Cons:                                                                   │
│  • Loses general capabilities                                           │
│  • One model per task = high serving complexity                         │
│  • Can't handle mixed queries                                           │
│                                                                          │
│  Best for:                                                               │
│  • Highly specialized domains                                           │
│  • When task performance is critical                                    │
│  • When general capabilities aren't needed                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Multi-Task Fine-Tuning

Train one model on multiple tasks simultaneously.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MULTI-TASK FINE-TUNING                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│              ┌─────────────────────────────────────────┐                 │
│              │              Training Data              │                 │
│              │   ┌─────────┐ ┌─────────┐ ┌─────────┐   │                 │
│              │   │  Code   │ │  Q&A    │ │ Writing │   │                 │
│              │   │  Tasks  │ │  Tasks  │ │  Tasks  │   │                 │
│              │   └────┬────┘ └────┬────┘ └────┬────┘   │                 │
│              │        └───────────┼───────────┘        │                 │
│              │                    ▼                    │                 │
│  Base Model ─┼──────────→ Multi-task Model             │                 │
│              │                    │                    │                 │
│              │                    ▼                    │                 │
│              │         Handles all task types          │                 │
│              └─────────────────────────────────────────┘                 │
│                                                                          │
│  Pros:                                                                   │
│  • One model serves all tasks                                           │
│  • Simpler deployment                                                   │
│  • Knowledge transfer between tasks                                     │
│  • Preserves generality                                                 │
│                                                                          │
│  Cons:                                                                   │
│  • May underperform specialized models on any single task              │
│  • Balancing tasks during training is tricky                           │
│  • Larger training dataset needed                                       │
│                                                                          │
│  Best for:                                                               │
│  • General-purpose assistants                                           │
│  • When versatility is valued                                           │
│  • When you can't predict which task users will need                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Task Balancing Strategies

When training on multiple tasks, how you balance them matters:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    TASK BALANCING STRATEGIES                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. UNIFORM SAMPLING                                                     │
│     ─────────────────                                                    │
│     Equal probability of sampling each task                              │
│     Risk: Under-training on harder tasks, over-training on easier       │
│                                                                          │
│  2. SIZE-PROPORTIONAL                                                    │
│     ─────────────────                                                    │
│     Sample proportional to dataset size                                  │
│     Risk: Rare but important tasks get under-trained                    │
│                                                                          │
│  3. TEMPERATURE SAMPLING                                                 │
│     ─────────────────────                                                │
│     P(task) ∝ size(task)^(1/T)                                          │
│     T=1: Size-proportional, T=∞: Uniform                                │
│     T=5 is common: Balances between extremes                            │
│                                                                          │
│  4. PERFORMANCE-BASED                                                    │
│     ────────────────────                                                 │
│     Sample more from tasks where model is struggling                    │
│     Requires evaluation during training                                  │
│     Most complex but can be most effective                              │
│                                                                          │
│  5. STAGED TRAINING                                                      │
│     ───────────────                                                      │
│     Train on general data first, then specialize                        │
│     Or: Interleave general and specialized batches                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
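The temperature-sampling rule (strategy 3) is a one-liner; the dataset sizes below are made up:

Python
def task_sampling_probs(sizes, T=5.0):
    """P(task) ∝ size^(1/T): T=1 is size-proportional, large T approaches uniform."""
    weights = [s ** (1.0 / T) for s in sizes]
    total = sum(weights)
    return [w / total for w in weights]

# A 100:10:1 size ratio softens to roughly 0.49 / 0.31 / 0.20 at T=5
print(task_sampling_probs([500_000, 50_000, 5_000]))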

Catastrophic Forgetting: The Silent Killer

One of the biggest challenges in fine-tuning is catastrophic forgetting—the model loses general capabilities while learning specific ones.

Understanding Catastrophic Forgetting

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CATASTROPHIC FORGETTING                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Before fine-tuning:                                                     │
│  ────────────────────                                                    │
│  Base model capabilities:                                                │
│  ✓ General knowledge                                                    │
│  ✓ Reasoning                                                            │
│  ✓ Code generation                                                      │
│  ✓ Creative writing                                                     │
│  ✓ Math                                                                 │
│                                                                          │
│  After aggressive fine-tuning on customer support:                      │
│  ─────────────────────────────────────────────────                       │
│  Model capabilities:                                                     │
│  ✓ Customer support (excellent!)                                        │
│  ✗ General knowledge (degraded)                                         │
│  ✗ Reasoning (degraded)                                                 │
│  ✗ Code generation (broken)                                             │
│  ✗ Creative writing (broken)                                            │
│  ✗ Math (broken)                                                        │
│                                                                          │
│  The model "forgot" how to do things it knew before!                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Why this happens:                                                       │
│  • Neural networks store knowledge distributed across weights           │
│  • Fine-tuning updates weights to optimize for new task                │
│  • These updates can overwrite weights needed for old tasks            │
│  • The narrower your fine-tuning data, the worse the forgetting        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Mitigation Strategies

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    PREVENTING CATASTROPHIC FORGETTING                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. MIX IN GENERAL DATA                                                  │
│     ────────────────────                                                 │
│     Training mix: 70% specific task + 30% general data                  │
│     General data maintains broad capabilities                            │
│     Most effective and most common approach                             │
│                                                                          │
│  2. USE PARAMETER-EFFICIENT METHODS (LoRA)                              │
│     ────────────────────────────────────────                             │
│     LoRA only updates a small subset of parameters                      │
│     Base model weights are frozen                                        │
│     Greatly reduces forgetting by design                                │
│                                                                          │
│  3. LOWER LEARNING RATE                                                  │
│     ────────────────────                                                 │
│     Smaller updates = less disruption                                   │
│     Typical: 10-50% of pre-training LR                                  │
│     Trade-off: Slower convergence                                       │
│                                                                          │
│  4. EARLY STOPPING                                                       │
│     ──────────────                                                       │
│     Monitor general benchmarks during training                          │
│     Stop when general performance starts dropping                       │
│     Requires evaluation infrastructure                                  │
│                                                                          │
│  5. REGULARIZATION                                                       │
│     ──────────────                                                       │
│     L2 regularization toward original weights                           │
│     Elastic Weight Consolidation (EWC)                                  │
│     Adds complexity but can help                                        │
│                                                                          │
│  6. REPLAY                                                               │
│     ───────                                                              │
│     Store examples from pre-training                                    │
│     Mix them into fine-tuning batches                                   │
│     Similar to mixing general data                                      │
│                                                                          │
│  PRACTICAL RECOMMENDATION:                                               │
│  Use LoRA + mix 20-30% general data + lower learning rate              │
│  This combination handles most forgetting issues.                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
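The "mix in general data" strategy maps directly onto interleaving two datasets; a sketch using the Hugging Face datasets library (the dataset names are placeholders):

Python
from datasets import load_dataset, interleave_datasets

task_data = load_dataset("your-org/support-sft", split="train")     # placeholder
general_data = load_dataset("your-org/general-sft", split="train")  # placeholder

# ~70% task-specific, ~30% general, sampled per example
mixed = interleave_datasets(
    [task_data, general_data],
    probabilities=[0.7, 0.3],
    seed=42,
)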

Measuring Forgetting

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MONITORING FOR FORGETTING                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Track performance on these benchmarks BEFORE and AFTER fine-tuning:   │
│                                                                          │
│  General knowledge:                                                      │
│  • MMLU (Massive Multitask Language Understanding)                      │
│  • TriviaQA                                                             │
│  • NaturalQuestions                                                     │
│                                                                          │
│  Reasoning:                                                              │
│  • ARC (AI2 Reasoning Challenge)                                        │
│  • HellaSwag                                                            │
│  • WinoGrande                                                           │
│                                                                          │
│  Code:                                                                   │
│  • HumanEval                                                            │
│  • MBPP                                                                 │
│                                                                          │
│  Math:                                                                   │
│  • GSM8K                                                                │
│  • MATH                                                                 │
│                                                                          │
│  ACCEPTABLE DEGRADATION:                                                 │
│  • <5%: Probably fine                                                   │
│  • 5-10%: Might be acceptable depending on use case                    │
│  • >10%: Significant forgetting, needs mitigation                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
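
The before/after comparison maps directly onto the thresholds above. A minimal sketch with made-up scores (benchmark names as in the table; a negative change is a drop):

Python
def forgetting_report(before: dict, after: dict) -> None:
    """Compare benchmark scores before and after fine-tuning."""
    for bench, base in before.items():
        change = 100 * (after[bench] - base) / base  # negative = regression
        if change < -10:
            verdict = "significant forgetting - mitigate"
        elif change < -5:
            verdict = "borderline - depends on use case"
        else:
            verdict = "probably fine"
        print(f"{bench}: {base:.1f} -> {after[bench]:.1f} ({change:+.1f}%) {verdict}")

# Illustrative numbers only:
forgetting_report(
    before={"MMLU": 62.4, "GSM8K": 38.1, "HumanEval": 31.5},
    after={"MMLU": 61.0, "GSM8K": 32.9, "HumanEval": 31.1},
)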

Evaluation During Fine-Tuning

What to Measure

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    EVALUATION METRICS                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. TRAINING METRICS (during training)                                  │
│     ─────────────────────────────────                                    │
│     • Training loss: Should decrease                                    │
│     • Gradient norm: Should be stable                                   │
│     • Learning rate: Follow schedule                                    │
│                                                                          │
│  2. VALIDATION METRICS (on held-out data)                              │
│     ─────────────────────────────────────                                │
│     • Validation loss: Gap with training shows overfit                 │
│     • Perplexity: Lower is generally better                            │
│                                                                          │
│  3. TASK-SPECIFIC METRICS                                               │
│     ─────────────────────────                                            │
│     • Exact match (for factual Q&A)                                    │
│     • BLEU/ROUGE (for generation)                                       │
│     • Pass@k (for code)                                                 │
│     • Task-specific accuracy                                            │
│                                                                          │
│  4. GENERAL CAPABILITY METRICS                                          │
│     ────────────────────────────                                         │
│     • MMLU score                                                        │
│     • Reasoning benchmarks                                              │
│     • Compare pre/post-SFT scores to catch forgetting                   │
│                                                                          │
│  5. QUALITATIVE EVALUATION                                              │
│     ────────────────────────                                             │
│     • Manual inspection of outputs                                      │
│     • Edge case testing                                                 │
│     • Red teaming for safety                                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
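
One concrete note on perplexity: it is simply the exponential of the mean per-token validation loss (in nats), so it falls out of the numbers you already log. A minimal sketch:

Python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(mean_nll)

print(perplexity(1.8))  # a val loss of 1.8 nats/token -> perplexity ~6.05

With the Hugging Face Trainer this is typically math.exp(trainer.evaluate()["eval_loss"]).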

Evaluation Cadence

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    WHEN TO EVALUATE                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  EVERY STEP:                                                            │
│  • Training loss (automatic)                                            │
│                                                                          │
│  EVERY N STEPS (e.g., 100-500):                                         │
│  • Validation loss                                                      │
│  • Gradient statistics                                                  │
│                                                                          │
│  EVERY EPOCH / MAJOR CHECKPOINT:                                        │
│  • Task-specific benchmarks                                             │
│  • General benchmarks (MMLU, etc.)                                      │
│  • Qualitative inspection                                               │
│                                                                          │
│  BEFORE DEPLOYMENT:                                                      │
│  • Full benchmark suite                                                 │
│  • Safety evaluation                                                    │
│  • Edge case testing                                                    │
│  • A/B testing against baseline                                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Overfitting Detection

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DETECTING OVERFITTING                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Loss curves:                                                            │
│                                                                          │
│  HEALTHY:                                                                │
│  Loss                                                                    │
│    │╲                                                                   │
│    │ ╲  Training                                                        │
│    │  ╲___________                                                      │
│    │   ╲                                                                │
│    │    ╲  Validation (slightly higher, parallel)                       │
│    │     ╲___________                                                   │
│    └──────────────────────→ Steps                                       │
│                                                                          │
│  OVERFITTING:                                                            │
│  Loss                                                                    │
│    │╲                                                                   │
│    │ ╲  Training continues to drop                                      │
│    │  ╲___________________                                              │
│    │   ╲                                                                │
│    │    ╲  Validation           ______                                  │
│    │     ╲_____________________╱        (goes up!)                      │
│    └──────────────────────────────────→ Steps                           │
│                                                                          │
│  If validation loss increases while training loss decreases:            │
│  → Stop training (early stopping)                                       │
│  → Use more regularization (dropout, weight decay)                      │
│  → Get more diverse training data                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
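
In the Hugging Face Transformers stack, the "stop when validation loss turns up" rule is available as a built-in callback. A sketch with illustrative values (early stopping requires load_best_model_at_end with matching eval and save strategies):

Python
from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="sft_out",
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,       # restore the best checkpoint at the end
    metric_for_best_model="eval_loss",
    greater_is_better=False,           # lower loss is better
)

# Stop if eval_loss fails to improve for 3 consecutive evaluations.
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
# Later: Trainer(..., args=training_args, callbacks=[early_stop])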

Practical Training Configuration

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SFT HYPERPARAMETER GUIDE                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LEARNING RATE:                                                          │
│  ──────────────                                                          │
│  Full fine-tuning: 1e-5 to 5e-5                                         │
│  LoRA: 1e-4 to 3e-4                                                     │
│  QLoRA: 2e-4 to 3e-4                                                    │
│                                                                          │
│  Rule: Lower for larger models, higher for LoRA                         │
│                                                                          │
│  BATCH SIZE:                                                             │
│  ───────────                                                             │
│  Effective batch size: 32-128 (use gradient accumulation)              │
│  Per-GPU batch size: 1-8 (depends on memory)                            │
│                                                                          │
│  EPOCHS:                                                                 │
│  ───────                                                                 │
│  General rule: 1-5 epochs                                               │
│  More data → fewer epochs                                               │
│  Watch for overfitting                                                  │
│                                                                          │
│  WARMUP:                                                                 │
│  ───────                                                                 │
│  Steps: 100-500 or 3-10% of total                                       │
│  Helps stability at start                                               │
│                                                                          │
│  SEQUENCE LENGTH:                                                        │
│  ────────────────                                                        │
│  Match your expected use case                                           │
│  Common: 2048-4096 tokens                                               │
│  Longer = more memory, slower training                                  │
│                                                                          │
│  WEIGHT DECAY:                                                           │
│  ─────────────                                                           │
│  Typical: 0.01-0.1                                                      │
│  Helps prevent overfitting                                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE CONFIGURATION (7B model, LoRA, 50K examples):                  │
│                                                                          │
│  learning_rate: 2e-4                                                    │
│  batch_size: 4 (per GPU)                                                │
│  gradient_accumulation_steps: 8                                         │
│  # Effective batch size: 32                                             │
│  num_epochs: 3                                                          │
│  warmup_ratio: 0.03                                                     │
│  max_seq_length: 2048                                                   │
│  lora_r: 16                                                             │
│  lora_alpha: 32                                                         │
│  lora_dropout: 0.05                                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
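
The LoRA half of that example configuration translates directly into a peft LoraConfig. A sketch (target_modules vary by architecture; these are the usual attention projections for Llama-style models):

Python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices
    lora_alpha=32,             # scaling factor (effective scale = alpha/r = 2)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)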

Training Script Structure

Python
# Conceptual training script structure (pseudocode)

def train_sft(
    model_name: str,
    train_data: Dataset,
    eval_data: Dataset,
    output_dir: str,
    config: TrainingConfig
):
    # 1. Load base model
    model = load_model(model_name, quantization=config.quantization)
    tokenizer = load_tokenizer(model_name)

    # 2. Apply LoRA if using
    if config.use_lora:
        model = apply_lora(
            model,
            r=config.lora_r,
            alpha=config.lora_alpha,
            target_modules=config.target_modules
        )

    # 3. Prepare data (format both splits the same way)
    train_dataset = format_and_tokenize(
        train_data,
        tokenizer,
        max_length=config.max_seq_length,
        chat_template=config.chat_template
    )
    eval_dataset = format_and_tokenize(
        eval_data,
        tokenizer,
        max_length=config.max_seq_length,
        chat_template=config.chat_template
    )

    # 4. Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=config.epochs,
        per_device_train_batch_size=config.batch_size,
        gradient_accumulation_steps=config.grad_accum,
        learning_rate=config.lr,
        warmup_ratio=config.warmup_ratio,
        evaluation_strategy="steps",
        eval_steps=config.eval_steps,
        save_strategy="steps",
        save_steps=config.save_steps,
        logging_steps=config.log_steps,
    )

    # 5. Create trainer
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        args=training_args,
    )

    # 6. Train!
    trainer.train()

    # 7. Save
    trainer.save_model(output_dir)

Common Pitfalls and Solutions

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SFT TROUBLESHOOTING GUIDE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PROBLEM: Model outputs are repetitive                                  │
│  ─────────────────────────────────────                                   │
│  Causes:                                                                 │
│  • Training data has repetitive responses                               │
│  • Overfitting to common patterns                                       │
│                                                                          │
│  Solutions:                                                              │
│  • Diversify training data                                              │
│  • Apply a repetition penalty at inference                             │
│  • Train for fewer steps                                                │
│  • Check for duplicates in data                                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PROBLEM: Model ignores instructions                                    │
│  ────────────────────────────────────                                    │
│  Causes:                                                                 │
│  • Wrong chat template                                                  │
│  • System prompt not in training data                                   │
│  • Base model not instruction-tuned                                     │
│                                                                          │
│  Solutions:                                                              │
│  • Verify template matches model expected format                        │
│  • Include diverse instructions in training                             │
│  • Start from instruction-tuned base if possible                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PROBLEM: Model outputs are too long/short                              │
│  ─────────────────────────────────────────                               │
│  Causes:                                                                 │
│  • Training data biased toward certain lengths                          │
│  • Model learned wrong stopping behavior                                │
│                                                                          │
│  Solutions:                                                              │
│  • Include varied lengths in training data                              │
│  • Explicitly train on "be concise" / "be detailed" instructions       │
│  • Check EOS token handling                                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PROBLEM: Training loss not decreasing                                  │
│  ──────────────────────────────────────                                  │
│  Causes:                                                                 │
│  • Learning rate too low                                                │
│  • Data formatting issues                                               │
│  • Gradient issues (NaN, vanishing)                                     │
│                                                                          │
│  Solutions:                                                              │
│  • Increase learning rate                                               │
│  • Check data format and tokenization                                   │
│  • Check for NaN in gradients                                           │
│  • Verify loss calculation is correct                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PROBLEM: Out of memory (OOM)                                           │
│  ────────────────────────────                                            │
│  Solutions:                                                              │
│  • Reduce batch size                                                    │
│  • Enable gradient checkpointing                                        │
│  • Use LoRA/QLoRA instead of full fine-tuning                          │
│  • Reduce sequence length                                               │
│  • Use DeepSpeed or FSDP for distributed training                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
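
Most of those OOM fixes are one-line settings in the Transformers stack. A sketch of the memory-saving knobs (values illustrative):

Python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sft_out",
    per_device_train_batch_size=1,   # shrink the per-GPU batch...
    gradient_accumulation_steps=32,  # ...but keep an effective batch of 32
    gradient_checkpointing=True,     # recompute activations to save memory
    bf16=True,                       # train in bfloat16 where supported
)
# If the model does not pick it up from TrainingArguments:
# model.gradient_checkpointing_enable()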

Summary

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SFT QUICK REFERENCE                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  METHODS:                                                                │
│  • Full fine-tuning: Best quality, highest memory                       │
│  • LoRA: 95-99% quality, much less memory                               │
│  • QLoRA: 93-98% quality, consumer GPU friendly                        │
│  • DoRA: LoRA improvement, same memory                                  │
│                                                                          │
│  DATA:                                                                   │
│  • Quality > Quantity                                                   │
│  • Filter aggressively                                                  │
│  • Match your target format                                             │
│  • Include diverse examples                                             │
│                                                                          │
│  FORGETTING:                                                             │
│  • Mix in 20-30% general data                                           │
│  • Use LoRA (naturally limits forgetting)                               │
│  • Monitor general benchmarks                                           │
│  • Lower learning rate                                                  │
│                                                                          │
│  EVALUATION:                                                             │
│  • Task-specific + general metrics                                      │
│  • Watch for overfitting (val loss increasing)                          │
│  • Qualitative review essential                                         │
│                                                                          │
│  TYPICAL RECIPE:                                                         │
│  1. Start with LoRA (r=16, alpha=32)                                   │
│  2. Use 10K-100K high-quality examples                                  │
│  3. Train for 1-3 epochs with lr=2e-4                                  │
│  4. Mix in 20% general data                                             │
│  5. Evaluate on task + general benchmarks                               │
│  6. Iterate on data quality                                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Merging LoRA Adapters

After training with LoRA, you have two options for deployment: keep adapters separate or merge them into the base model.

When to Merge

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MERGE VS KEEP SEPARATE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  KEEP ADAPTERS SEPARATE:                                                 │
│  ────────────────────────                                                │
│  Base Model + Adapter A → Task A output                                 │
│  Base Model + Adapter B → Task B output                                 │
│  Base Model + Adapter C → Task C output                                 │
│                                                                          │
│  Pros:                                                                   │
│  • One base model, many tasks                                          │
│  • Hot-swap adapters at runtime                                        │
│  • Smaller storage per task                                            │
│  • Easy A/B testing of adapters                                        │
│                                                                          │
│  Cons:                                                                   │
│  • Slight inference overhead                                           │
│  • More complex serving infrastructure                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MERGE INTO BASE MODEL:                                                  │
│  ──────────────────────                                                  │
│  Base Model + Adapter → Merged Model (single model)                    │
│                                                                          │
│  Pros:                                                                   │
│  • No inference overhead                                               │
│  • Simple serving (just one model)                                     │
│  • Compatible with any inference engine                                │
│                                                                          │
│  Cons:                                                                   │
│  • Full model size per task                                            │
│  • Can't combine adapters                                              │
│  • Irreversible (unless you kept originals)                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RECOMMENDATION:                                                         │
│  • Development: Keep separate (easier iteration)                       │
│  • Production single-task: Merge (simpler deployment)                  │
│  • Production multi-task: Keep separate (efficiency)                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
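
With peft, merging is two calls: load the adapter on top of the base model, then fold the low-rank deltas into the weights. A sketch (model name and paths are placeholders):

Python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")

# Folds W' = W + (alpha/r) * B @ A into the base weights, drops the adapter.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged_model")

# Ship the tokenizer alongside the merged weights.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained("path/to/merged_model")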

Combining Multiple Adapters

An interesting capability: combining adapters trained for different tasks:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    COMBINING ADAPTERS                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  APPROACH 1: Weighted Average                                           │
│  ─────────────────────────────                                           │
│  merged_adapter = 0.5 × adapter_A + 0.5 × adapter_B                    │
│                                                                          │
│  Adjust weights based on task importance:                               │
│  • 0.7 × adapter_A + 0.3 × adapter_B  (favor task A)                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  APPROACH 2: Concatenation (for different target modules)              │
│  ─────────────────────────────────────────────────────────               │
│  If adapters target different layers, can use both:                    │
│  adapter_A (attention layers) + adapter_B (FFN layers)                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  APPROACH 3: Sequential Application                                     │
│  ───────────────────────────────────                                     │
│  Apply adapters sequentially (not always supported):                   │
│  output = adapter_B(adapter_A(base_model(input)))                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CAUTION: Adapter combination is experimental. Results vary.           │
│  Best to train a single adapter on combined data when possible.        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
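
peft exposes the weighted-average approach directly, with the same caveat the box gives: treat it as experimental. A sketch (model name, paths, and weights are placeholders):

Python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/adapter_A", adapter_name="task_a")
model.load_adapter("path/to/adapter_B", adapter_name="task_b")

# Weighted average: 0.7 x task_a + 0.3 x task_b (adapter ranks must be compatible).
model.add_weighted_adapter(
    adapters=["task_a", "task_b"],
    weights=[0.7, 0.3],
    adapter_name="combined",
    combination_type="linear",
)
model.set_adapter("combined")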

Domain Adaptation with SFT

Adapting to Specific Domains

When fine-tuning for a specific domain (medical, legal, finance, etc.), special considerations apply:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DOMAIN ADAPTATION STRATEGY                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PHASE 1: CONTINUED PRE-TRAINING (Optional)                            │
│  ──────────────────────────────────────────                              │
│  If the domain uses specialized vocabulary/concepts:                    │
│  • Continue pre-training on domain documents                           │
│  • Use next-token prediction (like pre-training)                       │
│  • Teaches domain vocabulary and facts                                 │
│                                                                          │
│  Data: Raw domain documents (papers, manuals, textbooks)               │
│  Objective: Language modeling                                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PHASE 2: SFT ON DOMAIN TASKS                                          │
│  ─────────────────────────────                                           │
│  Fine-tune on instruction-response pairs for domain:                   │
│  • Domain-specific Q&A                                                 │
│  • Domain document analysis                                            │
│  • Domain-specific generation tasks                                    │
│                                                                          │
│  Data: (instruction, response) pairs in domain                         │
│  Objective: Instruction following                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE: Medical Domain                                                 │
│                                                                          │
│  Phase 1: Train on medical papers, textbooks, guidelines               │
│  Phase 2: Train on:                                                     │
│  • Medical Q&A (patient questions → doctor-like answers)              │
│  • Diagnosis assistance (symptoms → potential conditions)             │
│  • Medical document summarization                                      │
│  • Drug interaction queries                                            │
│                                                                          │
│  Important: Mix in 20-30% general data to prevent forgetting           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
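
Phase 1 is plain language modeling, so it needs a causal-LM data collator rather than an SFT formatter. A sketch of the continued pre-training step (model name and corpus path are placeholders):

Python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Raw domain documents, tokenized as plain text (no instruction formatting).
docs = load_dataset("text", data_files="domain_corpus.txt", split="train")
docs = docs.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # next-token LM
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt_out", num_train_epochs=1),
    train_dataset=docs,
    data_collator=collator,
)
trainer.train()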

Safety Considerations for Domain Models

Domain-specific models have unique safety considerations:

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DOMAIN MODEL SAFETY                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  MEDICAL DOMAIN:                                                         │
│  ───────────────                                                         │
│  • Must not provide diagnoses without disclaimer                       │
│  • Should recommend professional consultation                          │
│  • Should not prescribe medications                                    │
│  • Must handle urgent situations appropriately                         │
│  Train on examples with appropriate caveats!                           │
│                                                                          │
│  LEGAL DOMAIN:                                                           │
│  ──────────────                                                          │
│  • Must clarify this is not legal advice                              │
│  • Should recommend consulting a lawyer                                │
│  • Must not draft legally binding documents without review             │
│  • Jurisdiction-specific cautions                                      │
│                                                                          │
│  FINANCIAL DOMAIN:                                                       │
│  ─────────────────                                                       │
│  • Must disclaim investment advice                                     │
│  • Should not promise returns                                          │
│  • Must handle sensitive financial data carefully                      │
│  • Regulatory compliance considerations                                │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  Include safety-aware responses in training data:                      │
│                                                                          │
│  BAD training example:                                                  │
│  Q: "I have chest pain, what should I do?"                            │
│  A: "Take an aspirin and rest."                                        │
│                                                                          │
│  GOOD training example:                                                 │
│  Q: "I have chest pain, what should I do?"                            │
│  A: "Chest pain can have many causes, some serious. If your pain      │
│      is severe, sudden, or accompanied by shortness of breath,        │
│      please seek emergency medical care immediately. For persistent   │
│      chest discomfort, consult a healthcare provider. I'm an AI       │
│      and cannot diagnose medical conditions."                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

SFT for Code Models

Code fine-tuning has unique characteristics and best practices.

Code-Specific Considerations

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CODE MODEL FINE-TUNING                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  DATA FORMAT OPTIONS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  1. Natural Language to Code:                                           │
│     Instruction: "Write a function that reverses a string"             │
│     Response: def reverse_string(s): return s[::-1]                    │
│                                                                          │
│  2. Code to Explanation:                                                 │
│     Instruction: "Explain this code: [code block]"                     │
│     Response: "This function sorts a list using quicksort..."          │
│                                                                          │
│  3. Code Completion:                                                     │
│     Instruction: "Complete this function: def fib(n):\n    "           │
│     Response: "if n <= 1: return n\n    return fib(n-1) + fib(n-2)"  │
│                                                                          │
│  4. Bug Fixing:                                                          │
│     Instruction: "Fix the bug in this code: [buggy code]"              │
│     Response: [corrected code with explanation]                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXECUTION VERIFICATION:                                                 │
│  ───────────────────────                                                 │
│  Unique to code: we can verify correctness automatically!              │
│                                                                          │
│  1. Generate code solution                                              │
│  2. Execute against test cases                                          │
│  3. Only include passing solutions in training                         │
│                                                                          │
│  This produces higher-quality training data than human annotation.     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LANGUAGE COVERAGE:                                                      │
│  ──────────────────                                                      │
│  Include multiple programming languages:                                │
│  • Python (most data available)                                        │
│  • JavaScript/TypeScript                                                │
│  • Java, C++, Go, Rust                                                 │
│  • SQL                                                                 │
│  • Shell/Bash                                                          │
│                                                                          │
│  Balance matters: Don't let one language dominate.                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
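
A minimal sketch of the execute-and-filter loop: run each candidate plus its tests in a subprocess with a timeout, and keep only the passers. (Properly sandboxing untrusted generated code takes far more care than this; function and variable names here are illustrative.)

Python
import os
import subprocess
import sys
import tempfile

def passes_tests(solution: str, test_code: str, timeout: float = 5.0) -> bool:
    """Run candidate code plus its asserts; True only on a clean exit."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_tests(candidate, tests))  # True -> keep this pair for training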

Code Quality in Training Data

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CODE DATA QUALITY                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  GOOD code training examples have:                                       │
│  ✓ Correct output (verified by execution)                              │
│  ✓ Good variable names (readable code)                                 │
│  ✓ Appropriate comments                                                 │
│  ✓ Error handling where appropriate                                    │
│  ✓ Efficient algorithms (not unnecessarily slow)                       │
│  ✓ Modern idioms for the language                                      │
│                                                                          │
│  BAD code to avoid in training:                                          │
│  ✗ Buggy code (obviously)                                               │
│  ✗ Deprecated patterns                                                  │
│  ✗ Security vulnerabilities                                            │
│  ✗ Poor style (inconsistent formatting)                                │
│  ✗ Over-engineered solutions                                           │
│  ✗ Missing error handling for production code                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FILTERING PIPELINE FOR CODE:                                            │
│                                                                          │
│  1. Syntax check (must parse)                                          │
│  2. Execute against test cases (must pass)                             │
│  3. Lint check (reasonable code quality)                               │
│  4. Security scan (no obvious vulnerabilities)                         │
│  5. Complexity filter (not overly simple or complex)                   │
│  6. Deduplication (many similar solutions exist)                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
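
The first and last steps of that pipeline are cheap to implement in pure Python: ast.parse as the syntax gate and hashing for exact deduplication (the lint, security, and complexity steps would plug in between). A minimal sketch:

Python
import ast
import hashlib

def syntax_ok(code: str) -> bool:
    """Step 1: the code must at least parse."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def dedupe(examples: list[str]) -> list[str]:
    """Step 6: drop exact duplicates (near-dup detection would go further)."""
    seen, kept = set(), []
    for code in examples:
        digest = hashlib.sha256(code.strip().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(code)
    return kept

raw = ["def f(x): return x*2", "def f(x) return x", "def f(x): return x*2"]
clean = dedupe([c for c in raw if syntax_ok(c)])
print(len(clean))  # 1 -- one syntax error dropped, one exact duplicate dropped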

Deploying Fine-Tuned Models

Model Export and Formats

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT FORMATS                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  HUGGING FACE FORMAT:                                                    │
│  ────────────────────                                                    │
│  • Standard for training/experimentation                               │
│  • Easy to share and version                                           │
│  • Direct use with transformers library                                │
│                                                                          │
│  GGUF (for llama.cpp):                                                  │
│  ─────────────────────                                                   │
│  • Optimized for CPU inference                                         │
│  • Supports various quantization levels (Q4, Q5, Q8)                  │
│  • Good for edge/local deployment                                      │
│                                                                          │
│  VLLM FORMAT:                                                            │
│  ─────────────                                                           │
│  • Optimized for high-throughput GPU serving                          │
│  • PagedAttention for efficient memory                                 │
│  • Best for production GPU deployments                                 │
│                                                                          │
│  TENSORRT-LLM:                                                           │
│  ─────────────                                                           │
│  • NVIDIA-optimized                                                     │
│  • Maximum performance on NVIDIA GPUs                                  │
│  • Requires compilation step                                           │
│                                                                          │
│  ONNX:                                                                   │
│  ─────                                                                   │
│  • Cross-platform                                                       │
│  • Can run on various backends                                         │
│  • May have compatibility issues with newer architectures              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Quantization for Deployment

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    POST-TRAINING QUANTIZATION                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  After SFT, you may want to quantize for deployment efficiency:        │
│                                                                          │
│  QUANTIZATION LEVELS (for a 7B model):                                  │
│  ─────────────────────────────────────                                   │
│  FP16:   14 GB  (100% quality baseline)                                │
│  INT8:   7 GB   (~99% quality)                                         │
│  INT4:   3.5 GB (~95-98% quality)                                      │
│  INT3:   2.6 GB (~90-95% quality)                                      │
│  INT2:   1.8 GB (~80-90% quality, experimental)                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  METHODS:                                                                │
│                                                                          │
│  GPTQ: Fast inference, needs calibration data                          │
│  AWQ: Activation-aware, good quality at low bits                       │
│  GGML/GGUF: Good for CPU, various quantization options                │
│  bitsandbytes: Easy integration, used for training too                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RECOMMENDATION:                                                         │
│  • GPU serving: INT8 or INT4 with GPTQ/AWQ                             │
│  • CPU serving: GGUF Q4_K_M or Q5_K_M                                  │
│  • Edge devices: Smallest that maintains acceptable quality            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
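
For the bitsandbytes route, 4-bit loading is a configuration flag in Transformers. A sketch (the checkpoint path is a placeholder; GPTQ and AWQ use their own conversion toolchains):

Python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality
    bnb_4bit_use_double_quant=True,         # also quantize quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/merged_model",                 # your fine-tuned, merged checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)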
