
Fine-Tuning Workflows & Best Practices: A Practical Guide for LLM Customization

Comprehensive guide to fine-tuning LLMs including LoRA, QLoRA, and full fine-tuning. Covers data preparation, hyperparameter selection, evaluation strategies, common pitfalls, and 2025 tools like Unsloth, Axolotl, and LLaMA-Factory.


Fine-Tuning Workflows & Best Practices

Fine-tuning transforms general-purpose LLMs into specialized tools for your specific use case. A 7B model fine-tuned on your domain data can outperform a 70B general model for your tasks—while being faster and cheaper to run. But fine-tuning done poorly wastes compute, degrades quality, or creates models that fail in production.

This guide provides practical workflows for fine-tuning LLMs: when to fine-tune, data preparation, choosing between LoRA and full fine-tuning, hyperparameter selection, evaluation, and the 2025 tool landscape.


When to Fine-Tune

Fine-tuning is powerful but not always necessary. When you do fine-tune, start by testing with LoRA or QLoRA; if the task doesn't improve there, it almost certainly won't improve with full fine-tuning either.

Good Candidates for Fine-Tuning

Domain-specific vocabulary and concepts: Medical, legal, financial, and technical domains have specialized terminology that general models handle poorly. Fine-tuning teaches the model your domain's language.

Consistent output format: If you need structured outputs in a specific format—particular JSON schemas, report templates, or code styles—fine-tuning enforces consistency better than prompting.

Persona and tone: Enterprise assistants, brand voices, and specialized interaction styles benefit from fine-tuning. The model internalizes the persona rather than requiring elaborate prompts.

Task-specific behavior: When you need the model to excel at a narrow task (classification, extraction, summarization in your domain), fine-tuning focuses its capabilities.

Poor Candidates for Fine-Tuning

Knowledge injection: Fine-tuning doesn't reliably teach new facts. For knowledge updates, use RAG instead.

Simple instructions: If few-shot prompting achieves your goal, fine-tuning adds unnecessary complexity.

Rapidly changing requirements: Fine-tuning creates a static model. If your needs change frequently, prompt engineering is more agile.

Insufficient data: Fine-tuning needs quality examples. If you have fewer than a few hundred examples, focus on prompting first.

The Decision Framework

Start small: 2-5k quality examples + LoRA + RAG will get you 80% of value quickly. Only escalate to full fine-tuning if LoRA proves insufficient.


Data Preparation

Data quality and formatting matter more than most hyperparameters. Poor data produces poor models regardless of compute invested.

Dataset Structure

Fine-tuning datasets typically have two columns: input (instruction/question) and output (response). For chat models, use conversation format with alternating user/assistant messages.

Instruction format:

Code
Input: "Summarize this contract clause in plain English: [clause text]"
Output: "This clause means that..."

Chat format:

Code
User: "What does this clause mean?"
Assistant: "This clause establishes..."
User: "What are the implications?"
Assistant: "The main implications are..."

Quality Guidelines

Quality matters far more than quantity, with ~1,000 high-quality examples often sufficient.

Accuracy: Every example should demonstrate correct behavior. Wrong examples teach wrong behavior.

Diversity: Cover the range of scenarios the model will encounter. Don't over-represent any single pattern.

Representativeness: Examples should match production input distribution. If users ask questions informally, training data should include informal questions.

Completeness: Outputs should be complete, not truncated. Incomplete outputs teach the model to generate incomplete responses.

Data Cleaning

Remove duplicates: Exact and near-duplicates inflate training and can cause overfitting.

Check for PII: Remove personally identifiable information unless necessary for the task.

Validate format: Ensure consistent formatting across examples. Inconsistent formatting confuses the model.

Filter low quality: Remove examples that are ambiguous, incorrect, or poorly written.
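
As a quick illustration of the deduplication step, here is a minimal sketch that drops exact duplicates from a JSONL dataset (file names and record layout are illustrative; catching near-duplicates needs fuzzier matching such as MinHash):

Python
import hashlib
import json

def dedupe_jsonl(in_path: str, out_path: str) -> None:
    """Drop exact duplicate examples from a JSONL fine-tuning dataset."""
    seen = set()
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            example = json.loads(line)
            # Hash the whole example (keys sorted) to detect exact duplicates.
            key = hashlib.sha256(
                json.dumps(example, sort_keys=True).encode("utf-8")
            ).hexdigest()
            if key not in seen:
                seen.add(key)
                dst.write(json.dumps(example, ensure_ascii=False) + "\n")
                kept += 1
    print(f"kept {kept} unique examples")

dedupe_jsonl("train_raw.jsonl", "train_dedup.jsonl")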

Dataset Size Guidelines

Prepare a minimum of ~1,000 examples, prioritizing quality over quantity with an 80/20 train/validation split.

Use Case              | Recommended Size
Style/tone adaptation | 500-2,000
Task specialization   | 1,000-5,000
Domain adaptation     | 5,000-20,000
New capability        | 10,000+

More data helps, but with diminishing returns. High-quality data is critical—10,000 quality examples outperform 100,000 noisy examples.


Fine-Tuning Methods

LoRA (Low-Rank Adaptation)

LoRA adds tiny low-rank "adapters" to a frozen model so you only train a sliver of parameters. It's fast, modular, and usually within striking distance of full fine-tuning quality.

How it works: Instead of updating all model weights, LoRA adds small trainable matrices to specific layers (typically attention layers). These matrices have low rank, meaning they have far fewer parameters than the original weight matrices. For example, fine-tuning a 13B-parameter model with LoRA may only require updating 1-2% of parameters.

Key hyperparameters:

  • Rank (r): Dimension of the low-rank matrices. Higher rank = more capacity but more parameters. Start with r=8 or r=16.
  • Alpha: Scaling factor for LoRA updates. Typically set to 2x rank (alpha=16 for r=8).
  • Target modules: Which layers to apply LoRA to. Default is attention layers; applying to all linear layers can improve quality.
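
These hyperparameters map directly onto Hugging Face PEFT's LoraConfig. A minimal sketch (the model name is illustrative):

Python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor, here 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically on the order of 1-2% of all parameters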

Advantages:

  • Trains on single GPU (even consumer GPUs for smaller models)
  • Multiple adapters can be merged or swapped
  • Base model stays frozen—no catastrophic forgetting of general capabilities

Disadvantages:

  • Slightly lower quality ceiling than full fine-tuning for large datasets
  • Requires careful rank selection for optimal results

QLoRA (Quantized LoRA)

QLoRA quantizes the pretrained weights to 4-bit precision and uses paged optimizers to handle memory spikes.

How it works: The base model is loaded in 4-bit precision (using NormalFloat4 quantization), dramatically reducing memory requirements. LoRA adapters are trained in higher precision, then combined with the quantized base for inference.

Memory savings: QLoRA saves roughly 33% of GPU memory compared to standard LoRA. This is what makes it possible to fine-tune 65B-parameter models on a single 48GB GPU.

Tradeoffs: QLoRA comes at a 39% increased training runtime caused by additional quantization and dequantization during training. The quality difference from standard LoRA is typically minimal.

When to use: If your base model already fits comfortably in GPU memory, start with LoRA; if you need to squeeze very large models onto limited VRAM, go QLoRA.
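
A minimal QLoRA-style loading sketch with transformers and bitsandbytes (the model name is illustrative); LoRA adapters are then attached to the 4-bit base exactly as in the LoRA example above:

Python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
# Attach LoRA adapters (trained in bf16) with get_peft_model as usual.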

Full Fine-Tuning

Full fine-tuning involves training the entire model, meaning all layers are adjusted during training.

When it makes sense:

  • Large training datasets (50k+ examples)
  • Significant distribution shift from pretraining data
  • Maximum quality is critical and compute is not constrained
  • You need to modify fundamental model behaviors

Resource requirements: Full fine-tuning requires roughly 4x the memory of inference (for gradients, optimizer states, etc.). A 7B model needs ~56GB for full fine-tuning; 70B models need hundreds of GB.

A common mistake is jumping straight into full fine-tuning, which is compute-heavy. Test with LoRA first.

DoRA (Weight-Decomposed Low-Rank Adaptation)

DoRA is an NVIDIA-developed improvement over LoRA that decomposes pretrained weights into magnitude and direction components, achieving quality closer to full fine-tuning.

How it works: DoRA can be described in two steps:

  1. Decompose pretrained weight matrix into a magnitude vector (m) and directional matrix (V)
  2. Apply LoRA to the directional matrix V while training the magnitude vector m separately

Why it works: The DoRA authors found that LoRA increases or decreases magnitude and direction updates proportionally, but lacks the capability to make subtle directional changes that full fine-tuning achieves. DoRA decouples these, enabling more precise adaptation.

Performance improvements:

  • +3.7/+1.0 on Llama 7B/13B for common-sense reasoning
  • +2.9 on Llama 2 7B, +4.4 on Llama 3 8B
  • Improvements on vision-language tasks (+0.9/+1.9 on VL-BART)

Overhead: Introducing the magnitude vector m adds only 0.01% more parameters compared to LoRA—negligible overhead for meaningful quality improvement.

Hyperparameter recommendations:

  • Start with slightly lower learning rate than LoRA
  • Experiment with varying dropout ratios
  • Can often achieve comparable results with half the rank of LoRA

When to use: DoRA is recommended when you want LoRA's efficiency but need quality closer to full fine-tuning. It's now supported in Hugging Face PEFT and major fine-tuning frameworks.
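
In PEFT, DoRA is a single flag on LoraConfig. A sketch (the rank shown assumes you are halving your usual LoRA rank, per the recommendation above):

Python
from peft import LoraConfig

dora_config = LoraConfig(
    r=8,                  # DoRA can often match LoRA quality at roughly half the rank
    lora_alpha=16,
    use_dora=True,        # decompose weights into magnitude and direction
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)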

Method Comparison

Method  | Memory    | Quality   | Speed  | Trainable Parameters | Use Case
LoRA    | Low       | Good      | Fast   | 1-2%                 | Default choice
DoRA    | Low       | Very Good | Fast   | 1-2% + 0.01%         | Higher quality than LoRA
QLoRA   | Very Low  | Good      | Medium | 1-2%                 | Large models on limited VRAM
Full FT | Very High | Best      | Slow   | 100%                 | Large datasets, max quality

Dataset Formats

Different fine-tuning frameworks expect data in specific formats. Understanding these formats prevents frustrating preprocessing errors.

Alpaca Format

The original Stanford Alpaca format, widely supported:

JSON
{
  "instruction": "Summarize the following text",
  "input": "The quick brown fox...",
  "output": "A fox jumped over a dog."
}

Fields:

  • instruction: The task description
  • input: Optional additional context
  • output: Expected response

Best for: Single-turn instruction following tasks.

ShareGPT Format

Multi-turn conversation format, used by many chat models:

JSON
{
  "conversations": [
    {"from": "human", "value": "What is machine learning?"},
    {"from": "gpt", "value": "Machine learning is..."},
    {"from": "human", "value": "Can you give an example?"},
    {"from": "gpt", "value": "Sure, consider..."}
  ]
}

Variations:

  • from can be human/gpt, user/assistant, or system/user/assistant
  • Some formats use role/content instead of from/value

Best for: Multi-turn chat fine-tuning, conversational models.

OpenAI Chat Format

The format used by OpenAI's fine-tuning API:

JSON
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."}
  ]
}

Best for: Compatibility with OpenAI fine-tuning, chat models following OpenAI conventions.

Completion Format

Simple input-output pairs for completion tasks:

JSON
{
  "prompt": "Translate to French: Hello",
  "completion": "Bonjour"
}

Best for: Simple completion tasks, base model fine-tuning.

Format Conversion Tips

Framework requirements:

  • Axolotl: Supports Alpaca, ShareGPT, completion formats natively
  • LLaMA-Factory: GUI-based format selection, supports most formats
  • Unsloth: Expects chat template format, auto-converts common formats
  • TRL SFTTrainer: Flexible, accepts most formats with proper configuration
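
For example, with a recent version of TRL, a dataset already in the OpenAI chat format shown above needs very little glue code (model and file names are illustrative):

Python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# A "messages" column in OpenAI chat format is treated as a conversational
# dataset; the model's chat template is applied automatically.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="./sft-output"),
)
trainer.train()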

Common pitfalls:

  • Inconsistent field names across examples
  • Missing required fields
  • Incorrect chat template application
  • Wrong special tokens

Validation: Always validate a sample of your dataset before training. Check that:

  • All required fields are present
  • Field names are consistent
  • Content is properly escaped
  • No truncation will occur during tokenization
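
A minimal validation pass, assuming OpenAI-style chat records in a JSONL file and an illustrative context budget:

Python
import json
from transformers import AutoTokenizer

MAX_TOKENS = 2048  # illustrative training context limit
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

with open("train.jsonl") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        messages = example.get("messages")
        assert messages, f"example {i}: missing 'messages' field"
        for msg in messages:
            assert set(msg) == {"role", "content"}, f"example {i}: inconsistent field names"
            assert msg["role"] in {"system", "user", "assistant"}, f"example {i}: bad role"
        # Apply the chat template and check length before training.
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        n_tokens = len(tokenizer(text).input_ids)
        if n_tokens > MAX_TOKENS:
            print(f"example {i} would be truncated ({n_tokens} tokens)")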

Hyperparameter Selection

Learning Rate

Learning rate is the most important hyperparameter. Too high causes instability; too low causes slow convergence.

Starting points:

  • LoRA: 1e-4 to 3e-4
  • Full fine-tuning: 1e-5 to 5e-5

Scheduling: Use linear warmup (5-10% of steps) followed by cosine decay or linear decay.

Batch Size

Larger batch sizes provide more stable gradients but require more memory.

Effective batch size: Accumulate gradients across multiple small batches to simulate larger batches. effective_batch = batch_size * gradient_accumulation_steps * num_gpus

Typical values: Effective batch sizes of 32-128 work well for most tasks.

Epochs

How many times to iterate through the dataset.

Typical values: 1-5 epochs. More epochs risk overfitting; fewer epochs risk underfitting.

Early stopping: Monitor validation loss and stop when it starts increasing (overfitting signal).
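
Pulling the above together, here is a hedged starting point using transformers TrainingArguments (assumes a recent transformers version; all values are starting points, not prescriptions):

Python
from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=2e-4,              # LoRA range; use ~2e-5 for full fine-tuning
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,               # 5% linear warmup
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,  # effective batch = 4 * 16 = 64 on one GPU
    num_train_epochs=3,
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,     # needed for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
# Pass training_args and callbacks=[early_stopping] to your Trainer / SFTTrainer.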

LoRA-Specific Parameters

The QLoRA authors found that applying LoRA adapters to all linear transformer layers, not just the query, key, and value projections, gives the best results.

Rank (r): Start with 8-16. Increase if underfitting; decrease if overfitting or for smaller adapters.

Target modules: For best quality, apply to all linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). For smaller adapters, attention-only (q_proj, v_proj) often suffices.


The Fine-Tuning Pipeline

Stage 1: Dataset Preparation

  1. Collect and curate examples
  2. Clean and validate data
  3. Format for training (instruction format or chat format)
  4. Split into train/validation (80/20 or 90/10)
  5. Inspect samples to verify quality
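
A minimal split-and-inspect sketch using the datasets library (file names are illustrative):

Python
from datasets import load_dataset

dataset = load_dataset("json", data_files="cleaned.jsonl", split="train")
splits = dataset.train_test_split(test_size=0.2, seed=42)  # 80/20 split

splits["train"].to_json("train.jsonl")
splits["test"].to_json("validation.jsonl")

# Eyeball a few samples before committing GPU time.
for example in splits["train"].select(range(3)):
    print(example)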

Stage 2: Model Selection

Choose your base model based on:

  • Required capabilities (reasoning, coding, multilingual)
  • Size constraints (inference latency, memory)
  • License requirements (commercial use, modifications)

Starting with Instruct models is recommended as they allow direct fine-tuning using conversational chat templates and require less data compared to base models.

Stage 3: Training Environment Setup

Set up training infrastructure:

  • GPU selection (A100, H100 for production; consumer GPUs for experimentation)
  • Framework selection (see Tools section)
  • Experiment tracking (Weights & Biases, MLflow)

Stage 4: Training

Run training with monitoring:

  • Track training loss (should decrease smoothly)
  • Track validation loss (should decrease, then plateau)
  • Watch for divergence or oscillation (learning rate too high)
  • Save checkpoints regularly

Stage 5: Evaluation

Evaluate the fine-tuned model:

  • Task-specific metrics (accuracy, F1, BLEU, ROUGE)
  • Human evaluation for quality assessment
  • Comparison against base model and prompting baselines
  • Regression testing on general capabilities

Stage 6: Deployment

Deploy the model:

  • Merge LoRA adapters with base model (optional, simplifies serving)
  • Quantize for inference efficiency
  • Set up serving infrastructure
  • Monitor production metrics
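
Merging a trained LoRA adapter into the base model with PEFT looks roughly like this (paths and model name are illustrative):

Python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "./checkpoints/lora-adapter")

merged = model.merge_and_unload()  # folds the adapter weights into the base model
merged.save_pretrained("./merged-model")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.save_pretrained("./merged-model")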

Stage 7: Monitoring and Iteration

Continuously improve:

  • Collect production feedback
  • Identify failure cases
  • Add examples addressing failures
  • Retrain periodically

Tools and Frameworks

Unsloth

Unsloth provides dramatically faster training through hand-written Triton kernels, with 2-5x speedups and around 60% memory reduction compared to standard implementations.

Strengths: Fastest training speeds, excellent documentation, easy setup.

Best for: Rapid experimentation, teams wanting maximum training efficiency.
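
A minimal Unsloth loading sketch (model name illustrative; check the Unsloth docs for current APIs and supported models):

Python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA-style 4-bit base model
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# The returned model plugs into TRL's SFTTrainer as usual.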

Axolotl

Axolotl provides a config-driven approach to fine-tuning with extensive customization options.

Strengths: Highly configurable, supports many model architectures, active community.

Best for: Teams needing flexibility and customization, complex training setups.

LLaMA-Factory

LLaMA-Factory offers a web UI for fine-tuning, making it accessible to non-experts.

Strengths: User-friendly interface, broad model support, integrated evaluation.

Best for: Teams new to fine-tuning, rapid prototyping.

Hugging Face PEFT

PEFT (Parameter-Efficient Fine-Tuning) is the foundational library for LoRA and other efficient fine-tuning methods.

Strengths: Industry standard, extensive documentation, broad compatibility.

Best for: Production deployments, teams wanting full control.

Tool Selection

All of these 2025 tools combine efficient training methods (LoRA, QLoRA) with experiment-tracking integrations; choose based on your primary constraint:

Need                | Recommended Tool
Maximum speed       | Unsloth
Maximum flexibility | Axolotl
Ease of use         | LLaMA-Factory
Production control  | Hugging Face PEFT

Common Pitfalls

Overfitting

Symptoms: Training loss continues decreasing but validation loss increases. Model memorizes training data rather than learning patterns.

Solutions:

  • Add more diverse training data
  • Reduce epochs
  • Increase dropout
  • Reduce LoRA rank
  • Use early stopping

Catastrophic Forgetting

Symptoms: Model loses general capabilities after fine-tuning. Performs well on fine-tuning tasks but poorly on general tasks.

Solutions:

  • Use LoRA (keeps base model frozen)
  • Mix general examples with task-specific examples
  • Use lower learning rates
  • Train for fewer epochs

Format Learning Instead of Task Learning

Symptoms: Model learns to produce outputs in the right format but with wrong content.

Solutions:

  • Ensure training data has diverse content
  • Include negative examples
  • Evaluate on held-out content, not just format

Data Contamination

Symptoms: Suspiciously high evaluation scores. Model may have seen evaluation data during training.

Solutions:

  • Carefully separate train/eval data
  • Use temporal splits when possible
  • Create novel evaluation examples

Evaluation Strategies

Automated Metrics

Task-specific metrics: Accuracy for classification, F1 for extraction, BLEU/ROUGE for generation.

Perplexity: Lower is better. Compare against base model to ensure improvement.

Format compliance: What percentage of outputs match required format?
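
Format compliance is easy to automate. A sketch assuming the model should emit JSON with an illustrative required schema:

Python
import json

REQUIRED_KEYS = {"summary", "risk_level"}  # illustrative schema

def format_compliance(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as JSON and contain the required keys."""
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys():
            ok += 1
    return ok / len(outputs) if outputs else 0.0

# e.g. format_compliance(model_outputs) == 0.97 -> 97% of outputs are compliant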

LLM-as-Judge

Use a capable model (GPT-4, Claude) to evaluate fine-tuned model outputs:

  • Rate response quality on multiple dimensions
  • Compare fine-tuned vs. base model outputs
  • Identify specific failure modes

Human Evaluation

A/B testing: Show evaluators outputs from fine-tuned and base models. Which do they prefer?

Rating scales: Rate specific dimensions (accuracy, helpfulness, fluency).

Error analysis: Categorize failures to guide data collection.

Regression Testing

Ensure fine-tuning didn't break general capabilities:

  • Test on standard benchmarks (MMLU, HumanEval)
  • Compare against base model
  • Acceptable degradation depends on use case


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
