Fine-Tuning Workflows & Best Practices: A Practical Guide for LLM Customization
Comprehensive guide to fine-tuning LLMs including LoRA, QLoRA, and full fine-tuning. Covers data preparation, hyperparameter selection, evaluation strategies, common pitfalls, and 2025 tools like Unsloth, Axolotl, and LLaMA-Factory.
Fine-tuning transforms general-purpose LLMs into specialized tools for your specific use case. A 7B model fine-tuned on your domain data can outperform a 70B general model for your tasks—while being faster and cheaper to run. But fine-tuning done poorly wastes compute, degrades quality, or creates models that fail in production.
This guide provides practical workflows for fine-tuning LLMs: when to fine-tune, data preparation, choosing between LoRA and full fine-tuning, hyperparameter selection, evaluation, and the 2025 tool landscape.
When to Fine-Tune
Fine-tuning is powerful but not always necessary. When you do fine-tune, start by testing with LoRA or QLoRA: if a task doesn't improve with a parameter-efficient method, it is unlikely to improve much with full fine-tuning either.
Good Candidates for Fine-Tuning
Domain-specific vocabulary and concepts: Medical, legal, financial, and technical domains have specialized terminology that general models handle poorly. Fine-tuning teaches the model your domain's language.
Consistent output format: If you need structured outputs in a specific format—particular JSON schemas, report templates, or code styles—fine-tuning enforces consistency better than prompting.
Persona and tone: Enterprise assistants, brand voices, and specialized interaction styles benefit from fine-tuning. The model internalizes the persona rather than requiring elaborate prompts.
Task-specific behavior: When you need the model to excel at a narrow task (classification, extraction, summarization in your domain), fine-tuning focuses its capabilities.
Poor Candidates for Fine-Tuning
Knowledge injection: Fine-tuning doesn't reliably teach new facts. For knowledge updates, use RAG instead.
Simple instructions: If few-shot prompting achieves your goal, fine-tuning adds unnecessary complexity.
Rapidly changing requirements: Fine-tuning creates a static model. If your needs change frequently, prompt engineering is more agile.
Insufficient data: Fine-tuning needs quality examples. If you have fewer than a few hundred examples, focus on prompting first.
The Decision Framework
Start small: 2-5k quality examples + LoRA + RAG will get you 80% of value quickly. Only escalate to full fine-tuning if LoRA proves insufficient.
Data Preparation
Data quality and formatting matter more than most hyperparameters. Poor data produces poor models regardless of compute invested.
Dataset Structure
Fine-tuning datasets typically have two columns: input (instruction/question) and output (response). For chat models, use conversation format with alternating user/assistant messages.
Instruction format:
Input: "Summarize this contract clause in plain English: [clause text]"
Output: "This clause means that..."
Chat format:
User: "What does this clause mean?"
Assistant: "This clause establishes..."
User: "What are the implications?"
Assistant: "The main implications are..."
Quality Guidelines
Quality matters far more than quantity, with ~1,000 high-quality examples often sufficient.
Accuracy: Every example should demonstrate correct behavior. Wrong examples teach wrong behavior.
Diversity: Cover the range of scenarios the model will encounter. Don't over-represent any single pattern.
Representativeness: Examples should match production input distribution. If users ask questions informally, training data should include informal questions.
Completeness: Outputs should be complete, not truncated. Incomplete outputs teach the model to generate incomplete responses.
Data Cleaning
Remove duplicates: Exact and near-duplicates inflate training and can cause overfitting.
Check for PII: Remove personally identifiable information unless necessary for the task.
Validate format: Ensure consistent formatting across examples. Inconsistent formatting confuses the model.
Filter low quality: Remove examples that are ambiguous, incorrect, or poorly written.
Dataset Size Guidelines
As a rule of thumb, prepare at least ~1,000 examples, prioritize quality over quantity, and hold out a validation set (e.g., an 80/20 train/validation split).
| Use Case | Recommended Size (examples) |
|---|---|
| Style/tone adaptation | 500-2,000 |
| Task specialization | 1,000-5,000 |
| Domain adaptation | 5,000-20,000 |
| New capability | 10,000+ |
More data helps, but with diminishing returns. High-quality data is critical—10,000 quality examples outperform 100,000 noisy examples.
Fine-Tuning Methods
LoRA (Low-Rank Adaptation)
LoRA adds tiny low-rank "adapters" to a frozen model so you only train a sliver of parameters. It's fast, modular, and usually within striking distance of full fine-tuning quality.
How it works: Instead of updating all model weights, LoRA adds small trainable matrices to specific layers (typically attention layers). These matrices have low rank, meaning they have far fewer parameters than the original weight matrices. For example, fine-tuning a 13B-parameter model with LoRA may only require updating 1-2% of parameters.
Key hyperparameters (a configuration sketch follows this list):
- Rank (r): Dimension of the low-rank matrices. Higher rank = more capacity but more parameters. Start with r=8 or r=16.
- Alpha: Scaling factor for LoRA updates. Typically set to 2x rank (alpha=16 for r=8).
- Target modules: Which layers to apply LoRA to. Default is attention layers; applying to all linear layers can improve quality.
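For concreteness, here is a minimal sketch of how these hyperparameters map onto Hugging Face PEFT's LoraConfig; the model name and target-module list are illustrative assumptions, not recommendations.

```python
# Minimal LoRA setup sketch with Hugging Face PEFT.
# Model name and target modules are placeholders -- adapt to your setup.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices
    lora_alpha=32,             # scaling factor, here 2x the rank
    lora_dropout=0.05,         # regularization on the adapter path
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically a small fraction of total parameters
```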
Advantages:
- Trains on single GPU (even consumer GPUs for smaller models)
- Multiple adapters can be merged or swapped
- Base model stays frozen—no catastrophic forgetting of general capabilities
Disadvantages:
- Slightly lower quality ceiling than full fine-tuning for large datasets
- Requires careful rank selection for optimal results
QLoRA (Quantized LoRA)
QLoRA quantizes the pretrained weights to 4-bit precision and uses paged optimizers to handle memory spikes.
How it works: The base model is loaded in 4-bit precision (using NormalFloat4 quantization), dramatically reducing memory requirements. LoRA adapters are trained in higher precision, then combined with the quantized base for inference.
Memory savings: QLoRA saves roughly 33% of GPU memory compared to standard LoRA, and the 4-bit base weights shrink the footprint enough to fine-tune 65-70B parameter models on a single 48GB GPU.
Tradeoffs: QLoRA increases training runtime by roughly 39%, due to the extra quantization and dequantization work during training. The quality difference from standard LoRA is typically minimal.
When to use: If your base model already fits comfortably in GPU memory, start with LoRA; if you need to squeeze very large models onto limited VRAM, go QLoRA.
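As a sketch of the 4-bit loading step, here is how QLoRA is typically set up with transformers, bitsandbytes, and PEFT; the model name is a placeholder and the exact settings are assumptions to adapt.

```python
# Minimal QLoRA loading sketch: 4-bit base model + LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",    # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepare the quantized model for training
# Then attach LoRA adapters with get_peft_model(...) as in the LoRA example above.
```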
Full Fine-Tuning
Full fine-tuning involves training the entire model, meaning all layers are adjusted during training.
When it makes sense:
- Large training datasets (50k+ examples)
- Significant distribution shift from pretraining data
- Maximum quality is critical and compute is not constrained
- You need to modify fundamental model behaviors
Resource requirements: Full fine-tuning requires roughly 4x the memory of inference (for gradients, optimizer states, etc.). A 7B model needs ~56GB for full fine-tuning; 70B models need hundreds of GB.
A common mistake is jumping straight into full fine-tuning, which is compute-heavy. Test with LoRA first.
DoRA (Weight-Decomposed Low-Rank Adaptation)
DoRA is an NVIDIA-developed improvement over LoRA that decomposes pretrained weights into magnitude and direction components, achieving quality closer to full fine-tuning.
How it works: DoRA can be described in two steps:
- Decompose pretrained weight matrix into a magnitude vector (m) and directional matrix (V)
- Apply LoRA to the directional matrix V while training the magnitude vector m separately
Why it works: The DoRA authors found that LoRA increases or decreases magnitude and direction updates proportionally, but lacks the capability to make subtle directional changes that full fine-tuning achieves. DoRA decouples these, enabling more precise adaptation.
Performance improvements:
- +3.7/+1.0 on Llama 7B/13B for common-sense reasoning
- +2.9 on Llama 2 7B, +4.4 on Llama 3 8B
- Improvements on vision-language tasks (+0.9/+1.9 on VL-BART)
Overhead: Introducing the magnitude vector m adds only 0.01% more parameters compared to LoRA—negligible overhead for meaningful quality improvement.
Hyperparameter recommendations:
- Start with a slightly lower learning rate than you would use for LoRA
- Experiment with varying dropout ratios
- Can often achieve comparable results with half the rank of LoRA
When to use: DoRA is recommended when you want LoRA's efficiency but need quality closer to full fine-tuning. It's now supported in Hugging Face PEFT and major fine-tuning frameworks.
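As an illustration, Hugging Face PEFT exposes DoRA as a flag on the standard LoRA configuration; the values below follow the recommendations above and are assumptions, not tuned settings.

```python
# Enabling DoRA in PEFT is a one-flag change to the LoRA configuration (sketch).
from peft import LoraConfig

dora_config = LoraConfig(
    r=8,                 # DoRA can often use a lower rank than LoRA
    lora_alpha=16,
    use_dora=True,       # decompose weights into magnitude and direction
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```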
Method Comparison
| Method | Memory | Quality | Speed | Parameters | Use Case |
|---|---|---|---|---|---|
| LoRA | Low | Good | Fast | 1-2% | Default choice |
| DoRA | Low | Very Good | Fast | 1-2% + 0.01% | Higher quality than LoRA |
| QLoRA | Very Low | Good | Medium | 1-2% | Large models on limited VRAM |
| Full FT | Very High | Best | Slow | 100% | Large datasets, max quality |
Dataset Formats
Different fine-tuning frameworks expect data in specific formats. Understanding these formats prevents frustrating preprocessing errors.
Alpaca Format
The original Stanford Alpaca format, widely supported:
{
"instruction": "Summarize the following text",
"input": "The quick brown fox...",
"output": "A fox jumped over a dog."
}
Fields:
- instruction: The task description
- input: Optional additional context
- output: Expected response
Best for: Single-turn instruction following tasks.
ShareGPT Format
Multi-turn conversation format, used by many chat models:
{
"conversations": [
{"from": "human", "value": "What is machine learning?"},
{"from": "gpt", "value": "Machine learning is..."},
{"from": "human", "value": "Can you give an example?"},
{"from": "gpt", "value": "Sure, consider..."}
]
}
Variations:
- from can be human/gpt, user/assistant, or system/user/assistant
- Some formats use role/content instead of from/value
Best for: Multi-turn chat fine-tuning, conversational models.
OpenAI Chat Format
The format used by OpenAI's fine-tuning API:
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "2+2 equals 4."}
]
}
Best for: Compatibility with OpenAI fine-tuning, chat models following OpenAI conventions.
Completion Format
Simple input-output pairs for completion tasks:
{
"prompt": "Translate to French: Hello",
"completion": "Bonjour"
}
Best for: Simple completion tasks, base model fine-tuning.
Format Conversion Tips
Framework requirements:
- Axolotl: Supports Alpaca, ShareGPT, completion formats natively
- LLaMA-Factory: GUI-based format selection, supports most formats
- Unsloth: Expects chat template format, auto-converts common formats
- TRL SFTTrainer: Flexible, accepts most formats with proper configuration
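If your data is in one format and your framework expects another, a small conversion script is usually enough. Here is a minimal sketch mapping ShareGPT-style records to the OpenAI chat format; field names follow the examples above, and the default system prompt is an assumption.

```python
# Minimal ShareGPT -> OpenAI chat format conversion (sketch).
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_openai(record: dict, system_prompt: str | None = None) -> dict:
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for turn in record["conversations"]:
        messages.append({"role": ROLE_MAP[turn["from"]], "content": turn["value"]})
    return {"messages": messages}

example = {
    "conversations": [
        {"from": "human", "value": "What is machine learning?"},
        {"from": "gpt", "value": "Machine learning is..."},
    ]
}
print(sharegpt_to_openai(example, system_prompt="You are a helpful assistant."))
```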
Common pitfalls:
- Inconsistent field names across examples
- Missing required fields
- Incorrect chat template application
- Wrong special tokens
Validation: Always validate a sample of your dataset before training (a minimal checker is sketched after this checklist). Check that:
- All required fields are present
- Field names are consistent
- Content is properly escaped
- No truncation will occur during tokenization
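A minimal checker along these lines might look as follows; the file path, tokenizer, and maximum length are assumptions to adapt to your setup.

```python
# Minimal validation pass over a JSONL dataset in OpenAI chat format (sketch).
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder
MAX_LEN = 4096  # assumed training sequence length

with open("train.jsonl") as f:
    for i, line in enumerate(f):
        ex = json.loads(line)                      # fails loudly on malformed JSON
        assert "messages" in ex, f"row {i}: missing 'messages'"
        for msg in ex["messages"]:
            assert set(msg) == {"role", "content"}, f"row {i}: inconsistent fields"
        n_tokens = len(tokenizer.apply_chat_template(ex["messages"], tokenize=True))
        if n_tokens > MAX_LEN:
            print(f"row {i}: {n_tokens} tokens, will be truncated at {MAX_LEN}")
```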
Hyperparameter Selection
Learning Rate
Learning rate is the most important hyperparameter. Too high causes instability; too low causes slow convergence.
Starting points:
- LoRA: 1e-4 to 3e-4
- Full fine-tuning: 1e-5 to 5e-5
Scheduling: Use linear warmup (5-10% of steps) followed by cosine decay or linear decay.
Batch Size
Larger batch sizes provide more stable gradients but require more memory.
Effective batch size: Accumulate gradients across multiple small batches to simulate larger batches. effective_batch = batch_size * gradient_accumulation_steps * num_gpus
Typical values: Effective batch sizes of 32-128 work well for most tasks.
Epochs
How many times to iterate through the dataset.
Typical values: 1-5 epochs. More epochs risk overfitting; fewer epochs risk underfitting.
Early stopping: Monitor validation loss and stop when it starts increasing (overfitting signal).
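Pulling these together, here is a sketch of the hyperparameters above expressed as Hugging Face TrainingArguments for a LoRA run; the values are illustrative starting points, not tuned recommendations.

```python
# Illustrative TrainingArguments for a LoRA run (sketch, not tuned values).
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,                 # LoRA range 1e-4 to 3e-4 (use ~1e-5 to 5e-5 for full FT)
    lr_scheduler_type="cosine",         # cosine decay after warmup
    warmup_ratio=0.05,                  # 5% of steps as linear warmup
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,      # effective batch = 4 * 8 * num_gpus
    num_train_epochs=3,
    eval_strategy="steps",              # named evaluation_strategy in older transformers releases
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
)
# Pass callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] to the Trainer
# to stop once validation loss stops improving.
```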
LoRA-Specific Parameters
The QLoRA authors found that applying LoRA adapters to all linear layers of each transformer block, not just the query, key, and value projections, is needed to match full fine-tuning quality.
Rank (r): Start with 8-16. Increase if underfitting; decrease if overfitting or for smaller adapters.
Target modules: For best quality, apply to all linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). For smaller adapters, attention-only (q_proj, v_proj) often suffices.
The Fine-Tuning Pipeline
Stage 1: Dataset Preparation
- Collect and curate examples
- Clean and validate data
- Format for training (instruction format or chat format)
- Split into train/validation (80/20 or 90/10; a split sketch follows this list)
- Inspect samples to verify quality
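For the split and spot-check steps, here is a minimal sketch using the Hugging Face datasets library; the file name is a placeholder for your curated dataset.

```python
# Train/validation split and quick spot-check (sketch).
from datasets import load_dataset

dataset = load_dataset("json", data_files="curated_examples.jsonl", split="train")
splits = dataset.train_test_split(test_size=0.1, seed=42)   # 90/10 split
train_ds, val_ds = splits["train"], splits["test"]

# Inspect a few samples before training
for example in train_ds.select(range(3)):
    print(example)
```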
Stage 2: Model Selection
Choose your base model based on:
- Required capabilities (reasoning, coding, multilingual)
- Size constraints (inference latency, memory)
- License requirements (commercial use, modifications)
Starting from an instruction-tuned (Instruct) model is usually recommended: it can be fine-tuned directly with conversational chat templates and typically needs less data than a base model.
Stage 3: Training Environment Setup
Set up training infrastructure:
- GPU selection (A100, H100 for production; consumer GPUs for experimentation)
- Framework selection (see Tools section)
- Experiment tracking (Weights & Biases, MLflow)
Stage 4: Training
Run training with monitoring:
- Track training loss (should decrease smoothly)
- Track validation loss (should decrease, then plateau)
- Watch for divergence or oscillation (learning rate too high)
- Save checkpoints regularly
Stage 5: Evaluation
Evaluate the fine-tuned model:
- Task-specific metrics (accuracy, F1, BLEU, ROUGE)
- Human evaluation for quality assessment
- Comparison against base model and prompting baselines
- Regression testing on general capabilities
Stage 6: Deployment
Deploy the model:
- Merge LoRA adapters into the base model (optional, simplifies serving; see the sketch after this list)
- Quantize for inference efficiency
- Set up serving infrastructure
- Monitor production metrics
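For the optional adapter-merge step, a minimal sketch with PEFT; the model name and checkpoint paths are placeholders.

```python
# Merge a trained LoRA adapter into the base model for serving (sketch).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "out/checkpoint-best")  # load adapter weights
merged = model.merge_and_unload()                               # fold the adapter into the base weights
merged.save_pretrained("merged-model")                          # serve like any standard checkpoint
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("merged-model")
```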
Stage 7: Monitoring and Iteration
Continuously improve:
- Collect production feedback
- Identify failure cases
- Add examples addressing failures
- Retrain periodically
Tools and Frameworks
Unsloth
Unsloth provides dramatically faster training through custom CUDA kernels, with 2-5x speedup and 60% memory reduction compared to standard implementations.
Strengths: Fastest training speeds, excellent documentation, easy setup.
Best for: Rapid experimentation, teams wanting maximum training efficiency.
Axolotl
Axolotl provides a config-driven approach to fine-tuning with extensive customization options.
Strengths: Highly configurable, supports many model architectures, active community.
Best for: Teams needing flexibility and customization, complex training setups.
LLaMA-Factory
LLaMA-Factory offers a web UI for fine-tuning, making it accessible to non-experts.
Strengths: User-friendly interface, broad model support, integrated evaluation.
Best for: Teams new to fine-tuning, rapid prototyping.
Hugging Face PEFT
PEFT (Parameter-Efficient Fine-Tuning) is the foundational library for LoRA and other efficient fine-tuning methods.
Strengths: Industry standard, extensive documentation, broad compatibility.
Best for: Production deployments, teams wanting full control.
Tool Selection
The best GenAI fine-tuning tools in 2025 combine efficient training methods with robust experiment tracking and scalable serving infrastructure.
| Need | Recommended Tool |
|---|---|
| Maximum speed | Unsloth |
| Maximum flexibility | Axolotl |
| Ease of use | LLaMA-Factory |
| Production control | Hugging Face PEFT |
Common Pitfalls
Overfitting
Symptoms: Training loss continues decreasing but validation loss increases. Model memorizes training data rather than learning patterns.
Solutions:
- Add more diverse training data
- Reduce epochs
- Increase dropout
- Reduce LoRA rank
- Use early stopping
Catastrophic Forgetting
Symptoms: Model loses general capabilities after fine-tuning. Performs well on fine-tuning tasks but poorly on general tasks.
Solutions:
- Use LoRA (keeps base model frozen)
- Mix general examples with task-specific examples
- Use lower learning rates
- Train for fewer epochs
Format Learning Instead of Task Learning
Symptoms: Model learns to produce outputs in the right format but with wrong content.
Solutions:
- Ensure training data has diverse content
- Include negative examples
- Evaluate on held-out content, not just format
Data Contamination
Symptoms: Suspiciously high evaluation scores. Model may have seen evaluation data during training.
Solutions:
- Carefully separate train/eval data
- Use temporal splits when possible
- Create novel evaluation examples
Evaluation Strategies
Automated Metrics
Task-specific metrics: Accuracy for classification, F1 for extraction, BLEU/ROUGE for generation.
Perplexity: Lower is better. Compare against base model to ensure improvement.
Format compliance: What percentage of outputs match required format?
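Format compliance is easy to measure directly. Here is a minimal sketch for JSON outputs; the required keys are a hypothetical example schema, not part of any standard.

```python
# Fraction of model outputs that parse as JSON with the required keys (sketch).
import json

REQUIRED_KEYS = {"summary", "risk_level"}   # hypothetical schema for illustration

def format_compliance(outputs: list[str]) -> float:
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
            if isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys():
                ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs) if outputs else 0.0

print(format_compliance(['{"summary": "ok", "risk_level": "low"}', "not json"]))  # 0.5
```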
LLM-as-Judge
Use a capable model (GPT-4, Claude) to evaluate fine-tuned model outputs:
- Rate response quality on multiple dimensions
- Compare fine-tuned vs. base model outputs
- Identify specific failure modes
Human Evaluation
A/B testing: Show evaluators outputs from fine-tuned and base models. Which do they prefer?
Rating scales: Rate specific dimensions (accuracy, helpfulness, fluency).
Error analysis: Categorize failures to guide data collection.
Regression Testing
Ensure fine-tuning didn't break general capabilities:
- Test on standard benchmarks (MMLU, HumanEval)
- Compare against base model
- Acceptable degradation depends on use case
Related Articles
SFT Deep Dive: Instruction Tuning Techniques and Best Practices
A comprehensive guide to Supervised Fine-Tuning (SFT) for LLMs—covering full fine-tuning vs LoRA vs QLoRA vs DoRA, data curation strategies, instruction formats, multi-task learning, and avoiding catastrophic forgetting.
RLHF Complete Guide: Aligning LLMs with Human Preferences
A comprehensive deep dive into Reinforcement Learning from Human Feedback—from reward modeling to PPO to DPO. Understanding how AI assistants learn to be helpful, harmless, and honest.
RL Algorithms for LLM Training: PPO, GRPO, GSPO, and Beyond
A comprehensive guide to reinforcement learning algorithms for LLM alignment—PPO, GRPO, GSPO, REINFORCE++, DPO, and their variants. Understanding the tradeoffs that power modern AI assistants.
HuggingFace TRL: A Deep Dive into the Transformer Reinforcement Learning Library
A comprehensive exploration of HuggingFace TRL's architecture—examining its trainer ecosystem from SFT to GRPO, data collators, reward functions, vLLM integration, and the internals that power modern LLM fine-tuning workflows.