Fine-Tuning Workflows & Best Practices: A Practical Guide for LLM Customization
Comprehensive guide to fine-tuning LLMs including LoRA, QLoRA, and full fine-tuning. Covers data preparation, hyperparameter selection, evaluation strategies, common pitfalls, and 2025 tools like Unsloth, Axolotl, and LLaMA-Factory.
Fine-tuning transforms general-purpose LLMs into specialized tools for your specific use case. A 7B model fine-tuned on your domain data can outperform a 70B general model for your tasks—while being faster and cheaper to run. But fine-tuning done poorly wastes compute, degrades quality, or creates models that fail in production.
This guide provides practical workflows for fine-tuning LLMs: when to fine-tune, data preparation, choosing between LoRA and full fine-tuning, hyperparameter selection, evaluation, and the 2025 tool landscape.
When to Fine-Tune
Fine-tuning is powerful but not always necessary. When you do fine-tune, start by testing with LoRA or QLoRA: if a task doesn't improve with a parameter-efficient method, it is unlikely to improve much with full fine-tuning either.
Good Candidates for Fine-Tuning
Domain-specific vocabulary and concepts: Medical, legal, financial, and technical domains have specialized terminology that general models handle poorly. Fine-tuning teaches the model your domain's language.
Consistent output format: If you need structured outputs in a specific format—particular JSON schemas, report templates, or code styles—fine-tuning enforces consistency better than prompting.
Persona and tone: Enterprise assistants, brand voices, and specialized interaction styles benefit from fine-tuning. The model internalizes the persona rather than requiring elaborate prompts.
Task-specific behavior: When you need the model to excel at a narrow task (classification, extraction, summarization in your domain), fine-tuning focuses its capabilities.
Poor Candidates for Fine-Tuning
Knowledge injection: Fine-tuning doesn't reliably teach new facts. For knowledge updates, use RAG instead.
Simple instructions: If few-shot prompting achieves your goal, fine-tuning adds unnecessary complexity.
Rapidly changing requirements: Fine-tuning creates a static model. If your needs change frequently, prompt engineering is more agile.
Insufficient data: Fine-tuning needs quality examples. If you have fewer than a few hundred examples, focus on prompting first.
The Decision Framework
Start small: 2-5k quality examples + LoRA + RAG will get you 80% of value quickly. Only escalate to full fine-tuning if LoRA proves insufficient.
Data Preparation
Data quality and formatting matter more than most hyperparameters. Poor data produces poor models regardless of compute invested.
Dataset Structure
Fine-tuning datasets typically have two columns: input (instruction/question) and output (response). For chat models, use conversation format with alternating user/assistant messages.
Instruction format:
Input: "Summarize this contract clause in plain English: [clause text]"
Output: "This clause means that..."
Chat format:
User: "What does this clause mean?"
Assistant: "This clause establishes..."
User: "What are the implications?"
Assistant: "The main implications are..."
Quality Guidelines
Quality matters far more than quantity, with ~1,000 high-quality examples often sufficient.
Accuracy: Every example should demonstrate correct behavior. Wrong examples teach wrong behavior.
Diversity: Cover the range of scenarios the model will encounter. Don't over-represent any single pattern.
Representativeness: Examples should match production input distribution. If users ask questions informally, training data should include informal questions.
Completeness: Outputs should be complete, not truncated. Incomplete outputs teach the model to generate incomplete responses.
Data Cleaning
Remove duplicates: Exact and near-duplicates inflate training and can cause overfitting.
Check for PII: Remove personally identifiable information unless necessary for the task.
Validate format: Ensure consistent formatting across examples. Inconsistent formatting confuses the model.
Filter low quality: Remove examples that are ambiguous, incorrect, or poorly written.
Dataset Size Guidelines
As a rule of thumb, prepare at least ~1,000 examples, prioritize quality over quantity, and hold out a validation set (e.g., an 80/20 train/validation split).
| Use Case | Recommended Size (examples) |
|---|---|
| Style/tone adaptation | 500-2,000 |
| Task specialization | 1,000-5,000 |
| Domain adaptation | 5,000-20,000 |
| New capability | 10,000+ |
More data helps, but with diminishing returns. High-quality data is critical—10,000 quality examples outperform 100,000 noisy examples.
Fine-Tuning Methods
LoRA (Low-Rank Adaptation)
LoRA adds tiny low-rank "adapters" to a frozen model so you only train a sliver of parameters. It's fast, modular, and usually within striking distance of full fine-tuning quality.
How it works: Instead of updating all model weights, LoRA adds small trainable matrices to specific layers (typically attention layers). These matrices have low rank, meaning they have far fewer parameters than the original weight matrices. For example, fine-tuning a 13B-parameter model with LoRA may only require updating 1-2% of parameters.
Key hyperparameters (a configuration sketch follows this list):
- Rank (r): Dimension of the low-rank matrices. Higher rank = more capacity but more parameters. Start with r=8 or r=16.
- Alpha: Scaling factor for LoRA updates. Typically set to 2x rank (alpha=16 for r=8).
- Target modules: Which layers to apply LoRA to. Default is attention layers; applying to all linear layers can improve quality.
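For concreteness, here is a minimal sketch of how these hyperparameters map onto Hugging Face PEFT's LoraConfig; the model name and target-module list are illustrative assumptions, not recommendations.

```python
# Minimal LoRA setup sketch with Hugging Face PEFT.
# Model name and target modules are placeholders -- adapt to your setup.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices
    lora_alpha=32,             # scaling factor, here 2x the rank
    lora_dropout=0.05,         # regularization on the adapter path
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically a small fraction of total parameters
```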
Advantages:
- Trains on single GPU (even consumer GPUs for smaller models)
- Multiple adapters can be merged or swapped
- Base model stays frozen—no catastrophic forgetting of general capabilities
Disadvantages:
- Slightly lower quality ceiling than full fine-tuning for large datasets
- Requires careful rank selection for optimal results
QLoRA (Quantized LoRA)
QLoRA quantizes the pretrained weights to 4-bit precision and uses paged optimizers to handle memory spikes.
How it works: The base model is loaded in 4-bit precision (using NormalFloat4 quantization), dramatically reducing memory requirements. LoRA adapters are trained in higher precision, then combined with the quantized base for inference.
Memory savings: QLoRA saves roughly 33% of GPU memory compared to standard LoRA, and the 4-bit base weights shrink the footprint enough to fine-tune 65-70B parameter models on a single 48GB GPU.
Tradeoffs: QLoRA increases training runtime by roughly 39%, due to the extra quantization and dequantization work during training. The quality difference from standard LoRA is typically minimal.
When to use: If your base model already fits comfortably in GPU memory, start with LoRA; if you need to squeeze very large models onto limited VRAM, go QLoRA.
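As a sketch of the 4-bit loading step, here is how QLoRA is typically set up with transformers, bitsandbytes, and PEFT; the model name is a placeholder and the exact settings are assumptions to adapt.

```python
# Minimal QLoRA loading sketch: 4-bit base model + LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",    # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepare the quantized model for training
# Then attach LoRA adapters with get_peft_model(...) as in the LoRA example above.
```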
Full Fine-Tuning
Full fine-tuning involves training the entire model, meaning all layers are adjusted during training.
When it makes sense:
- Large training datasets (50k+ examples)
- Significant distribution shift from pretraining data
- Maximum quality is critical and compute is not constrained
- You need to modify fundamental model behaviors
Resource requirements: Full fine-tuning requires roughly 4x the memory of inference (for gradients, optimizer states, etc.). A 7B model needs ~56GB for full fine-tuning; 70B models need hundreds of GB.
A common mistake is jumping straight into full fine-tuning, which is compute-heavy. Test with LoRA first.
DoRA (Weight-Decomposed Low-Rank Adaptation)
DoRA is an NVIDIA-developed improvement over LoRA that decomposes pretrained weights into magnitude and direction components, achieving quality closer to full fine-tuning.
How it works: DoRA can be described in two steps:
- Decompose pretrained weight matrix into a magnitude vector (m) and directional matrix (V)
- Apply LoRA to the directional matrix V while training the magnitude vector m separately
Why it works: The DoRA authors found that LoRA increases or decreases magnitude and direction updates proportionally, but lacks the capability to make subtle directional changes that full fine-tuning achieves. DoRA decouples these, enabling more precise adaptation.
Performance improvements:
- +3.7/+1.0 on Llama 7B/13B for common-sense reasoning
- +2.9 on Llama 2 7B, +4.4 on Llama 3 8B
- Improvements on vision-language tasks (+0.9/+1.9 on VL-BART)
Overhead: Introducing the magnitude vector m adds only 0.01% more parameters compared to LoRA—negligible overhead for meaningful quality improvement.
Hyperparameter recommendations:
- Start with a slightly lower learning rate than you would use for LoRA
- Experiment with varying dropout ratios
- Can often achieve comparable results with half the rank of LoRA
When to use: DoRA is recommended when you want LoRA's efficiency but need quality closer to full fine-tuning. It's now supported in Hugging Face PEFT and major fine-tuning frameworks.
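As an illustration, Hugging Face PEFT exposes DoRA as a flag on the standard LoRA configuration; the values below follow the recommendations above and are assumptions, not tuned settings.

```python
# Enabling DoRA in PEFT is a one-flag change to the LoRA configuration (sketch).
from peft import LoraConfig

dora_config = LoraConfig(
    r=8,                 # DoRA can often use a lower rank than LoRA
    lora_alpha=16,
    use_dora=True,       # decompose weights into magnitude and direction
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```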
Method Comparison
| Method | Memory | Quality | Speed | Parameters | Use Case |
|---|---|---|---|---|---|
| LoRA | Low | Good | Fast | 1-2% | Default choice |
| DoRA | Low | Very Good | Fast | 1-2% + 0.01% | Higher quality than LoRA |
| QLoRA | Very Low | Good | Medium | 1-2% | Large models on limited VRAM |
| Full FT | Very High | Best | Slow | 100% | Large datasets, max quality |
Dataset Formats
Different fine-tuning frameworks expect data in specific formats. Understanding these formats prevents frustrating preprocessing errors.
Alpaca Format
The original Stanford Alpaca format, widely supported:
{
"instruction": "Summarize the following text",
"input": "The quick brown fox...",
"output": "A fox jumped over a dog."
}
Fields:
- instruction: The task description
- input: Optional additional context
- output: Expected response
Best for: Single-turn instruction following tasks.
ShareGPT Format
Multi-turn conversation format, used by many chat models:
{
"conversations": [
{"from": "human", "value": "What is machine learning?"},
{"from": "gpt", "value": "Machine learning is..."},
{"from": "human", "value": "Can you give an example?"},
{"from": "gpt", "value": "Sure, consider..."}
]
}
Variations:
- from can be human/gpt, user/assistant, or system/user/assistant
- Some formats use role/content instead of from/value
Best for: Multi-turn chat fine-tuning, conversational models.
OpenAI Chat Format
The format used by OpenAI's fine-tuning API:
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "2+2 equals 4."}
]
}
Best for: Compatibility with OpenAI fine-tuning, chat models following OpenAI conventions.
Completion Format
Simple input-output pairs for completion tasks:
{
"prompt": "Translate to French: Hello",
"completion": "Bonjour"
}
Best for: Simple completion tasks, base model fine-tuning.
Format Conversion Tips
Framework requirements:
- Axolotl: Supports Alpaca, ShareGPT, completion formats natively
- LLaMA-Factory: GUI-based format selection, supports most formats
- Unsloth: Expects chat template format, auto-converts common formats
- TRL SFTTrainer: Flexible, accepts most formats with proper configuration
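If your data is in one format and your framework expects another, a small conversion script is usually enough. Here is a minimal sketch mapping ShareGPT-style records to the OpenAI chat format; field names follow the examples above, and the default system prompt is an assumption.

```python
# Minimal ShareGPT -> OpenAI chat format conversion (sketch).
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_openai(record: dict, system_prompt: str | None = None) -> dict:
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for turn in record["conversations"]:
        messages.append({"role": ROLE_MAP[turn["from"]], "content": turn["value"]})
    return {"messages": messages}

example = {
    "conversations": [
        {"from": "human", "value": "What is machine learning?"},
        {"from": "gpt", "value": "Machine learning is..."},
    ]
}
print(sharegpt_to_openai(example, system_prompt="You are a helpful assistant."))
```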
Common pitfalls:
- Inconsistent field names across examples
- Missing required fields
- Incorrect chat template application
- Wrong special tokens
Validation: Always validate a sample of your dataset before training (a minimal checker is sketched after this checklist). Check that:
- All required fields are present
- Field names are consistent
- Content is properly escaped
- No truncation will occur during tokenization
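A minimal checker along these lines might look as follows; the file path, tokenizer, and maximum length are assumptions to adapt to your setup.

```python
# Minimal validation pass over a JSONL dataset in OpenAI chat format (sketch).
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder
MAX_LEN = 4096  # assumed training sequence length

with open("train.jsonl") as f:
    for i, line in enumerate(f):
        ex = json.loads(line)                      # fails loudly on malformed JSON
        assert "messages" in ex, f"row {i}: missing 'messages'"
        for msg in ex["messages"]:
            assert set(msg) == {"role", "content"}, f"row {i}: inconsistent fields"
        n_tokens = len(tokenizer.apply_chat_template(ex["messages"], tokenize=True))
        if n_tokens > MAX_LEN:
            print(f"row {i}: {n_tokens} tokens, will be truncated at {MAX_LEN}")
```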
Hyperparameter Selection
Learning Rate
Learning rate is the most important hyperparameter. Too high causes instability; too low causes slow convergence.
Starting points:
- LoRA: 1e-4 to 3e-4
- Full fine-tuning: 1e-5 to 5e-5
Scheduling: Use linear warmup (5-10% of steps) followed by cosine decay or linear decay.
Batch Size
Larger batch sizes provide more stable gradients but require more memory.
Effective batch size: Accumulate gradients across multiple small batches to simulate larger batches. effective_batch = batch_size * gradient_accumulation_steps * num_gpus
Typical values: Effective batch sizes of 32-128 work well for most tasks.
Epochs
How many times to iterate through the dataset.
Typical values: 1-5 epochs. More epochs risk overfitting; fewer epochs risk underfitting.
Early stopping: Monitor validation loss and stop when it starts increasing (overfitting signal).
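Pulling these together, here is a sketch of the hyperparameters above expressed as Hugging Face TrainingArguments for a LoRA run; the values are illustrative starting points, not tuned recommendations.

```python
# Illustrative TrainingArguments for a LoRA run (sketch, not tuned values).
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,                 # LoRA range 1e-4 to 3e-4 (use ~1e-5 to 5e-5 for full FT)
    lr_scheduler_type="cosine",         # cosine decay after warmup
    warmup_ratio=0.05,                  # 5% of steps as linear warmup
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,      # effective batch = 4 * 8 * num_gpus
    num_train_epochs=3,
    eval_strategy="steps",              # named evaluation_strategy in older transformers releases
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
)
# Pass callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] to the Trainer
# to stop once validation loss stops improving.
```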
LoRA-Specific Parameters
The QLoRA authors found that applying LoRA adapters to all linear layers of each transformer block, not just the query, key, and value projections, is needed to match full fine-tuning quality.
Rank (r): Start with 8-16. Increase if underfitting; decrease if overfitting or for smaller adapters.
Target modules: For best quality, apply to all linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). For smaller adapters, attention-only (q_proj, v_proj) often suffices.
The Fine-Tuning Pipeline
Stage 1: Dataset Preparation
- Collect and curate examples
- Clean and validate data
- Format for training (instruction format or chat format)
- Split into train/validation (80/20 or 90/10; a split sketch follows this list)
- Inspect samples to verify quality
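For the split and spot-check steps, here is a minimal sketch using the Hugging Face datasets library; the file name is a placeholder for your curated dataset.

```python
# Train/validation split and quick spot-check (sketch).
from datasets import load_dataset

dataset = load_dataset("json", data_files="curated_examples.jsonl", split="train")
splits = dataset.train_test_split(test_size=0.1, seed=42)   # 90/10 split
train_ds, val_ds = splits["train"], splits["test"]

# Inspect a few samples before training
for example in train_ds.select(range(3)):
    print(example)
```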
Stage 2: Model Selection
Choose your base model based on:
- Required capabilities (reasoning, coding, multilingual)
- Size constraints (inference latency, memory)
- License requirements (commercial use, modifications)
Starting from an instruction-tuned (Instruct) model is usually recommended: it can be fine-tuned directly with conversational chat templates and typically needs less data than a base model.
Stage 3: Training Environment Setup
Set up training infrastructure:
- GPU selection (A100, H100 for production; consumer GPUs for experimentation)
- Framework selection (see Tools section)
- Experiment tracking (Weights & Biases, MLflow)
Stage 4: Training
Run training with monitoring:
- Track training loss (should decrease smoothly)
- Track validation loss (should decrease, then plateau)
- Watch for divergence or oscillation (learning rate too high)
- Save checkpoints regularly
Stage 5: Evaluation
Evaluate the fine-tuned model:
- Task-specific metrics (accuracy, F1, BLEU, ROUGE)
- Human evaluation for quality assessment
- Comparison against base model and prompting baselines
- Regression testing on general capabilities
Stage 6: Deployment
Deploy the model:
- Merge LoRA adapters into the base model (optional, simplifies serving; see the sketch after this list)
- Quantize for inference efficiency
- Set up serving infrastructure
- Monitor production metrics
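For the optional adapter-merge step, a minimal sketch with PEFT; the model name and checkpoint paths are placeholders.

```python
# Merge a trained LoRA adapter into the base model for serving (sketch).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "out/checkpoint-best")  # load adapter weights
merged = model.merge_and_unload()                               # fold the adapter into the base weights
merged.save_pretrained("merged-model")                          # serve like any standard checkpoint
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("merged-model")
```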
Stage 7: Monitoring and Iteration
Continuously improve:
- Collect production feedback
- Identify failure cases
- Add examples addressing failures
- Retrain periodically
Tools and Frameworks
Unsloth
Unsloth provides dramatically faster training through custom CUDA kernels, with 2-5x speedup and 60% memory reduction compared to standard implementations.
Strengths: Fastest training speeds, excellent documentation, easy setup.
Best for: Rapid experimentation, teams wanting maximum training efficiency.
Axolotl
Axolotl provides a config-driven approach to fine-tuning with extensive customization options.
Strengths: Highly configurable, supports many model architectures, active community.
Best for: Teams needing flexibility and customization, complex training setups.
LLaMA-Factory
LLaMA-Factory offers a web UI for fine-tuning, making it accessible to non-experts.
Strengths: User-friendly interface, broad model support, integrated evaluation.
Best for: Teams new to fine-tuning, rapid prototyping.
Hugging Face PEFT
PEFT (Parameter-Efficient Fine-Tuning) is the foundational library for LoRA and other efficient fine-tuning methods.
Strengths: Industry standard, extensive documentation, broad compatibility.
Best for: Production deployments, teams wanting full control.
Tool Selection
The best GenAI fine-tuning tools in 2025 combine efficient training methods with robust experiment tracking and scalable serving infrastructure.
| Need | Recommended Tool |
|---|---|
| Maximum speed | Unsloth |
| Maximum flexibility | Axolotl |
| Ease of use | LLaMA-Factory |
| Production control | Hugging Face PEFT |
Common Pitfalls
Overfitting
Symptoms: Training loss continues decreasing but validation loss increases. Model memorizes training data rather than learning patterns.
Solutions:
- Add more diverse training data
- Reduce epochs
- Increase dropout
- Reduce LoRA rank
- Use early stopping
Catastrophic Forgetting
Symptoms: Model loses general capabilities after fine-tuning. Performs well on fine-tuning tasks but poorly on general tasks.
Solutions:
- Use LoRA (keeps base model frozen)
- Mix general examples with task-specific examples
- Use lower learning rates
- Train for fewer epochs
Format Learning Instead of Task Learning
Symptoms: Model learns to produce outputs in the right format but with wrong content.
Solutions:
- Ensure training data has diverse content
- Include negative examples
- Evaluate on held-out content, not just format
Data Contamination
Symptoms: Suspiciously high evaluation scores. Model may have seen evaluation data during training.
Solutions:
- Carefully separate train/eval data
- Use temporal splits when possible
- Create novel evaluation examples
Evaluation Strategies
Automated Metrics
Task-specific metrics: Accuracy for classification, F1 for extraction, BLEU/ROUGE for generation.
Perplexity: Lower is better. Compare against base model to ensure improvement.
Format compliance: What percentage of outputs match required format?
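Format compliance is easy to measure directly. Here is a minimal sketch for JSON outputs; the required keys are a hypothetical example schema, not part of any standard.

```python
# Fraction of model outputs that parse as JSON with the required keys (sketch).
import json

REQUIRED_KEYS = {"summary", "risk_level"}   # hypothetical schema for illustration

def format_compliance(outputs: list[str]) -> float:
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
            if isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys():
                ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs) if outputs else 0.0

print(format_compliance(['{"summary": "ok", "risk_level": "low"}', "not json"]))  # 0.5
```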
LLM-as-Judge
Use a capable model (GPT-4, Claude) to evaluate fine-tuned model outputs:
- Rate response quality on multiple dimensions
- Compare fine-tuned vs. base model outputs
- Identify specific failure modes
Human Evaluation
A/B testing: Show evaluators outputs from fine-tuned and base models. Which do they prefer?
Rating scales: Rate specific dimensions (accuracy, helpfulness, fluency).
Error analysis: Categorize failures to guide data collection.
Regression Testing
Ensure fine-tuning didn't break general capabilities:
- Test on standard benchmarks (MMLU, HumanEval)
- Compare against base model
- Acceptable degradation depends on use case
Related Articles
SFT Deep Dive: Instruction Tuning Techniques and Best Practices
A comprehensive guide to Supervised Fine-Tuning (SFT) for LLMs—covering full fine-tuning vs LoRA vs QLoRA vs DoRA, data curation strategies, instruction formats, multi-task learning, and avoiding catastrophic forgetting.
RLHF Complete Guide: Aligning LLMs with Human Preferences
A comprehensive deep dive into Reinforcement Learning from Human Feedback—from reward modeling to PPO to DPO. Understanding how AI assistants learn to be helpful, harmless, and honest.
RL Algorithms for LLM Training: PPO, GRPO, GSPO, and Beyond
A comprehensive guide to reinforcement learning algorithms for LLM alignment—PPO, GRPO, GSPO, REINFORCE++, DPO, and their variants. Understanding the tradeoffs that power modern AI assistants.
HuggingFace TRL: A Deep Dive into the Transformer Reinforcement Learning Library
A comprehensive exploration of HuggingFace TRL's architecture—examining its trainer ecosystem from SFT to GRPO, data collators, reward functions, vLLM integration, and the internals that power modern LLM fine-tuning workflows.