Fine-Tuning vs Prompting: When to Use Each
A practical guide to deciding between fine-tuning and prompt engineering for your LLM application, based on real-world experience with both approaches.
The Customization Spectrum
When you need an LLM to behave differently than its default, you have a spectrum of options:
- Zero-shot prompting: Just describe what you want
- Few-shot prompting: Provide examples in the prompt
- Prompt engineering: Craft detailed instructions, personas, and constraints
- RAG (Retrieval-Augmented Generation): Inject relevant knowledge at inference time
- Fine-tuning: Train the model on examples of desired behavior
- Continued pre-training: Train on large domain-specific corpora
Most production applications use a combination. The question isn't "prompting OR fine-tuning" but "how much of each?"
This post provides a decision framework based on our experience at Goji AI, where we've deployed both heavily-prompted base models and fine-tuned specialists across different use cases.
Understanding the Trade-offs
Prompting: The Default Choice
Prompting should be your starting point. Here's why:
Advantages:
- Instant iteration: Change prompts in seconds, no training required
- No data requirements: Works without training examples
- Uses frontier models: Access the most capable models immediately
- Flexibility: Different prompts for different contexts without multiple models
- No infrastructure: No GPU clusters, training pipelines, or model hosting
- Transparency: Easy to inspect and understand what the model sees
Limitations:
- Context window cost: Long prompts consume tokens on every request
- Consistency challenges: Hard to guarantee consistent output formats
- Knowledge boundaries: Can only use knowledge in the prompt + model's training
- Latency overhead: Processing long prompts adds latency
- Instruction following: Some behaviors are hard to specify in words
Fine-tuning: The Specialized Tool
Fine-tuning trains the model's weights on your data. It's more powerful but more expensive.
Advantages:
- Behavioral consistency: Learned behaviors are more reliable than instructed ones
- Token efficiency: No need for long system prompts
- Domain knowledge: Can incorporate specialized knowledge into weights
- Output format control: Trained formats are more consistent than prompted ones
- Capability boundaries: Can remove unwanted behaviors more reliably
- Latency improvement: Shorter prompts = faster inference
Limitations:
- Data requirements: Need quality training examples (typically 100-10,000)
- Training costs: Compute, time, and expertise to fine-tune
- Iteration speed: Days to iterate vs. seconds for prompts
- Model lock-in: Fine-tuned weights don't transfer between models
- Catastrophic forgetting: Risk of degrading general capabilities
- Evaluation complexity: Need to test both target task AND general abilities
The Decision Framework
The decision between prompting and fine-tuning isn't binary—it's a progression. Almost every project should start with prompting, and only move to fine-tuning when prompting hits clear limits. Here's why this order matters:
Prompting establishes your quality ceiling: Before fine-tuning, you need to know what's achievable with the base model. If prompting gets you 85% accuracy, fine-tuning might get you 95%. But if prompting only gets 50%, something is fundamentally wrong—either the task is too hard for the model, your evaluation is flawed, or you need a different approach entirely.
Prompting generates training data: Good prompting produces good outputs. These outputs can become training data for fine-tuning. If you fine-tune before prompting, you're creating training data from scratch. If you prompt first, you can filter your best prompt results as training examples.
Fine-tuning locks in decisions: A fine-tuned model encodes your current understanding of the task. Prompts can be changed instantly. Fine-tuned weights require retraining to update. Get the task definition right with prompting before committing to fine-tuning.
Start with Prompting If...
Your task is well-defined in natural language: If you can explain to a human what you want in a few paragraphs, the model can probably follow instructions. Most tasks fall into this category.
You need flexibility: Different users need different behaviors? Different contexts require different approaches? Prompts can be dynamically generated. Fine-tuned models are static.
You're still learning what works: Early in development, requirements change constantly. Prompting lets you iterate in minutes. Fine-tuning locks you into decisions.
You have limited training data: Fine-tuning requires examples. If you don't have them, prompting is your only option. Even with data, prompting establishes a baseline.
You're using frontier models: GPT-4, Claude 3 Opus, and Gemini Ultra are incredibly capable. These models often don't benefit from fine-tuning because their base capabilities are so strong. Prompting extracts most of their potential.
Cost flexibility matters: Prompting costs scale linearly with usage. Fine-tuning has upfront costs but lower per-query costs. For low-to-medium volume, prompting is cheaper.
Consider Fine-tuning If...
Output format must be perfectly consistent: If outputs feed into downstream systems that require exact schemas, fine-tuning on format examples beats even detailed format instructions. We've seen 95%+ format compliance from fine-tuning vs. 80-90% from prompting.
You have domain-specific terminology: Specialized vocabulary, jargon, or conventions that the base model handles poorly. Fine-tuning on domain text helps more than prompting.
You need to remove capabilities: If the model does things you don't want (refuses valid requests, adds unwanted caveats, uses disallowed formats), fine-tuning can suppress these more reliably than prompting.
Latency is critical: Long system prompts add 100-500ms of latency. Fine-tuned models with short prompts are faster. For real-time applications, this matters.
You're operating at high scale: At millions of queries per month, the token cost of long prompts adds up. Fine-tuning amortizes training cost across usage. Break-even depends on your volumes and prompt lengths.
You've exhausted prompt optimization: If you've iterated extensively on prompts and hit a quality ceiling, fine-tuning may break through. But make sure you've actually exhausted prompting first.
The RAG Question
RAG often competes with fine-tuning for knowledge injection:
| Approach | Best For | Drawbacks |
|---|---|---|
| Fine-tuning | Static domain knowledge, behavioral patterns | Knowledge can become stale, limited update flexibility |
| RAG | Dynamic knowledge, citations needed, large knowledge bases | Retrieval latency and errors, context window limits |
They're complementary. Use RAG for knowledge that changes or needs attribution. Use fine-tuning for stable behaviors and conventions.
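For contrast with baking knowledge into weights, here is a minimal sketch of the RAG side of the table: knowledge is fetched at query time and injected into the prompt. The `retrieve` and `call_model` functions are placeholders for your own retriever and LLM client, not calls from any specific library.

```python
# Minimal RAG-style injection sketch. `retrieve` and `call_model` are
# placeholders for your own vector store / search index and LLM client.
def answer_with_rag(query, retrieve, call_model, k=3):
    passages = retrieve(query, k=k)                     # fetch relevant documents
    context = "\n\n".join(p["text"] for p in passages)  # assumes passages carry a "text" field
    prompt = (
        "Answer the question using only the context below, "
        "and cite the passage you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_model(prompt)
```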
Prompt Engineering Deep Dive
Before fine-tuning, exhaust prompting. Here's how:
Structure Your Prompts
A production prompt has distinct sections:
```
[ROLE/PERSONA]
You are an expert tax accountant specializing in small business taxation...

[CONTEXT]
The user is a small business owner using our tax preparation software...

[TASK]
Answer the user's tax question accurately and helpfully...

[CONSTRAINTS]
- Only answer questions about US federal and state taxation
- Do not provide advice on audit defense or legal disputes
- Always recommend consulting a CPA for complex situations

[FORMAT]
Respond in 2-3 paragraphs. Use bullet points for lists of items.
End with a brief disclaimer about seeking professional advice.

[EXAMPLES]
User: Can I deduct my home office?
Assistant: Yes, if you use part of your home regularly and exclusively...

[CONVERSATION]
{conversation_history}

[CURRENT QUERY]
User: {user_message}
```
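In practice, a template like this is assembled at request time from its sections. Below is a minimal sketch of that assembly; the function and variable names are illustrative, not part of any particular framework.

```python
# Minimal sketch: assemble the sectioned prompt above at request time.
# All names here (build_prompt, persona, constraints, ...) are illustrative.
def build_prompt(persona, context, task, constraints, format_spec,
                 examples, conversation_history, user_message):
    constraint_text = "\n".join(f"- {c}" for c in constraints)
    example_text = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in examples)
    return (
        f"{persona}\n\n{context}\n\n{task}\n\n"
        f"Constraints:\n{constraint_text}\n\n"
        f"Format:\n{format_spec}\n\n"
        f"Examples:\n{example_text}\n\n"
        f"{conversation_history}\n\nUser: {user_message}"
    )
```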
Prompt Engineering Techniques
Persona assignment: "You are an expert X with Y years of experience..." improves domain performance. Be specific about expertise.
Chain-of-thought: "Think step by step" or "Let's work through this systematically" improves reasoning on complex tasks.
Output formatting: Explicit format instructions with examples. JSON schema specifications. "Respond only with valid JSON matching this schema: {...}"
Negative constraints: "Do not include..." is often clearer than "Only include..." for boundary setting.
Few-shot examples: 2-5 input-output examples demonstrating desired behavior. Choose diverse examples covering edge cases.
Self-consistency: For critical decisions, ask the model multiple times and look for consensus.
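Self-consistency is simple to implement. A minimal voting sketch, assuming a `call_model` function that returns one sampled completion per call:

```python
# Sample the same prompt several times and keep the majority answer.
# `call_model(prompt, temperature)` is a placeholder for your own client.
from collections import Counter

def self_consistent_answer(call_model, prompt, n=5):
    answers = [call_model(prompt, temperature=0.7) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes > 1 else answers[0]  # no consensus: fall back to the first sample
```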
Decomposition: Break complex tasks into simpler subtasks. "First identify X, then evaluate Y, finally recommend Z."
Systematic Prompt Optimization
Don't guess at prompts. Optimize systematically:
- Create evaluation set: 50-100 examples with expected outputs
- Baseline: Measure current prompt performance
- Hypothesis: Identify specific failure modes
- Iterate: Change one element at a time
- Measure: Compare against baseline
- Repeat: Until plateau
Track prompt versions with evaluation scores. You're doing machine learning without gradient descent.
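A minimal harness for this loop might look like the sketch below; `call_model`, the JSONL layout, and the exact-match scorer are all placeholders for your own client, data format, and metric.

```python
import json

# Score one prompt version against a fixed evaluation set (JSONL with
# {"input": ..., "expected": ...} per line -- an assumed layout).
def evaluate_prompt(call_model, system_prompt, eval_path):
    with open(eval_path) as f:
        examples = [json.loads(line) for line in f]
    correct = sum(
        call_model(system_prompt, ex["input"]).strip() == ex["expected"].strip()
        for ex in examples
    )
    return correct / len(examples)

# Track every prompt version with its score so regressions stay visible:
# scores["v7"] = evaluate_prompt(call_model, PROMPT_V7, "eval.jsonl")
```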
When Prompting Fails
Signs you've hit prompting limits:
- Performance plateaus despite iterations
- The model "understands" but doesn't "do"
- Inconsistency that instructions can't fix
- Required behavior contradicts model's training
- Token costs are unsustainable
These are signals to consider fine-tuning.
Fine-tuning Deep Dive
Fine-tuning is often misunderstood. It's not "teaching the model new knowledge"—it's "changing how the model behaves." The base model already knows most things; fine-tuning adjusts its default behaviors, output patterns, and decision boundaries. Understanding this distinction prevents common mistakes.
What fine-tuning actually changes: When you fine-tune, you're adjusting the model's weights so that, given similar inputs to your training data, it produces similar outputs. You're not adding facts to a database—you're changing the probability distribution over outputs. This is why fine-tuning works well for style/format changes but poorly for injecting specific factual knowledge (use RAG for that).
The overfitting trap: With small datasets, the model can memorize training examples rather than generalizing patterns. Signs of overfitting: perfect performance on training data, poor performance on new inputs, outputs that look like verbatim copies of training examples. Validation splits and diverse test cases are essential.
Data Requirements
Quality matters more than quantity. Guidelines:
Minimum viable dataset:
- Simple format/style changes: 50-100 examples
- Moderate task learning: 200-500 examples
- Complex domain adaptation: 1,000-5,000 examples
- Significant behavior change: 5,000-10,000 examples
Data quality checklist:
- Diverse: Cover the full range of expected inputs
- Representative: Match production distribution
- Clean: No errors in outputs you're teaching
- Consistent: Same task, same format, same style
- Balanced: Don't over-represent any category
Data collection strategies:
- Production logs (filter for high-quality outputs)
- Human labeling (expensive but precise)
- Synthetic generation (use stronger models to create training data)
- Semi-synthetic (human edits of model outputs)
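As a sketch of the first strategy, here is how production logs might be filtered into chat-format training records. The `{"messages": [...]}` JSONL layout is the common convention for chat fine-tuning; the log field names (`user_rating`, `system_prompt`, and so on) are assumptions about your own schema.

```python
import json

# Keep only highly rated interactions and write them as chat-format JSONL.
# Field names below are assumptions about your own log schema.
def logs_to_training_file(logs, out_path, min_rating=4):
    with open(out_path, "w") as out:
        for log in logs:
            if log.get("user_rating", 0) < min_rating:
                continue  # skip low-quality outputs
            record = {"messages": [
                {"role": "system", "content": log["system_prompt"]},
                {"role": "user", "content": log["user_message"]},
                {"role": "assistant", "content": log["model_output"]},
            ]}
            out.write(json.dumps(record) + "\n")
```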
Fine-tuning Approaches
Supervised Fine-tuning (SFT): Train on input-output pairs. The model learns to produce outputs similar to your examples. This is the most common approach.
Instruction Fine-tuning: Train on instruction-following examples. Teaches the model to follow a style of instructions.
RLHF/DPO (Preference Optimization): Train on preference pairs (better vs. worse outputs). More complex but can improve subtle quality dimensions.
LoRA/QLoRA (Parameter-Efficient Fine-tuning): Train adapter layers instead of full model weights. Much cheaper, nearly as effective for most tasks. This is what most teams should use.
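For open models, a parameter-efficient setup with the Hugging Face `peft` library looks roughly like the following; the base model and hyperparameters shown are common starting points, not recommendations for your task.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Wrap a base causal LM with LoRA adapters; only the adapters are trained.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of all weights
```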
The Fine-tuning Process
- Prepare data:
  - Format as conversation/completion pairs
  - Split: 90% train, 10% validation
  - Verify data quality manually
- Choose base model:
  - Match capability level to task complexity
  - Consider fine-tuning APIs (OpenAI, Anthropic) vs. open models
  - Factor in inference cost and latency
- Configure training:
  - Learning rate (typically 1e-5 to 5e-5 for full fine-tuning, higher for LoRA)
  - Number of epochs (1-3 for most tasks; more epochs = more overfitting risk)
  - Batch size (larger is more stable, limited by memory)
- Train and monitor:
  - Watch validation loss (should decrease, then plateau)
  - Early stopping if validation loss increases
- Evaluate thoroughly:
  - Task performance on held-out examples
  - General capabilities (safety, instruction following)
  - Format consistency
  - Edge case handling
- Deploy carefully:
  - A/B test against baseline
  - Monitor production metrics
  - Have rollback ready
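As a concrete sketch of the configure/train/monitor steps with a hosted API (the OpenAI Python client shown here; file names, model choice, and epoch count are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Upload the prepared training and validation splits.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

# Launch the job with a small number of epochs to limit overfitting.
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={"n_epochs": 2},
)

# Poll the job and watch validation loss; stop or retrain if it starts rising.
status = client.fine_tuning.jobs.retrieve(job.id)
print(status.status)
```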
Fine-tuning Pitfalls
Overfitting to training data: Model memorizes examples instead of learning patterns. Fix: More diverse data, fewer epochs, regularization.
Catastrophic forgetting: Model loses general capabilities while learning specific task. Fix: Include general examples in training, use LoRA, evaluate general abilities.
Distribution mismatch: Training data doesn't match production queries. Fix: Sample training data from production, use data augmentation.
Format fragility: Model only works with exact input formats it was trained on. Fix: Vary input formats in training data.
Evaluation gaps: Model looks great on held-out data but fails in production. Fix: Build comprehensive evaluation covering edge cases.
Hybrid Approaches
The best systems often combine prompting and fine-tuning:
Fine-tune for format, prompt for content
Fine-tune a model to reliably produce JSON in your schema. Use prompts to specify what content goes in the JSON. You get format consistency from fine-tuning and flexibility from prompting.
Fine-tune base behavior, prompt for customization
Fine-tune a model for your domain and default behavior. Use per-request prompts to customize for specific users or contexts. The fine-tuned model follows domain conventions; prompts handle variation.
Fine-tune small, prompt large
Use a fine-tuned small model for classification, routing, or simple generation. Use a prompted large model for complex reasoning. Combine outputs. You get cost efficiency and capability.
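A sketch of the routing glue for this pattern; the classifier, confidence threshold, and model callables are all assumptions about your own stack.

```python
# Route routine queries to the cheap fine-tuned model and everything else
# to the prompted frontier model. All callables here are placeholders.
def handle_query(query, classify, small_model, large_model, threshold=0.8):
    label, confidence = classify(query)        # e.g., a small fine-tuned classifier
    if label == "routine" and confidence >= threshold:
        return small_model(query)              # short prompt, trained behavior
    return large_model(query)                  # long prompt, frontier reasoning
```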
The "Distillation" Pattern
Use a prompted large model (GPT-4, Claude 3 Opus) to generate training data. Fine-tune a smaller model on this data. Deploy the smaller model for cost efficiency. This is a powerful pattern for production systems.
```
GPT-4 + Complex Prompt → Generate 1000 training examples
        ↓
Fine-tune GPT-3.5 or open model
        ↓
Deploy fine-tuned model at 10x lower cost
```
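In code, the generation half of this pattern might look like the sketch below (OpenAI Python client; the teacher prompt, model name, and output path are illustrative). Filter the generations for quality before fine-tuning on them.

```python
import json
from openai import OpenAI

client = OpenAI()
TEACHER_PROMPT = "You are an expert support agent..."  # your full production prompt

# Ask the strong prompted model to answer each question, then store the
# pairs as chat-format training data for the smaller model.
def generate_examples(questions, out_path="distill.jsonl"):
    with open(out_path, "w") as out:
        for q in questions:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "system", "content": TEACHER_PROMPT},
                          {"role": "user", "content": q}],
            )
            answer = resp.choices[0].message.content
            out.write(json.dumps({"messages": [
                {"role": "user", "content": q},
                {"role": "assistant", "content": answer},
            ]}) + "\n")
```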
Cost Analysis
Let's do real math. Consider a customer support assistant:
Prompting approach:
- System prompt: 2,000 tokens
- Average conversation: 1,500 tokens
- Total per query: 3,500 tokens
- At $0.01 per 1K tokens: ~$0.035/query
Fine-tuned approach:
- System prompt: 200 tokens (minimal with fine-tuning)
- Average conversation: 1,500 tokens
- Total per query: 1,700 tokens
- Training cost: $5,000 (one-time)
- At $0.01 per 1K tokens: ~$0.017/query
Break-even calculation:
- Cost savings: $0.018/query
- Break-even: $5,000 / $0.018 ≈ 280,000 queries
At 1M queries/month, you save ~$18,000/month after break-even. At 10K queries/month, break-even takes 2+ years.
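The same arithmetic as a small calculator you can adapt to your own numbers (the $0.01 per 1K tokens rate matches the assumption above):

```python
# Break-even point for a one-time training cost, given per-query token savings.
def break_even_queries(training_cost, tokens_saved_per_query, price_per_1k_tokens=0.01):
    savings_per_query = tokens_saved_per_query / 1000 * price_per_1k_tokens
    return training_cost / savings_per_query

print(break_even_queries(5_000, 1_800))  # ~277,800 queries, i.e. roughly 280K
```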
Include in your calculation:
- Engineering time for fine-tuning (~40 hours minimum)
- Ongoing maintenance (retraining with new data)
- Quality differences (may need more/fewer queries to accomplish goals)
Real-World Examples
Example 1: Legal Document Summarization
Initial approach: Prompting with GPT-4
- Detailed system prompt with legal terminology
- Few-shot examples of summaries
- Results: 78% user satisfaction, good for prototyping
Issue: Inconsistent format, missed key clauses
Solution: Fine-tuned model on 500 human-written summaries
- Consistent section structure
- Improved clause identification
- Results: 91% user satisfaction, 40% lower token cost
Lesson: Domain-specific output conventions benefit from fine-tuning.
Example 2: Code Review Assistant
Initial approach: Fine-tuned model on code review comments
- 2,000 training examples from senior engineers
- Results: Good at catching simple issues, missed complex patterns
Issue: Couldn't adapt to different codebases and conventions
Solution: Switched to prompting with codebase context
- Dynamic prompts with repo-specific style guides
- RAG for codebase patterns
- Results: Better adaptation, competitive quality
Lesson: Flexibility requirements favor prompting + RAG.
Example 3: Customer Support Bot
Initial approach: Prompted base model
- Long system prompt with product information
- Results: 65% resolution rate, high latency
Solution: Hybrid approach
- Fine-tuned model for common questions (80% of volume)
- Prompted large model for complex issues (20% of volume)
- Router to classify and route
- Results: 82% resolution rate, 60% cost reduction
Lesson: Hybrid approaches capture benefits of both.
Decision Checklist
Use this checklist when deciding:
Default to prompting if:
- You're early in development
- Requirements are changing
- You have < 100 quality examples
- You need flexibility across contexts
- Using frontier models (GPT-4, Claude 3 Opus)
- Volume is < 100K queries/month
Consider fine-tuning if:
- Prompting quality has plateaued
- You have 500+ quality examples
- Output format consistency is critical
- Domain conventions are important
- Latency must be minimized
- Volume is > 500K queries/month
- You need to modify base model behaviors
Choose hybrid if:
- Different query types have different needs
- You want format consistency + content flexibility
- Cost optimization is important at scale
- You can build routing infrastructure
Conclusion
The prompting vs. fine-tuning decision isn't binary. It's about finding the right point on a spectrum for your specific use case.
Start with prompting. Iterate systematically. Measure rigorously. Graduate to fine-tuning when you have evidence it will help—not before.
The best teams treat this as an ongoing optimization, not a one-time decision. As your data grows, your understanding deepens, and model capabilities evolve, the right balance shifts. Build infrastructure that lets you experiment with both approaches and measure what works.
Related Articles
SFT and RLHF: The Complete Guide to Post-Training LLMs
A deep dive into Supervised Fine-Tuning and Reinforcement Learning from Human Feedback—the techniques that transform base models into useful assistants.
Advanced Prompt Engineering: From Basic to Production-Grade
Master the techniques that separate amateur prompts from production systems—chain-of-thought, structured outputs, model-specific optimization, and prompt architecture.
LLM Evaluation in Production: Beyond Benchmarks
How to evaluate LLM performance in real-world applications, where academic benchmarks often fail to capture what matters.