
Fine-Tuning vs Prompting: When to Use Each

A practical guide to deciding between fine-tuning and prompt engineering for your LLM application, based on real-world experience with both approaches.


The Customization Spectrum

When you need an LLM to behave differently than its default, you have a spectrum of options:

  1. Zero-shot prompting: Just describe what you want
  2. Few-shot prompting: Provide examples in the prompt
  3. Prompt engineering: Craft detailed instructions, personas, and constraints
  4. RAG (Retrieval-Augmented Generation): Inject relevant knowledge at inference time
  5. Fine-tuning: Train the model on examples of desired behavior
  6. Continued pre-training: Train on large domain-specific corpora

Most production applications use a combination. The question isn't "prompting OR fine-tuning" but "how much of each?"

This post provides a decision framework based on our experience at Goji AI, where we've deployed both heavily-prompted base models and fine-tuned specialists across different use cases.

Understanding the Trade-offs

Prompting: The Default Choice

Prompting should be your starting point. Here's why:

Advantages:

  • Instant iteration: Change prompts in seconds, no training required
  • No data requirements: Works without training examples
  • Uses frontier models: Access the most capable models immediately
  • Flexibility: Different prompts for different contexts without multiple models
  • No infrastructure: No GPU clusters, training pipelines, or model hosting
  • Transparency: Easy to inspect and understand what the model sees

Limitations:

  • Context window cost: Long prompts consume tokens on every request
  • Consistency challenges: Hard to guarantee consistent output formats
  • Knowledge boundaries: Can only use knowledge in the prompt + model's training
  • Latency overhead: Processing long prompts adds latency
  • Instruction following: Some behaviors are hard to specify in words

Fine-tuning: The Specialized Tool

Fine-tuning trains the model's weights on your data. It's more powerful but more expensive.

Advantages:

  • Behavioral consistency: Learned behaviors are more reliable than instructed ones
  • Token efficiency: No need for long system prompts
  • Domain knowledge: Can incorporate specialized knowledge into weights
  • Output format control: Trained formats are more consistent than prompted ones
  • Capability boundaries: Can remove unwanted behaviors more reliably
  • Latency improvement: Shorter prompts = faster inference

Limitations:

  • Data requirements: Need quality training examples (typically 100-10,000)
  • Training costs: Compute, time, and expertise to fine-tune
  • Iteration speed: Days to iterate vs. seconds for prompts
  • Model lock-in: Fine-tuned weights don't transfer between models
  • Catastrophic forgetting: Risk of degrading general capabilities
  • Evaluation complexity: Need to test both target task AND general abilities

The Decision Framework

The decision between prompting and fine-tuning isn't binary—it's a progression. Almost every project should start with prompting, and only move to fine-tuning when prompting hits clear limits. Here's why this order matters:

Prompting establishes your quality ceiling: Before fine-tuning, you need to know what's achievable with the base model. If prompting gets you 85% accuracy, fine-tuning might get you 95%. But if prompting only gets 50%, something is fundamentally wrong—either the task is too hard for the model, your evaluation is flawed, or you need a different approach entirely.

Prompting generates training data: Good prompting produces good outputs. These outputs can become training data for fine-tuning. If you fine-tune before prompting, you're creating training data from scratch. If you prompt first, you can filter your best prompt results as training examples.

Fine-tuning locks in decisions: A fine-tuned model encodes your current understanding of the task. Prompts can be changed instantly. Fine-tuned weights require retraining to update. Get the task definition right with prompting before committing to fine-tuning.

Start with Prompting If...

Your task is well-defined in natural language: If you can explain to a human what you want in a few paragraphs, the model can probably follow instructions. Most tasks fall into this category.

You need flexibility: Different users need different behaviors? Different contexts require different approaches? Prompts can be dynamically generated. Fine-tuned models are static.

You're still learning what works: Early in development, requirements change constantly. Prompting lets you iterate in minutes. Fine-tuning locks you into decisions.

You have limited training data: Fine-tuning requires examples. If you don't have them, prompting is your only option. Even with data, prompting establishes a baseline.

You're using frontier models: GPT-4, Claude 3 Opus, and Gemini Ultra are incredibly capable. These models often don't benefit from fine-tuning because their base capabilities are so strong. Prompting extracts most of their potential.

Cost flexibility matters: Prompting costs scale linearly with usage. Fine-tuning has upfront costs but lower per-query costs. For low-to-medium volume, prompting is cheaper.

Consider Fine-tuning If...

Output format must be perfectly consistent: If outputs feed into downstream systems that require exact schemas, fine-tuning on format examples beats even detailed format instructions. We've seen 95%+ format compliance from fine-tuning vs. 80-90% from prompting.

You have domain-specific terminology: Specialized vocabulary, jargon, or conventions that the base model handles poorly. Fine-tuning on domain text helps more than prompting.

You need to remove capabilities: If the model does things you don't want (refuses valid requests, adds unwanted caveats, uses disallowed formats), fine-tuning can suppress these more reliably than prompting.

Latency is critical: Long system prompts add 100-500ms of latency. Fine-tuned models with short prompts are faster. For real-time applications, this matters.

You're operating at high scale: At millions of queries per month, the token cost of long prompts adds up. Fine-tuning amortizes training cost across usage. Break-even depends on your volumes and prompt lengths.

You've exhausted prompt optimization: If you've iterated extensively on prompts and hit a quality ceiling, fine-tuning may break through. But make sure you've actually exhausted prompting first.

The RAG Question

RAG often competes with fine-tuning for knowledge injection:

| Approach | Best For | Drawbacks |
| --- | --- | --- |
| Fine-tuning | Static domain knowledge, behavioral patterns | Knowledge can become stale, limited update flexibility |
| RAG | Dynamic knowledge, citations needed, large knowledge bases | Retrieval latency and errors, context window limits |

They're complementary. Use RAG for knowledge that changes or needs attribution. Use fine-tuning for stable behaviors and conventions.

Prompt Engineering Deep Dive

Before fine-tuning, exhaust prompting. Here's how:

Structure Your Prompts

A production prompt has distinct sections:

Code
[ROLE/PERSONA]
You are an expert tax accountant specializing in small business taxation...

[CONTEXT]
The user is a small business owner using our tax preparation software...

[TASK]
Answer the user's tax question accurately and helpfully...

[CONSTRAINTS]
- Only answer questions about US federal and state taxation
- Do not provide advice on audit defense or legal disputes
- Always recommend consulting a CPA for complex situations

[FORMAT]
Respond in 2-3 paragraphs. Use bullet points for lists of items.
End with a brief disclaimer about seeking professional advice.

[EXAMPLES]
User: Can I deduct my home office?
Assistant: Yes, if you use part of your home regularly and exclusively...

[CONVERSATION]
{conversation_history}

[CURRENT QUERY]
User: {user_message}
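
If prompts are generated dynamically per request, the template above becomes a string with slots. Here is a minimal sketch of filling it at request time, reusing the placeholder names from the template (the section text is trimmed and illustrative):

Code
# Minimal sketch: filling the prompt template above per request.
# Only {conversation_history} and {user_message} vary; the rest is static.

PROMPT_TEMPLATE = """[ROLE/PERSONA]
You are an expert tax accountant specializing in small business taxation.

[CONSTRAINTS]
- Only answer questions about US federal and state taxation.
- Always recommend consulting a CPA for complex situations.

[CONVERSATION]
{conversation_history}

[CURRENT QUERY]
User: {user_message}"""


def build_prompt(history: list[str], user_message: str) -> str:
    """Assemble the full prompt for a single request."""
    return PROMPT_TEMPLATE.format(
        conversation_history="\n".join(history),
        user_message=user_message,
    )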

Prompt Engineering Techniques

Persona assignment: "You are an expert X with Y years of experience..." improves domain performance. Be specific about expertise.

Chain-of-thought: "Think step by step" or "Let's work through this systematically" improves reasoning on complex tasks.

Output formatting: Explicit format instructions with examples. JSON schema specifications. "Respond only with valid JSON matching this schema: {...}"

Negative constraints: "Do not include..." is often clearer than "Only include..." for boundary setting.

Few-shot examples: 2-5 input-output examples demonstrating desired behavior. Choose diverse examples covering edge cases.

Self-consistency: For critical decisions, ask the model multiple times and look for consensus.
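
As a concrete illustration of self-consistency, here is a minimal sketch; `complete` stands in for whatever model-call wrapper you use (it is not a real library function), and sampling temperature should be above zero so the samples actually differ:

Code
from collections import Counter


def self_consistent_answer(prompt: str, complete, n_samples: int = 5) -> str:
    """Sample the model several times and return the most common answer.

    `complete(prompt)` is assumed to return the model's text response.
    """
    answers = [complete(prompt).strip() for _ in range(n_samples)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer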

Decomposition: Break complex tasks into simpler subtasks. "First identify X, then evaluate Y, finally recommend Z."

Systematic Prompt Optimization

Don't guess at prompts. Optimize systematically:

  1. Create evaluation set: 50-100 examples with expected outputs
  2. Baseline: Measure current prompt performance
  3. Hypothesis: Identify specific failure modes
  4. Iterate: Change one element at a time
  5. Measure: Compare against baseline
  6. Repeat: Until plateau

Track prompt versions with evaluation scores. You're doing machine learning without gradient descent.
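
A minimal sketch of that evaluation loop, assuming a JSONL evaluation set and your own `complete` and `score` helpers (none of this is tied to a specific framework):

Code
import json


def evaluate_prompt(prompt_template: str, eval_path: str, complete, score) -> float:
    """Run one prompt version over the evaluation set and return the mean score.

    eval_path points to a JSONL file of {"input": ..., "expected": ...} records;
    `complete` calls the model, `score` compares an output to the expected answer
    (exact match, rubric, or an LLM judge).
    """
    total, n = 0.0, 0
    with open(eval_path) as f:
        for line in f:
            example = json.loads(line)
            output = complete(prompt_template.format(input=example["input"]))
            total += score(output, example["expected"])
            n += 1
    return total / n if n else 0.0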

When Prompting Fails

Signs you've hit prompting limits:

  • Performance plateaus despite iterations
  • The model "understands" but doesn't "do"
  • Inconsistency that instructions can't fix
  • Required behavior contradicts model's training
  • Token costs are unsustainable

These are signals to consider fine-tuning.

Fine-tuning Deep Dive

Fine-tuning is often misunderstood. It's not "teaching the model new knowledge"—it's "changing how the model behaves." The base model already knows most things; fine-tuning adjusts its default behaviors, output patterns, and decision boundaries. Understanding this distinction prevents common mistakes.

What fine-tuning actually changes: When you fine-tune, you're adjusting the model's weights so that, given similar inputs to your training data, it produces similar outputs. You're not adding facts to a database—you're changing the probability distribution over outputs. This is why fine-tuning works well for style/format changes but poorly for injecting specific factual knowledge (use RAG for that).

The overfitting trap: With small datasets, the model can memorize training examples rather than generalizing patterns. Signs of overfitting: perfect performance on training data, poor performance on new inputs, outputs that look like verbatim copies of training examples. Validation splits and diverse test cases are essential.

Data Requirements

Quality matters more than quantity. Guidelines:

Minimum viable dataset:

  • Simple format/style changes: 50-100 examples
  • Moderate task learning: 200-500 examples
  • Complex domain adaptation: 1,000-5,000 examples
  • Significant behavior change: 5,000-10,000 examples

Data quality checklist:

  • Diverse: Cover the full range of expected inputs
  • Representative: Match production distribution
  • Clean: No errors in outputs you're teaching
  • Consistent: Same task, same format, same style
  • Balanced: Don't over-represent any category

Data collection strategies:

  • Production logs (filter for high-quality outputs)
  • Human labeling (expensive but precise)
  • Synthetic generation (use stronger models to create training data)
  • Semi-synthetic (human edits of model outputs)
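
A minimal sketch of turning filtered examples into training files with a validation split; the chat-style JSONL shown here is common for fine-tuning APIs, but check your provider's expected schema:

Code
import json
import random


def write_finetuning_files(examples, train_path, val_path, val_fraction=0.1, seed=42):
    """Shuffle (system, user, assistant) tuples and write chat-style JSONL splits.

    `examples` should already have passed your quality filters.
    """
    random.Random(seed).shuffle(examples)
    n_val = max(1, int(len(examples) * val_fraction))
    splits = {val_path: examples[:n_val], train_path: examples[n_val:]}
    for path, rows in splits.items():
        with open(path, "w") as f:
            for system, user, assistant in rows:
                record = {"messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": user},
                    {"role": "assistant", "content": assistant},
                ]}
                f.write(json.dumps(record) + "\n")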

Fine-tuning Approaches

Supervised Fine-tuning (SFT): Train on input-output pairs. The model learns to produce outputs similar to your examples. This is the most common approach.

Instruction Fine-tuning: Train on instruction-following examples. Teaches the model to follow a style of instructions.

RLHF/DPO (Preference Optimization): Train on preference pairs (better vs. worse outputs). More complex but can improve subtle quality dimensions.

LoRA/QLoRA (Parameter-Efficient Fine-tuning): Train adapter layers instead of full model weights. Much cheaper, nearly as effective for most tasks. This is what most teams should use.
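
For reference, here is a rough sketch of LoRA setup with Hugging Face `transformers` and `peft`; the base model, rank, and target modules are placeholders that depend on your task and architecture:

Code
# Rough LoRA setup sketch; hyperparameters and model name are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

lora_config = LoraConfig(
    r=8,                                  # adapter rank: capacity vs. cost
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; architecture-specific
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of total weights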

The Fine-tuning Process

  1. Prepare data:

    • Format as conversation/completion pairs
    • Split: 90% train, 10% validation
    • Verify data quality manually
  2. Choose base model:

    • Match capability level to task complexity
    • Consider fine-tuning APIs (OpenAI, Anthropic) vs. open models
    • Factor in inference cost and latency
  3. Configure training:

    • Learning rate (typically 1e-5 to 5e-5 for full fine-tuning, higher for LoRA)
    • Number of epochs (1-3 for most tasks; more epochs = more overfitting risk)
    • Batch size (larger is more stable, limited by memory)
  4. Train and monitor:

    • Watch validation loss (should decrease, then plateau)
    • Early stopping if validation loss increases
  5. Evaluate thoroughly:

    • Task performance on held-out examples
    • General capabilities (safety, instruction following)
    • Format consistency
    • Edge case handling
  6. Deploy carefully:

    • A/B test against baseline
    • Monitor production metrics
    • Have rollback ready

Fine-tuning Pitfalls

Overfitting to training data: Model memorizes examples instead of learning patterns. Fix: More diverse data, fewer epochs, regularization.

Catastrophic forgetting: Model loses general capabilities while learning specific task. Fix: Include general examples in training, use LoRA, evaluate general abilities.

Distribution mismatch: Training data doesn't match production queries. Fix: Sample training data from production, use data augmentation.

Format fragility: Model only works with exact input formats it was trained on. Fix: Vary input formats in training data.

Evaluation gaps: Model looks great on held-out data but fails in production. Fix: Build comprehensive evaluation covering edge cases.

Hybrid Approaches

The best systems often combine prompting and fine-tuning:

Fine-tune for format, prompt for content

Fine-tune a model to reliably produce JSON in your schema. Use prompts to specify what content goes in the JSON. You get format consistency from fine-tuning and flexibility from prompting.

Fine-tune base behavior, prompt for customization

Fine-tune a model for your domain and default behavior. Use per-request prompts to customize for specific users or contexts. The fine-tuned model follows domain conventions; prompts handle variation.

Fine-tune small, prompt large

Use a fine-tuned small model for classification, routing, or simple generation. Use a prompted large model for complex reasoning. Combine outputs. You get cost efficiency and capability.
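
A minimal routing sketch, assuming you already have a classifier plus wrappers around the two models (all function names here are illustrative):

Code
def handle_query(query: str, classify, small_model, large_model) -> str:
    """Route routine queries to the fine-tuned small model, the rest to the
    prompted large model.

    `classify(query)` returns a label such as "simple" or "complex"; the two
    model functions wrap whatever inference APIs you use.
    """
    if classify(query) == "simple":
        return small_model(query)   # fine-tuned, short prompt, cheap
    return large_model(query)       # frontier model with a detailed prompt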

The "Distillation" Pattern

Use a prompted large model (GPT-4, Claude 3 Opus) to generate training data. Fine-tune a smaller model on this data. Deploy the smaller model for cost efficiency. This is a powerful pattern for production systems.

Code
GPT-4 + Complex Prompt → Generate 1000 training examples
                              ↓
                    Fine-tune GPT-3.5 or open model
                              ↓
                    Deploy fine-tuned model at 10x lower cost
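
A sketch of the data-generation half of this pattern; `teacher` stands in for your frontier-model client, and the prompt wording and seed inputs are illustrative:

Code
import json


def generate_distillation_data(seed_inputs, teacher, out_path):
    """Label seed inputs with a prompted frontier model and save as training data.

    `teacher(prompt)` is assumed to return the model's text response.
    Filter or spot-check outputs before fine-tuning the smaller model on them.
    """
    with open(out_path, "w") as f:
        for user_input in seed_inputs:
            prompt = (
                "Answer the following customer question in our standard support format:\n\n"
                + user_input
            )
            record = {"messages": [
                {"role": "user", "content": user_input},
                {"role": "assistant", "content": teacher(prompt)},
            ]}
            f.write(json.dumps(record) + "\n")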

Cost Analysis

Let's do real math. Consider a customer support assistant:

Prompting approach:

  • System prompt: 2,000 tokens
  • Average conversation: 1,500 tokens
  • Total per query: 3,500 tokens
  • At $0.01/1K tokens (GPT-4o): $0.035/query

Fine-tuned approach:

  • System prompt: 200 tokens (minimal with fine-tuning)
  • Average conversation: 1,500 tokens
  • Total per query: 1,700 tokens
  • Training cost: $5,000 (one-time)
  • At $0.01/1K tokens: $0.017/query

Break-even calculation:

  • Cost savings: $0.018/query
  • Break-even: $5,000 / $0.018 ≈ 280,000 queries

At 1M queries/month, you save ~$18,000/month after break-even. At 10K queries/month, break-even takes 2+ years.
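
The same break-even arithmetic as a small helper, so you can plug in your own token counts and prices (the figures above are just the worked example):

Code
def breakeven_queries(prompt_tokens, ft_prompt_tokens, other_tokens,
                      price_per_1k, training_cost):
    """Return the query volume at which fine-tuning pays for itself."""
    prompted_cost = (prompt_tokens + other_tokens) / 1000 * price_per_1k
    finetuned_cost = (ft_prompt_tokens + other_tokens) / 1000 * price_per_1k
    savings_per_query = prompted_cost - finetuned_cost
    return training_cost / savings_per_query


# Worked example from above:
# breakeven_queries(2000, 200, 1500, 0.01, 5000) -> ~278,000 queries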

Include in your calculation:

  • Engineering time for fine-tuning (~40 hours minimum)
  • Ongoing maintenance (retraining with new data)
  • Quality differences (may need more/fewer queries to accomplish goals)

Real-World Examples

Example 1: Legal Document Summarization

Initial approach: Prompting with GPT-4

  • Detailed system prompt with legal terminology
  • Few-shot examples of summaries
  • Results: 78% user satisfaction, good for prototyping

Issue: Inconsistent format, missed key clauses

Solution: Fine-tuned model on 500 human-written summaries

  • Consistent section structure
  • Improved clause identification
  • Results: 91% user satisfaction, 40% lower token cost

Lesson: Domain-specific output conventions benefit from fine-tuning.

Example 2: Code Review Assistant

Initial approach: Fine-tuned model on code review comments

  • 2,000 training examples from senior engineers
  • Results: Good at catching simple issues, missed complex patterns

Issue: Couldn't adapt to different codebases and conventions

Solution: Switched to prompting with codebase context

  • Dynamic prompts with repo-specific style guides
  • RAG for codebase patterns
  • Results: Better adaptation, competitive quality

Lesson: Flexibility requirements favor prompting + RAG.

Example 3: Customer Support Bot

Initial approach: Prompted base model

  • Long system prompt with product information
  • Results: 65% resolution rate, high latency

Solution: Hybrid approach

  • Fine-tuned model for common questions (80% of volume)
  • Prompted large model for complex issues (20% of volume)
  • Router to classify and route
  • Results: 82% resolution rate, 60% cost reduction

Lesson: Hybrid approaches capture benefits of both.

Decision Checklist

Use this checklist when deciding:

Default to prompting if:

  • You're early in development
  • Requirements are changing
  • You have < 100 quality examples
  • You need flexibility across contexts
  • Using frontier models (GPT-4, Claude 3 Opus)
  • Volume is < 100K queries/month

Consider fine-tuning if:

  • Prompting quality has plateaued
  • You have 500+ quality examples
  • Output format consistency is critical
  • Domain conventions are important
  • Latency must be minimized
  • Volume is > 500K queries/month
  • You need to modify base model behaviors

Choose hybrid if:

  • Different query types have different needs
  • You want format consistency + content flexibility
  • Cost optimization is important at scale
  • You can build routing infrastructure

Conclusion

The prompting vs. fine-tuning decision isn't binary. It's about finding the right point on a spectrum for your specific use case.

Start with prompting. Iterate systematically. Measure rigorously. Graduate to fine-tuning when you have evidence it will help—not before.

The best teams treat this as an ongoing optimization, not a one-time decision. As your data grows, your understanding deepens, and model capabilities evolve, the right balance shifts. Build infrastructure that lets you experiment with both approaches and measure what works.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
