Fine-Tuning vs Prompting: When to Use Each
A practical guide to deciding between fine-tuning and prompt engineering for your LLM application, based on real-world experience with both approaches.
The Customization Spectrum
When you need an LLM to behave differently than its default, you have a spectrum of options:
- Zero-shot prompting: Just describe what you want
- Few-shot prompting: Provide examples in the prompt
- Prompt engineering: Craft detailed instructions, personas, and constraints
- RAG (Retrieval-Augmented Generation): Inject relevant knowledge at inference time
- Fine-tuning: Train the model on examples of desired behavior
- Continued pre-training: Train on large domain-specific corpora
Most production applications use a combination. The question isn't "prompting OR fine-tuning" but "how much of each?"
This post provides a decision framework based on our experience at Goji AI, where we've deployed both heavily-prompted base models and fine-tuned specialists across different use cases.
Understanding the Trade-offs
Prompting: The Default Choice
Prompting should be your starting point. Here's why:
Advantages:
- Instant iteration: Change prompts in seconds, no training required
- No data requirements: Works without training examples
- Uses frontier models: Access the most capable models immediately
- Flexibility: Different prompts for different contexts without multiple models
- No infrastructure: No GPU clusters, training pipelines, or model hosting
- Transparency: Easy to inspect and understand what the model sees
Limitations:
- Context window cost: Long prompts consume tokens on every request
- Consistency challenges: Hard to guarantee consistent output formats
- Knowledge boundaries: Can only use knowledge in the prompt + model's training
- Latency overhead: Processing long prompts adds latency
- Instruction following: Some behaviors are hard to specify in words
Fine-tuning: The Specialized Tool
Fine-tuning trains the model's weights on your data. It's more powerful but more expensive.
Advantages:
- Behavioral consistency: Learned behaviors are more reliable than instructed ones
- Token efficiency: No need for long system prompts
- Domain knowledge: Can incorporate specialized knowledge into weights
- Output format control: Trained formats are more consistent than prompted ones
- Capability boundaries: Can remove unwanted behaviors more reliably
- Latency improvement: Shorter prompts = faster inference
Limitations:
- Data requirements: Need quality training examples (typically 100-10,000)
- Training costs: Compute, time, and expertise to fine-tune
- Iteration speed: Days to iterate vs. seconds for prompts
- Model lock-in: Fine-tuned weights don't transfer between models
- Catastrophic forgetting: Risk of degrading general capabilities
- Evaluation complexity: Need to test both target task AND general abilities
The Decision Framework
The decision between prompting and fine-tuning isn't binary—it's a progression. Almost every project should start with prompting, and only move to fine-tuning when prompting hits clear limits. Here's why this order matters:
Prompting establishes your quality ceiling: Before fine-tuning, you need to know what's achievable with the base model. If prompting gets you 85% accuracy, fine-tuning might get you 95%. But if prompting only gets 50%, something is fundamentally wrong—either the task is too hard for the model, your evaluation is flawed, or you need a different approach entirely.
Prompting generates training data: Good prompting produces good outputs. These outputs can become training data for fine-tuning. If you fine-tune before prompting, you're creating training data from scratch. If you prompt first, you can filter your best prompt results as training examples.
Fine-tuning locks in decisions: A fine-tuned model encodes your current understanding of the task. Prompts can be changed instantly. Fine-tuned weights require retraining to update. Get the task definition right with prompting before committing to fine-tuning.
Start with Prompting If...
Your task is well-defined in natural language: If you can explain to a human what you want in a few paragraphs, the model can probably follow instructions. Most tasks fall into this category.
You need flexibility: Different users need different behaviors? Different contexts require different approaches? Prompts can be dynamically generated. Fine-tuned models are static.
You're still learning what works: Early in development, requirements change constantly. Prompting lets you iterate in minutes. Fine-tuning locks you into decisions.
You have limited training data: Fine-tuning requires examples. If you don't have them, prompting is your only option. Even with data, prompting establishes a baseline.
You're using frontier models: GPT-4, Claude 3 Opus, and Gemini Ultra are incredibly capable. These models often don't benefit from fine-tuning because their base capabilities are so strong. Prompting extracts most of their potential.
Cost flexibility matters: Prompting costs scale linearly with usage. Fine-tuning has upfront costs but lower per-query costs. For low-to-medium volume, prompting is cheaper.
Consider Fine-tuning If...
Output format must be perfectly consistent: If outputs feed into downstream systems that require exact schemas, fine-tuning on format examples beats even detailed format instructions. We've seen 95%+ format compliance from fine-tuning vs. 80-90% from prompting.
You have domain-specific terminology: Specialized vocabulary, jargon, or conventions that the base model handles poorly. Fine-tuning on domain text helps more than prompting.
You need to remove capabilities: If the model does things you don't want (refuses valid requests, adds unwanted caveats, uses disallowed formats), fine-tuning can suppress these more reliably than prompting.
Latency is critical: Long system prompts add 100-500ms of latency. Fine-tuned models with short prompts are faster. For real-time applications, this matters.
You're operating at high scale: At millions of queries per month, the token cost of long prompts adds up. Fine-tuning amortizes training cost across usage. Break-even depends on your volumes and prompt lengths.
You've exhausted prompt optimization: If you've iterated extensively on prompts and hit a quality ceiling, fine-tuning may break through. But make sure you've actually exhausted prompting first.
The RAG Question
RAG often competes with fine-tuning for knowledge injection:
| Approach | Best For | Drawbacks |
|---|---|---|
| Fine-tuning | Static domain knowledge, behavioral patterns | Knowledge can become stale, limited update flexibility |
| RAG | Dynamic knowledge, citations needed, large knowledge bases | Retrieval latency and errors, context window limits |
They're complementary. Use RAG for knowledge that changes or needs attribution. Use fine-tuning for stable behaviors and conventions.
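For contrast with baking knowledge into weights, here is a minimal sketch of the RAG side of the table: knowledge is fetched at query time and injected into the prompt. The `retrieve` and `call_model` functions are placeholders for your own retriever and LLM client, not calls from any specific library.

```python
# Minimal RAG-style injection sketch. `retrieve` and `call_model` are
# placeholders for your own vector store / search index and LLM client.
def answer_with_rag(query, retrieve, call_model, k=3):
    passages = retrieve(query, k=k)                     # fetch relevant documents
    context = "\n\n".join(p["text"] for p in passages)  # assumes passages carry a "text" field
    prompt = (
        "Answer the question using only the context below, "
        "and cite the passage you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_model(prompt)
```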
Prompt Engineering Deep Dive
Before fine-tuning, exhaust prompting. Here's how:
Structure Your Prompts
A production prompt has distinct sections:
```
[ROLE/PERSONA]
You are an expert tax accountant specializing in small business taxation...

[CONTEXT]
The user is a small business owner using our tax preparation software...

[TASK]
Answer the user's tax question accurately and helpfully...

[CONSTRAINTS]
- Only answer questions about US federal and state taxation
- Do not provide advice on audit defense or legal disputes
- Always recommend consulting a CPA for complex situations

[FORMAT]
Respond in 2-3 paragraphs. Use bullet points for lists of items.
End with a brief disclaimer about seeking professional advice.

[EXAMPLES]
User: Can I deduct my home office?
Assistant: Yes, if you use part of your home regularly and exclusively...

[CONVERSATION]
{conversation_history}

[CURRENT QUERY]
User: {user_message}
```
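In practice, a template like this is assembled at request time from its sections. Below is a minimal sketch of that assembly; the function and variable names are illustrative, not part of any particular framework.

```python
# Minimal sketch: assemble the sectioned prompt above at request time.
# All names here (build_prompt, persona, constraints, ...) are illustrative.
def build_prompt(persona, context, task, constraints, format_spec,
                 examples, conversation_history, user_message):
    constraint_text = "\n".join(f"- {c}" for c in constraints)
    example_text = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in examples)
    return (
        f"{persona}\n\n{context}\n\n{task}\n\n"
        f"Constraints:\n{constraint_text}\n\n"
        f"Format:\n{format_spec}\n\n"
        f"Examples:\n{example_text}\n\n"
        f"{conversation_history}\n\nUser: {user_message}"
    )
```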
Prompt Engineering Techniques
Persona assignment: "You are an expert X with Y years of experience..." improves domain performance. Be specific about expertise.
Chain-of-thought: "Think step by step" or "Let's work through this systematically" improves reasoning on complex tasks.
Output formatting: Explicit format instructions with examples. JSON schema specifications. "Respond only with valid JSON matching this schema: {...}"
Negative constraints: "Do not include..." is often clearer than "Only include..." for boundary setting.
Few-shot examples: 2-5 input-output examples demonstrating desired behavior. Choose diverse examples covering edge cases.
Self-consistency: For critical decisions, ask the model multiple times and look for consensus.
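Self-consistency is simple to implement. A minimal voting sketch, assuming a `call_model` function that returns one sampled completion per call:

```python
# Sample the same prompt several times and keep the majority answer.
# `call_model(prompt, temperature)` is a placeholder for your own client.
from collections import Counter

def self_consistent_answer(call_model, prompt, n=5):
    answers = [call_model(prompt, temperature=0.7) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes > 1 else answers[0]  # no consensus: fall back to the first sample
```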
Decomposition: Break complex tasks into simpler subtasks. "First identify X, then evaluate Y, finally recommend Z."
Systematic Prompt Optimization
Don't guess at prompts. Optimize systematically:
- Create evaluation set: 50-100 examples with expected outputs
- Baseline: Measure current prompt performance
- Hypothesis: Identify specific failure modes
- Iterate: Change one element at a time
- Measure: Compare against baseline
- Repeat: Until plateau
Track prompt versions with evaluation scores. You're doing machine learning without gradient descent.
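A minimal harness for this loop might look like the sketch below; `call_model`, the JSONL layout, and the exact-match scorer are all placeholders for your own client, data format, and metric.

```python
import json

# Score one prompt version against a fixed evaluation set (JSONL with
# {"input": ..., "expected": ...} per line -- an assumed layout).
def evaluate_prompt(call_model, system_prompt, eval_path):
    with open(eval_path) as f:
        examples = [json.loads(line) for line in f]
    correct = sum(
        call_model(system_prompt, ex["input"]).strip() == ex["expected"].strip()
        for ex in examples
    )
    return correct / len(examples)

# Track every prompt version with its score so regressions stay visible:
# scores["v7"] = evaluate_prompt(call_model, PROMPT_V7, "eval.jsonl")
```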
When Prompting Fails
Signs you've hit prompting limits:
- Performance plateaus despite iterations
- The model "understands" but doesn't "do"
- Inconsistency that instructions can't fix
- Required behavior contradicts model's training
- Token costs are unsustainable
These are signals to consider fine-tuning.
Fine-tuning Deep Dive
Fine-tuning is often misunderstood. It's not "teaching the model new knowledge"—it's "changing how the model behaves." The base model already knows most things; fine-tuning adjusts its default behaviors, output patterns, and decision boundaries. Understanding this distinction prevents common mistakes.
What fine-tuning actually changes: When you fine-tune, you're adjusting the model's weights so that, given similar inputs to your training data, it produces similar outputs. You're not adding facts to a database—you're changing the probability distribution over outputs. This is why fine-tuning works well for style/format changes but poorly for injecting specific factual knowledge (use RAG for that).
The overfitting trap: With small datasets, the model can memorize training examples rather than generalizing patterns. Signs of overfitting: perfect performance on training data, poor performance on new inputs, outputs that look like verbatim copies of training examples. Validation splits and diverse test cases are essential.
Data Requirements
Quality matters more than quantity. Guidelines:
Minimum viable dataset:
- Simple format/style changes: 50-100 examples
- Moderate task learning: 200-500 examples
- Complex domain adaptation: 1,000-5,000 examples
- Significant behavior change: 5,000-10,000 examples
Data quality checklist:
- Diverse: Cover the full range of expected inputs
- Representative: Match production distribution
- Clean: No errors in outputs you're teaching
- Consistent: Same task, same format, same style
- Balanced: Don't over-represent any category
Data collection strategies:
- Production logs (filter for high-quality outputs)
- Human labeling (expensive but precise)
- Synthetic generation (use stronger models to create training data)
- Semi-synthetic (human edits of model outputs)
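As a sketch of the first strategy, here is how production logs might be filtered into chat-format training records. The `{"messages": [...]}` JSONL layout is the common convention for chat fine-tuning; the log field names (`user_rating`, `system_prompt`, and so on) are assumptions about your own schema.

```python
import json

# Keep only highly rated interactions and write them as chat-format JSONL.
# Field names below are assumptions about your own log schema.
def logs_to_training_file(logs, out_path, min_rating=4):
    with open(out_path, "w") as out:
        for log in logs:
            if log.get("user_rating", 0) < min_rating:
                continue  # skip low-quality outputs
            record = {"messages": [
                {"role": "system", "content": log["system_prompt"]},
                {"role": "user", "content": log["user_message"]},
                {"role": "assistant", "content": log["model_output"]},
            ]}
            out.write(json.dumps(record) + "\n")
```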
Fine-tuning Approaches
Supervised Fine-tuning (SFT): Train on input-output pairs. The model learns to produce outputs similar to your examples. This is the most common approach.
Instruction Fine-tuning: Train on instruction-following examples. Teaches the model to follow a style of instructions.
RLHF/DPO (Preference Optimization): Train on preference pairs (better vs. worse outputs). More complex but can improve subtle quality dimensions.
LoRA/QLoRA (Parameter-Efficient Fine-tuning): Train adapter layers instead of full model weights. Much cheaper, nearly as effective for most tasks. This is what most teams should use.
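For open models, a parameter-efficient setup with the Hugging Face `peft` library looks roughly like the following; the base model and hyperparameters shown are common starting points, not recommendations for your task.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Wrap a base causal LM with LoRA adapters; only the adapters are trained.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of all weights
```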
The Fine-tuning Process
- Prepare data:
  - Format as conversation/completion pairs
  - Split: 90% train, 10% validation
  - Verify data quality manually
- Choose base model:
  - Match capability level to task complexity
  - Consider fine-tuning APIs (OpenAI, Anthropic) vs. open models
  - Factor in inference cost and latency
- Configure training:
  - Learning rate (typically 1e-5 to 5e-5 for full fine-tuning, higher for LoRA)
  - Number of epochs (1-3 for most tasks; more epochs = more overfitting risk)
  - Batch size (larger is more stable, limited by memory)
- Train and monitor:
  - Watch validation loss (should decrease, then plateau)
  - Early stopping if validation loss increases
- Evaluate thoroughly:
  - Task performance on held-out examples
  - General capabilities (safety, instruction following)
  - Format consistency
  - Edge case handling
- Deploy carefully:
  - A/B test against baseline
  - Monitor production metrics
  - Have rollback ready
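As a concrete sketch of the configure/train/monitor steps with a hosted API (the OpenAI Python client shown here; file names, model choice, and epoch count are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Upload the prepared training and validation splits.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

# Launch the job with a small number of epochs to limit overfitting.
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={"n_epochs": 2},
)

# Poll the job and watch validation loss; stop or retrain if it starts rising.
status = client.fine_tuning.jobs.retrieve(job.id)
print(status.status)
```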
Fine-tuning Pitfalls
Overfitting to training data: Model memorizes examples instead of learning patterns. Fix: More diverse data, fewer epochs, regularization.
Catastrophic forgetting: Model loses general capabilities while learning specific task. Fix: Include general examples in training, use LoRA, evaluate general abilities.
Distribution mismatch: Training data doesn't match production queries. Fix: Sample training data from production, use data augmentation.
Format fragility: Model only works with exact input formats it was trained on. Fix: Vary input formats in training data.
Evaluation gaps: Model looks great on held-out data but fails in production. Fix: Build comprehensive evaluation covering edge cases.
Hybrid Approaches
The best systems often combine prompting and fine-tuning:
Fine-tune for format, prompt for content
Fine-tune a model to reliably produce JSON in your schema. Use prompts to specify what content goes in the JSON. You get format consistency from fine-tuning and flexibility from prompting.
Fine-tune base behavior, prompt for customization
Fine-tune a model for your domain and default behavior. Use per-request prompts to customize for specific users or contexts. The fine-tuned model follows domain conventions; prompts handle variation.
Fine-tune small, prompt large
Use a fine-tuned small model for classification, routing, or simple generation. Use a prompted large model for complex reasoning. Combine outputs. You get cost efficiency and capability.
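A sketch of the routing glue for this pattern; the classifier, confidence threshold, and model callables are all assumptions about your own stack.

```python
# Route routine queries to the cheap fine-tuned model and everything else
# to the prompted frontier model. All callables here are placeholders.
def handle_query(query, classify, small_model, large_model, threshold=0.8):
    label, confidence = classify(query)        # e.g., a small fine-tuned classifier
    if label == "routine" and confidence >= threshold:
        return small_model(query)              # short prompt, trained behavior
    return large_model(query)                  # long prompt, frontier reasoning
```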
The "Distillation" Pattern
Use a prompted large model (GPT-4, Claude 3 Opus) to generate training data. Fine-tune a smaller model on this data. Deploy the smaller model for cost efficiency. This is a powerful pattern for production systems.
```
GPT-4 + Complex Prompt → Generate 1000 training examples
        ↓
Fine-tune GPT-3.5 or open model
        ↓
Deploy fine-tuned model at 10x lower cost
```
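In code, the generation half of this pattern might look like the sketch below (OpenAI Python client; the teacher prompt, model name, and output path are illustrative). Filter the generations for quality before fine-tuning on them.

```python
import json
from openai import OpenAI

client = OpenAI()
TEACHER_PROMPT = "You are an expert support agent..."  # your full production prompt

# Ask the strong prompted model to answer each question, then store the
# pairs as chat-format training data for the smaller model.
def generate_examples(questions, out_path="distill.jsonl"):
    with open(out_path, "w") as out:
        for q in questions:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "system", "content": TEACHER_PROMPT},
                          {"role": "user", "content": q}],
            )
            answer = resp.choices[0].message.content
            out.write(json.dumps({"messages": [
                {"role": "user", "content": q},
                {"role": "assistant", "content": answer},
            ]}) + "\n")
```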
Cost Analysis
Let's do real math. Consider a customer support assistant:
Prompting approach:
- System prompt: 2,000 tokens
- Average conversation: 1,500 tokens
- Total per query: 3,500 tokens
- At $0.01 per 1K tokens: ~$0.035/query
Fine-tuned approach:
- System prompt: 200 tokens (minimal with fine-tuning)
- Average conversation: 1,500 tokens
- Total per query: 1,700 tokens
- Training cost: $5,000 (one-time)
- At $0.01 per 1K tokens: ~$0.017/query
Break-even calculation:
- Cost savings: $0.018/query
- Break-even: $5,000 / $0.018 ≈ 280,000 queries
At 1M queries/month, you save ~$18,000/month after break-even. At 10K queries/month, break-even takes 2+ years.
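The same arithmetic as a small calculator you can adapt to your own numbers (the $0.01 per 1K tokens rate matches the assumption above):

```python
# Break-even point for a one-time training cost, given per-query token savings.
def break_even_queries(training_cost, tokens_saved_per_query, price_per_1k_tokens=0.01):
    savings_per_query = tokens_saved_per_query / 1000 * price_per_1k_tokens
    return training_cost / savings_per_query

print(break_even_queries(5_000, 1_800))  # ~277,800 queries, i.e. roughly 280K
```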
Include in your calculation:
- Engineering time for fine-tuning (~40 hours minimum)
- Ongoing maintenance (retraining with new data)
- Quality differences (may need more/fewer queries to accomplish goals)
Real-World Examples
Example 1: Legal Document Summarization
Initial approach: Prompting with GPT-4
- Detailed system prompt with legal terminology
- Few-shot examples of summaries
- Results: 78% user satisfaction, good for prototyping
Issue: Inconsistent format, missed key clauses
Solution: Fine-tuned model on 500 human-written summaries
- Consistent section structure
- Improved clause identification
- Results: 91% user satisfaction, 40% lower token cost
Lesson: Domain-specific output conventions benefit from fine-tuning.
Example 2: Code Review Assistant
Initial approach: Fine-tuned model on code review comments
- 2,000 training examples from senior engineers
- Results: Good at catching simple issues, missed complex patterns
Issue: Couldn't adapt to different codebases and conventions
Solution: Switched to prompting with codebase context
- Dynamic prompts with repo-specific style guides
- RAG for codebase patterns
- Results: Better adaptation, competitive quality
Lesson: Flexibility requirements favor prompting + RAG.
Example 3: Customer Support Bot
Initial approach: Prompted base model
- Long system prompt with product information
- Results: 65% resolution rate, high latency
Solution: Hybrid approach
- Fine-tuned model for common questions (80% of volume)
- Prompted large model for complex issues (20% of volume)
- Router to classify and route
- Results: 82% resolution rate, 60% cost reduction
Lesson: Hybrid approaches capture benefits of both.
Decision Checklist
Use this checklist when deciding:
Default to prompting if:
- You're early in development
- Requirements are changing
- You have < 100 quality examples
- You need flexibility across contexts
- Using frontier models (GPT-4, Claude 3 Opus)
- Volume is < 100K queries/month
Consider fine-tuning if:
- Prompting quality has plateaued
- You have 500+ quality examples
- Output format consistency is critical
- Domain conventions are important
- Latency must be minimized
- Volume is > 500K queries/month
- You need to modify base model behaviors
Choose hybrid if:
- Different query types have different needs
- You want format consistency + content flexibility
- Cost optimization is important at scale
- You can build routing infrastructure
Conclusion
The prompting vs. fine-tuning decision isn't binary. It's about finding the right point on a spectrum for your specific use case.
Start with prompting. Iterate systematically. Measure rigorously. Graduate to fine-tuning when you have evidence it will help—not before.
The best teams treat this as an ongoing optimization, not a one-time decision. As your data grows, your understanding deepens, and model capabilities evolve, the right balance shifts. Build infrastructure that lets you experiment with both approaches and measure what works.
Related Articles
SFT and RLHF: The Complete Guide to Post-Training LLMs
A deep dive into Supervised Fine-Tuning and Reinforcement Learning from Human Feedback—the techniques that transform base models into useful assistants.
Advanced Prompt Engineering: From Basic to Production-Grade
Master the techniques that separate amateur prompts from production systems—chain-of-thought, structured outputs, model-specific optimization, and prompt architecture.
LLM Evaluation in Production: Beyond Benchmarks
How to evaluate LLM performance in real-world applications, where academic benchmarks often fail to capture what matters.