Synthetic Data Generation for LLM Training
How to generate high-quality synthetic training data using LLMs—from NVIDIA's Nemotron pipeline to quality filtering techniques and avoiding model collapse.
The Synthetic Data Revolution
Real-world training data is expensive, limited, and often privacy-constrained. Synthetic data generation using LLMs has emerged as a powerful alternative.
From research: "Synthetic data will become the default, not the exception, but its success is based on transparency and rigorous validation."
The results speak for themselves: "Nemotron-4 340B is an LLM that synthetically generated 98% of the data used for its supervised fine-tuning and preference fine-tuning."
Why Synthetic Data?
Benefits
From research: "Benefits include cost-effectiveness, broad coverage, and controllable diversity."
| Benefit | Description |
|---|---|
| Cost | Orders of magnitude cheaper than human annotation |
| Scale | Generate millions of examples quickly |
| Coverage | Create examples for rare edge cases |
| Privacy | No real user data needed |
| Control | Precise control over data characteristics |
Use Cases
From research: "Synthetic data is widely used to train foundation models when data is scarce, sensitive, or costly to collect."
Common applications:
- Instruction tuning for chat models
- Domain-specific fine-tuning
- Preference data for RLHF/DPO
- Multilingual expansion
- Code generation training
The NVIDIA Nemotron Pipeline
Overview
NVIDIA's pipeline represents the state of the art in synthetic data generation:
From NVIDIA: "The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline to generate synthetic data used for training and refining LLMs."
Pipeline components:
- Nemotron-4-340B-Instruct: Generates synthetic responses
- Nemotron-4-340B-Reward: Ranks and filters quality
- NeMo Curator: Data curation and deduplication
Scale of Generation
From NVIDIA: "Out of 10.6 trillion tokens, 3,534,013,958,278 tokens are synthetically generated."
The Nemotron-CC pipeline: "By employing classifier ensembling and synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data, recovering up to 90% of content lost by filtering."
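The rephrasing step can be approximated with a simple prompt. The sketch below is illustrative only, not NVIDIA's actual pipeline; the prompt wording and the model.generate interface are assumptions:
REPHRASE_PROMPT = """Rewrite the following web text as clear, well-structured prose.
Preserve every fact and technical detail; drop boilerplate, navigation text, and noise.
Text:
{passage}
Rewritten version:"""

def rephrase_low_quality(passages: list, model) -> list:
    """Turn borderline-quality passages into cleaner training text (illustrative sketch)."""
    return [model.generate(REPHRASE_PROMPT.format(passage=p)) for p in passages]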
Quality Filtering
From NVIDIA: "Robust SDG methods go beyond just generating response data—they also include verification and checks to ensure data quality remains high. LLM accuracy is often directly determined by the quality rather than quantity of the training data."
Quality buckets approach: From NVIDIA: "For Nemotron 3 Nano, they trained only on the Medium-Quality, Medium-High-Quality, and High-Quality buckets."
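A minimal sketch of bucket-based filtering, assuming a quality classifier that returns a score in [0, 1]; the cutoffs and the classifier.score interface below are illustrative assumptions, not NVIDIA's actual thresholds:
QUALITY_BUCKETS = [
    ("High-Quality", 0.9),           # illustrative cutoffs, not NVIDIA's
    ("Medium-High-Quality", 0.75),
    ("Medium-Quality", 0.5),
    ("Low-Quality", 0.0),
]

def assign_bucket(text: str, quality_classifier) -> str:
    """Map a classifier score in [0, 1] to a named quality bucket."""
    score = quality_classifier.score(text)
    for name, cutoff in QUALITY_BUCKETS:
        if score >= cutoff:
            return name
    return "Low-Quality"

def keep_training_buckets(documents: list, quality_classifier,
                          allowed=("Medium-Quality", "Medium-High-Quality", "High-Quality")):
    """Keep only documents whose bucket is in the allowed set."""
    return [doc for doc in documents
            if assign_bucket(doc, quality_classifier) in allowed]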
Rejection Sampling
From NVIDIA: "For responses, they prompted LLMs for multiple generations and then did rejection sampling with the Llama-3.1-Nemotron-70B reward model. This ensured that the responses were of high quality."
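In code, rejection sampling reduces to "generate several candidates, keep only the best." A minimal sketch, assuming a generic model.generate / reward_model.score interface rather than the specific Nemotron models:
def rejection_sample(prompt: str, model, reward_model, n: int = 8, min_score: float = 0.0):
    """Generate n candidate responses and keep the best one if it clears min_score."""
    candidates = [model.generate(prompt, temperature=0.8) for _ in range(n)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    # Return None if even the best candidate is below the quality bar
    return candidates[best] if scores[best] >= min_score else None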
Generation Techniques
Prompt-Based Generation
The simplest synthetic data generation approach: prompt an LLM to create examples. The key insight is that LLMs have absorbed patterns from millions of instruction-response pairs during training—they understand what good training data looks like.
The quality depends entirely on your prompt. Be explicit about format, diversity requirements, and quality criteria. Generating in batches (10 examples per call) is more efficient than one-at-a-time, but increases parsing complexity.
def generate_instruction_data(seed_topics: list, model, n_per_topic: int = 10):
    """Generate instruction-following examples from seed topics."""
    examples = []
    for topic in seed_topics:
        prompt = f"""Generate {n_per_topic} diverse instruction-response pairs
about {topic}. Each should be:
- Clear and specific instruction
- Helpful, accurate response
- Varied in complexity and format
Format as JSON:
[{{"instruction": "...", "response": "..."}}]"""
        response = model.generate(prompt)
        # parse_json: helper that extracts the JSON list from the model output
        examples.extend(parse_json(response))
    return examples
Self-Instruct Pipeline
Self-Instruct, introduced by researchers at the University of Washington and popularized by Stanford's Alpaca, bootstraps a large instruction dataset from a small seed set. The process is iterative: sample existing instructions as examples, ask the model to generate diverse new ones, deduplicate, and repeat. This creates an expanding pool of instructions without manual curation.
The key is the sampling step: by showing diverse examples from your existing set, you encourage the model to generate instructions in different styles, topics, and complexity levels. Without this diversity pressure, generated instructions converge to repetitive patterns.
import random

def self_instruct(seed_instructions: list, model, iterations: int = 5):
    """Expand instruction set using the Self-Instruct methodology."""
    all_instructions = seed_instructions.copy()
    for _ in range(iterations):
        # Sample diverse seed instructions to show as in-context examples
        samples = random.sample(all_instructions, min(8, len(all_instructions)))
        prompt = f"""Here are some example instructions:
{format_examples(samples)}
Generate 10 new, diverse instructions that are different from these examples.
Cover different topics, formats, and complexity levels."""
        new_instructions = model.generate(prompt)
        # parse_instructions: helper that splits the model output into individual instructions
        all_instructions.extend(parse_instructions(new_instructions))
        # Deduplicate so repeated instructions don't dominate later sampling rounds
        all_instructions = deduplicate(all_instructions)
    return all_instructions
Evol-Instruct (WizardLM)
Evol-Instruct, from Microsoft's WizardLM paper, takes existing instructions and systematically increases their complexity. Instead of generating new instructions from scratch, you evolve simple ones into harder versions. This produces a natural difficulty gradient that improves model capabilities on complex tasks.
The evolution operations each target a different aspect of complexity:
- Deepen: Add constraints or requirements
- Concretize: Make abstract instructions specific
- Reason: Require multi-step thinking
- Complicate: Introduce edge cases
EVOLUTION_PROMPTS = {
    "deepen": "Make this instruction more complex by adding constraints or requirements: {instruction}",
    "concretize": "Make this instruction more specific with concrete details: {instruction}",
    "reason": "Rewrite to require multi-step reasoning: {instruction}",
    "complicate": "Add a complication or edge case to handle: {instruction}",
}

def evolve_instruction(instruction: str, model, evolution_type: str):
    prompt = EVOLUTION_PROMPTS[evolution_type].format(instruction=instruction)
    return model.generate(prompt)
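A single evolution step is rarely enough; in practice the operations are chained over several rounds. A minimal sketch of such a loop, where the round count and the random choice of operation are assumptions rather than the WizardLM recipe:
import random

def evolve_dataset(instructions: list, model, rounds: int = 3) -> list:
    """Evolve each instruction through several randomly chosen evolution operations."""
    evolved = []
    for instruction in instructions:
        current = instruction
        for _ in range(rounds):
            evolution_type = random.choice(list(EVOLUTION_PROMPTS))
            current = evolve_instruction(current, model, evolution_type)
        evolved.append(current)
    return evolved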
Preference Data Generation
Preference data—pairs of (chosen, rejected) responses—trains models to prefer better outputs. This is essential for RLHF and DPO. The challenge is getting preference labels without expensive human annotation.
The solution: generate multiple responses per prompt, score them with a reward model, and use the best as "chosen" and worst as "rejected." This automated pipeline can generate millions of preference pairs, but quality depends on your reward model's accuracy. A bad reward model encodes bad preferences.
def generate_preference_data(prompts: list, model, reward_model):
    """Generate chosen/rejected pairs using reward model scoring."""
    preference_data = []
    for prompt in prompts:
        # Generate multiple responses with sampling for diversity
        responses = [model.generate(prompt, temperature=0.8) for _ in range(4)]
        # Score with reward model
        scores = [reward_model.score(prompt, r) for r in responses]
        # Best as chosen, worst as rejected
        sorted_pairs = sorted(zip(scores, responses), reverse=True)
        chosen = sorted_pairs[0][1]
        rejected = sorted_pairs[-1][1]
        preference_data.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
        })
    return preference_data
Quality Filtering
Why Filtering Matters
From research: "LLM accuracy is often directly determined by the quality rather than quantity of the training data, making the step of quality filtering crucial."
Filtering Approaches
1. Reward Model Filtering:
def filter_by_reward(examples: list, reward_model, threshold: float = 0.7):
    """Keep only high-scoring examples."""
    filtered = []
    for ex in examples:
        score = reward_model.score(ex["instruction"], ex["response"])
        if score >= threshold:
            filtered.append(ex)
    return filtered
2. LLM-as-Judge:
QUALITY_PROMPT = """Rate this instruction-response pair on a scale of 1-5:
Instruction: {instruction}
Response: {response}
Criteria:
- Accuracy: Is the response factually correct?
- Helpfulness: Does it fully address the instruction?
- Clarity: Is it well-written and clear?
Respond with just the numeric score (1-5):"""

def filter_by_llm_judge(examples: list, judge_model, threshold: int = 4):
    filtered = []
    for ex in examples:
        prompt = QUALITY_PROMPT.format(**ex)
        try:
            score = int(judge_model.generate(prompt).strip())
        except ValueError:
            continue  # skip pairs where the judge output is not a clean number
        if score >= threshold:
            filtered.append(ex)
    return filtered
3. Classifier Ensembling: From NVIDIA: "The Nemotron-CC pipeline uses a combination of classifier ensembling and synthetic data rephrasing to generate high-quality synthetic data."
def ensemble_filter(examples: list, classifiers: list, min_votes: int = 2):
    """Keep examples that pass multiple quality classifiers."""
    filtered = []
    for ex in examples:
        votes = sum(1 for clf in classifiers if clf.is_quality(ex))
        if votes >= min_votes:
            filtered.append(ex)
    return filtered
Deduplication
Critical for avoiding repetition:
from datasketch import MinHash, MinHashLSH

def deduplicate_minhash(examples: list, threshold: float = 0.8):
    """Remove near-duplicates using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []
    for i, ex in enumerate(examples):
        mh = MinHash(num_perm=128)
        for word in ex["response"].split():
            mh.update(word.encode("utf8"))
        # Keep the example only if no near-duplicate is already indexed
        result = lsh.query(mh)
        if not result:
            lsh.insert(str(i), mh)
            unique.append(ex)
    return unique
Avoiding Model Collapse
The Risk
From research: "Model collapse is a risk where repeated training on synthetic data can degrade model performance, leading to 'hallucinations' or oversimplified outputs."
Mitigation Strategies
1. Mix with Real Data:
import random

def create_training_mix(synthetic: list, real: list, synthetic_ratio: float = 0.7):
    """Combine synthetic and real data at the given synthetic fraction."""
    # Size the mix so neither pool is oversampled at the requested ratio
    n_total = min(int(len(synthetic) / synthetic_ratio),
                  int(len(real) / (1 - synthetic_ratio)))
    n_synthetic = int(n_total * synthetic_ratio)
    n_real = n_total - n_synthetic
    return random.sample(synthetic, n_synthetic) + random.sample(real, n_real)
2. Use Multiple Teacher Models: Avoid single-model bias by rotating among diverse generator models (a minimal sketch follows this list).
3. Quality Over Quantity: From research: "A systematic pipeline with a critic system can help filter only high-quality examples."
4. Iterative Refinement: Generate → Filter → Train → Evaluate → Repeat
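A minimal sketch of teacher rotation (strategy 2 above), assuming each generator exposes the same generate interface used throughout this article; the round-robin scheme and the optional name attribute are illustrative:
import itertools

def generate_with_multiple_teachers(instructions: list, teacher_models: list):
    """Round-robin instructions across several teachers to avoid single-model bias."""
    teachers = itertools.cycle(teacher_models)
    examples = []
    for inst, teacher in zip(instructions, teachers):
        examples.append({
            "instruction": inst,
            "response": teacher.generate(inst),
            "teacher": getattr(teacher, "name", "unknown"),  # keep provenance for later analysis
        })
    return examples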
Warning Signs
- Decreasing diversity in outputs
- Increasing repetition
- Degraded performance on held-out benchmarks
- Outputs converging to specific patterns
Production Pipeline
End-to-End Example
A production pipeline combines all the techniques above into a coherent system. The key insight is that synthetic data generation is not just generation—it's a multi-stage process where each stage filters and refines the data. Think of it as a funnel: generate more than you need, then progressively filter to keep only the best.
This pipeline follows the NVIDIA approach: generate multiple responses per instruction (rejection sampling), score with a reward model, verify with an LLM judge, and deduplicate. The redundancy is intentional—each filter catches different quality issues.
import random
from tqdm import tqdm

class SyntheticDataPipeline:
    def __init__(self, generator_model, reward_model, judge_model):
        self.generator = generator_model
        self.reward = reward_model
        self.judge = judge_model

    def generate_sft_data(self, seed_instructions: list, target_size: int):
        # Step 1: Expand instructions (e.g. via the self-instruct loop shown earlier)
        instructions = self.expand_instructions(seed_instructions, target_size * 2)

        # Step 2: Generate responses with rejection sampling
        examples = []
        for inst in tqdm(instructions):
            # Generate multiple candidates, keep the best by reward score
            responses = [self.generator.generate(inst) for _ in range(3)]
            scores = [self.reward.score(inst, r) for r in responses]
            best_response = responses[scores.index(max(scores))]
            examples.append({"instruction": inst, "response": best_response})

        # Step 3: Quality filter (reward model + LLM judge)
        examples = self.filter_quality(examples)

        # Step 4: Deduplicate
        examples = deduplicate_minhash(examples)

        # Step 5: Final sample down to the target size
        return random.sample(examples, min(target_size, len(examples)))

    def filter_quality(self, examples: list, threshold: float = 0.75):
        filtered = []
        for ex in examples:
            # Reward model score
            reward_score = self.reward.score(ex["instruction"], ex["response"])
            if reward_score < threshold:
                continue
            # LLM judge verification (get_judge_score wraps the LLM-as-judge prompt shown earlier)
            judge_score = self.get_judge_score(ex)
            if judge_score < 4:
                continue
            filtered.append(ex)
        return filtered
Understanding the pipeline stages:
- Step 1: Expand instructions: Start with more than you need. target_size * 2 means generate double the target, assuming roughly 50% will be filtered out. Adjust this multiplier based on your filtering pass rate.
- Step 2: Rejection sampling: For each instruction, generate three responses with temperature sampling, score all of them, and keep only the best. This is expensive (3x generation cost) but dramatically improves quality.
- Step 3: Quality filter: Two-stage filtering. First, the reward model provides a fast, consistent score; then the LLM judge does a more nuanced evaluation. Items must pass both: the reward model catches obviously bad responses, the judge catches subtle issues.
- Step 4: Deduplicate: Near-duplicate removal prevents your model from memorizing repeated patterns. MinHash LSH is fast enough for large datasets.
- Step 5: Final sample: Even after filtering, you may have more than needed. Random sampling ensures diversity in the final dataset.
Why target_size * 2 in Step 1? Empirically, aggressive quality filtering removes 40-60% of generated data. Generating 2x ensures you hit your target after filtering. If your filters are stricter, increase the multiplier.
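As a quick back-of-the-envelope check (the pass rate and target below are illustrative numbers):
# If filters keep roughly pass_rate of generated examples,
# oversample by 1 / pass_rate to land on target_size after filtering.
target_size = 10_000        # example target
pass_rate = 0.5             # illustrative: about half the examples survive filtering
n_to_generate = int(target_size / pass_rate)   # 20,000, i.e. target_size * 2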
Monitoring Quality
Quality monitoring catches problems before they corrupt your training. The key metrics are diversity (are outputs varied?) and length distribution (are responses the right size?). Sudden drops in diversity or spikes in repetition indicate model collapse beginning.
Run these checks after each batch, not just at the end. Early detection lets you adjust prompts, swap generators, or stop generation before wasting compute.
import logging
import numpy as np

logger = logging.getLogger(__name__)

def monitor_generation(examples: list):
    """Track quality metrics during generation."""
    metrics = {
        "total_generated": len(examples),
        "avg_length": np.mean([len(ex["response"]) for ex in examples]),
        "unique_trigrams": count_unique_trigrams(examples),
        "diversity_score": calculate_diversity(examples),
    }
    # Alert if diversity drops below an acceptable floor
    if metrics["diversity_score"] < 0.5:
        logger.warning("Low diversity detected in synthetic data")
    return metrics
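The helpers count_unique_trigrams and calculate_diversity are left undefined above. One plausible implementation, a minimal sketch that treats diversity as the ratio of distinct word trigrams to total trigrams (the exact metric is an assumption):
def _trigrams(text: str):
    """Yield word-level trigrams from a piece of text."""
    words = text.split()
    return zip(words, words[1:], words[2:])

def count_unique_trigrams(examples: list) -> int:
    """Number of distinct word trigrams across all responses."""
    return len({tri for ex in examples for tri in _trigrams(ex["response"])})

def calculate_diversity(examples: list) -> float:
    """Distinct-trigram ratio in [0, 1]: unique trigrams / total trigrams."""
    all_trigrams = [tri for ex in examples for tri in _trigrams(ex["response"])]
    if not all_trigrams:
        return 0.0
    return len(set(all_trigrams)) / len(all_trigrams)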
Interpreting the metrics:
- diversity_score < 0.5: Indicates significant repetition. Common causes: too few seed topics, the generator defaulting to safe patterns, or a deduplication threshold that is too loose. Action: broaden seed data, increase temperature, or regenerate.
- unique_trigrams decreasing: An early sign of convergence; the generator is producing similar phrasings. Action: rotate generators, refresh prompts, or add explicit diversity requirements.
- avg_length drift: If average length suddenly changes, the generator may be truncating or padding responses. Check for prompt format issues or model temperature changes.
Tools and Frameworks
NVIDIA NeMo Curator
From NVIDIA: "NeMo Curator offers prebuilt synthetic data generation pipelines for Supervised Fine-Tuning (SFT) and preference data."
HuggingFace Datasets
from datasets import Dataset
# Save synthetic data
dataset = Dataset.from_list(synthetic_examples)
dataset.push_to_hub("my-org/synthetic-instructions")
Argilla
Open-source data curation platform:
import argilla as rg

# Log synthetic data for review
for ex in synthetic_examples:
    rg.log(
        rg.TextClassificationRecord(
            text=ex["response"],
            metadata={"instruction": ex["instruction"]}
        ),
        name="synthetic-review"
    )
Conclusion
Synthetic data generation enables training data creation at unprecedented scale:
- Use quality filtering aggressively—quantity without quality causes model collapse
- Reward models are essential for scoring and rejection sampling
- Deduplicate to maintain diversity
- Mix with real data when possible
- Monitor continuously for signs of degradation
The NVIDIA Nemotron approach demonstrates that 98% synthetic data can produce state-of-the-art models when quality is prioritized.