Synthetic Data Generation for LLM Training
How to generate high-quality synthetic training data using LLMs—from NVIDIA's Nemotron pipeline to quality filtering techniques and avoiding model collapse.
The Synthetic Data Revolution
Real-world training data is expensive, limited, and often privacy-constrained. Synthetic data generation using LLMs has emerged as a powerful alternative.
From research: "Synthetic data will become the default, not the exception, but its success is based on transparency and rigorous validation."
The results speak for themselves: "Nemotron-4 340B is an LLM that synthetically generated 98% of the data used for its supervised fine-tuning and preference fine-tuning."
Why Synthetic Data?
Benefits
From research: "Benefits include cost-effectiveness, broad coverage, and controllable diversity."
| Benefit | Description |
|---|---|
| Cost | Orders of magnitude cheaper than human annotation |
| Scale | Generate millions of examples quickly |
| Coverage | Create examples for rare edge cases |
| Privacy | No real user data needed |
| Control | Precise control over data characteristics |
Use Cases
From research: "Synthetic data is widely used to train foundation models when data is scarce, sensitive, or costly to collect."
Common applications:
- Instruction tuning for chat models
- Domain-specific fine-tuning
- Preference data for RLHF/DPO
- Multilingual expansion
- Code generation training
The NVIDIA Nemotron Pipeline
Overview
NVIDIA's pipeline represents the state of the art in synthetic data generation:
From NVIDIA: "The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline to generate synthetic data used for training and refining LLMs."
Pipeline components:
- Nemotron-4-340B-Instruct: Generates synthetic responses
- Nemotron-4-340B-Reward: Ranks and filters quality
- NeMo Curator: Data curation and deduplication
Scale of Generation
From NVIDIA: "Out of 10.6 trillion tokens, 3,534,013,958,278 tokens are synthetically generated."
The Nemotron-CC pipeline: "By employing classifier ensembling and synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data, recovering up to 90% of content lost by filtering."
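The rephrasing step can be approximated with a simple prompt. The sketch below is illustrative only, not NVIDIA's actual pipeline; the prompt wording and the model.generate interface are assumptions:
REPHRASE_PROMPT = """Rewrite the following web text as clear, well-structured prose.
Preserve every fact and technical detail; drop boilerplate, navigation text, and noise.
Text:
{passage}
Rewritten version:"""

def rephrase_low_quality(passages: list, model) -> list:
    """Turn borderline-quality passages into cleaner training text (illustrative sketch)."""
    return [model.generate(REPHRASE_PROMPT.format(passage=p)) for p in passages]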
Quality Filtering
From NVIDIA: "Robust SDG methods go beyond just generating response data—they also include verification and checks to ensure data quality remains high. LLM accuracy is often directly determined by the quality rather than quantity of the training data."
Quality buckets approach: From NVIDIA: "For Nemotron 3 Nano, they trained only on the Medium-Quality, Medium-High-Quality, and High-Quality buckets."
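A minimal sketch of bucket-based filtering, assuming a quality classifier that returns a score in [0, 1]; the cutoffs and the classifier.score interface below are illustrative assumptions, not NVIDIA's actual thresholds:
QUALITY_BUCKETS = [
    ("High-Quality", 0.9),           # illustrative cutoffs, not NVIDIA's
    ("Medium-High-Quality", 0.75),
    ("Medium-Quality", 0.5),
    ("Low-Quality", 0.0),
]

def assign_bucket(text: str, quality_classifier) -> str:
    """Map a classifier score in [0, 1] to a named quality bucket."""
    score = quality_classifier.score(text)
    for name, cutoff in QUALITY_BUCKETS:
        if score >= cutoff:
            return name
    return "Low-Quality"

def keep_training_buckets(documents: list, quality_classifier,
                          allowed=("Medium-Quality", "Medium-High-Quality", "High-Quality")):
    """Keep only documents whose bucket is in the allowed set."""
    return [doc for doc in documents
            if assign_bucket(doc, quality_classifier) in allowed]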
Rejection Sampling
From NVIDIA: "For responses, they prompted LLMs for multiple generations and then did rejection sampling with the Llama-3.1-Nemotron-70B reward model. This ensured that the responses were of high quality."
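In code, rejection sampling reduces to "generate several candidates, keep only the best." A minimal sketch, assuming a generic model.generate / reward_model.score interface rather than the specific Nemotron models:
def rejection_sample(prompt: str, model, reward_model, n: int = 8, min_score: float = 0.0):
    """Generate n candidate responses and keep the best one if it clears min_score."""
    candidates = [model.generate(prompt, temperature=0.8) for _ in range(n)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    # Return None if even the best candidate is below the quality bar
    return candidates[best] if scores[best] >= min_score else None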
Generation Techniques
Prompt-Based Generation
The simplest synthetic data generation approach: prompt an LLM to create examples. The key insight is that LLMs have absorbed patterns from millions of instruction-response pairs during training—they understand what good training data looks like.
The quality depends entirely on your prompt. Be explicit about format, diversity requirements, and quality criteria. Generating in batches (10 examples per call) is more efficient than one-at-a-time, but increases parsing complexity.
def generate_instruction_data(seed_topics: list, model, n_per_topic: int = 10):
    """Generate instruction-following examples from seed topics."""
    examples = []
    for topic in seed_topics:
        prompt = f"""Generate {n_per_topic} diverse instruction-response pairs
about {topic}. Each should be:
- Clear and specific instruction
- Helpful, accurate response
- Varied in complexity and format
Format as JSON:
[{{"instruction": "...", "response": "..."}}]"""
        response = model.generate(prompt)
        # parse_json: helper that extracts the JSON list from the model output
        examples.extend(parse_json(response))
    return examples
Self-Instruct Pipeline
Self-Instruct, introduced by researchers at the University of Washington and popularized by Stanford's Alpaca, bootstraps a large instruction dataset from a small seed set. The process is iterative: sample existing instructions as examples, ask the model to generate diverse new ones, deduplicate, and repeat. This creates an expanding pool of instructions without manual curation.
The key is the sampling step: by showing diverse examples from your existing set, you encourage the model to generate instructions in different styles, topics, and complexity levels. Without this diversity pressure, generated instructions converge to repetitive patterns.
import random

def self_instruct(seed_instructions: list, model, iterations: int = 5):
    """Expand instruction set using the Self-Instruct methodology."""
    all_instructions = seed_instructions.copy()
    for _ in range(iterations):
        # Sample diverse seed instructions to show as in-context examples
        samples = random.sample(all_instructions, min(8, len(all_instructions)))
        prompt = f"""Here are some example instructions:
{format_examples(samples)}
Generate 10 new, diverse instructions that are different from these examples.
Cover different topics, formats, and complexity levels."""
        new_instructions = model.generate(prompt)
        # parse_instructions: helper that splits the model output into individual instructions
        all_instructions.extend(parse_instructions(new_instructions))
        # Deduplicate so repeated instructions don't dominate later sampling rounds
        all_instructions = deduplicate(all_instructions)
    return all_instructions
Evol-Instruct (WizardLM)
Evol-Instruct, from Microsoft's WizardLM paper, takes existing instructions and systematically increases their complexity. Instead of generating new instructions from scratch, you evolve simple ones into harder versions. This produces a natural difficulty gradient that improves model capabilities on complex tasks.
The evolution operations each target a different aspect of complexity:
- Deepen: Add constraints or requirements
- Concretize: Make abstract instructions specific
- Reason: Require multi-step thinking
- Complicate: Introduce edge cases
EVOLUTION_PROMPTS = {
    "deepen": "Make this instruction more complex by adding constraints or requirements: {instruction}",
    "concretize": "Make this instruction more specific with concrete details: {instruction}",
    "reason": "Rewrite to require multi-step reasoning: {instruction}",
    "complicate": "Add a complication or edge case to handle: {instruction}",
}

def evolve_instruction(instruction: str, model, evolution_type: str):
    prompt = EVOLUTION_PROMPTS[evolution_type].format(instruction=instruction)
    return model.generate(prompt)
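A single evolution step is rarely enough; in practice the operations are chained over several rounds. A minimal sketch of such a loop, where the round count and the random choice of operation are assumptions rather than the WizardLM recipe:
import random

def evolve_dataset(instructions: list, model, rounds: int = 3) -> list:
    """Evolve each instruction through several randomly chosen evolution operations."""
    evolved = []
    for instruction in instructions:
        current = instruction
        for _ in range(rounds):
            evolution_type = random.choice(list(EVOLUTION_PROMPTS))
            current = evolve_instruction(current, model, evolution_type)
        evolved.append(current)
    return evolved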
Preference Data Generation
Preference data—pairs of (chosen, rejected) responses—trains models to prefer better outputs. This is essential for RLHF and DPO. The challenge is getting preference labels without expensive human annotation.
The solution: generate multiple responses per prompt, score them with a reward model, and use the best as "chosen" and worst as "rejected." This automated pipeline can generate millions of preference pairs, but quality depends on your reward model's accuracy. A bad reward model encodes bad preferences.
def generate_preference_data(prompts: list, model, reward_model):
    """Generate chosen/rejected pairs using reward model scoring."""
    preference_data = []
    for prompt in prompts:
        # Generate multiple responses with sampling for diversity
        responses = [model.generate(prompt, temperature=0.8) for _ in range(4)]
        # Score with reward model
        scores = [reward_model.score(prompt, r) for r in responses]
        # Best as chosen, worst as rejected
        sorted_pairs = sorted(zip(scores, responses), reverse=True)
        chosen = sorted_pairs[0][1]
        rejected = sorted_pairs[-1][1]
        preference_data.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
        })
    return preference_data
Quality Filtering
Why Filtering Matters
From research: "LLM accuracy is often directly determined by the quality rather than quantity of the training data, making the step of quality filtering crucial."
Filtering Approaches
1. Reward Model Filtering:
def filter_by_reward(examples: list, reward_model, threshold: float = 0.7):
    """Keep only high-scoring examples."""
    filtered = []
    for ex in examples:
        score = reward_model.score(ex["instruction"], ex["response"])
        if score >= threshold:
            filtered.append(ex)
    return filtered
2. LLM-as-Judge:
QUALITY_PROMPT = """Rate this instruction-response pair on a scale of 1-5:
Instruction: {instruction}
Response: {response}
Criteria:
- Accuracy: Is the response factually correct?
- Helpfulness: Does it fully address the instruction?
- Clarity: Is it well-written and clear?
Respond with just the numeric score (1-5):"""

def filter_by_llm_judge(examples: list, judge_model, threshold: int = 4):
    filtered = []
    for ex in examples:
        prompt = QUALITY_PROMPT.format(**ex)
        try:
            score = int(judge_model.generate(prompt).strip())
        except ValueError:
            continue  # skip pairs where the judge output is not a clean number
        if score >= threshold:
            filtered.append(ex)
    return filtered
3. Classifier Ensembling: From NVIDIA: "The Nemotron-CC pipeline uses a combination of classifier ensembling and synthetic data rephrasing to generate high-quality synthetic data."
def ensemble_filter(examples: list, classifiers: list, min_votes: int = 2):
    """Keep examples that pass multiple quality classifiers."""
    filtered = []
    for ex in examples:
        votes = sum(1 for clf in classifiers if clf.is_quality(ex))
        if votes >= min_votes:
            filtered.append(ex)
    return filtered
Deduplication
Critical for avoiding repetition:
from datasketch import MinHash, MinHashLSH

def deduplicate_minhash(examples: list, threshold: float = 0.8):
    """Remove near-duplicates using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []
    for i, ex in enumerate(examples):
        mh = MinHash(num_perm=128)
        for word in ex["response"].split():
            mh.update(word.encode("utf8"))
        # Keep the example only if no near-duplicate is already indexed
        result = lsh.query(mh)
        if not result:
            lsh.insert(str(i), mh)
            unique.append(ex)
    return unique
Avoiding Model Collapse
The Risk
From research: "Model collapse is a risk where repeated training on synthetic data can degrade model performance, leading to 'hallucinations' or oversimplified outputs."
Mitigation Strategies
1. Mix with Real Data:
import random

def create_training_mix(synthetic: list, real: list, synthetic_ratio: float = 0.7):
    """Combine synthetic and real data at the given synthetic fraction."""
    # Size the mix so neither pool is oversampled at the requested ratio
    n_total = min(int(len(synthetic) / synthetic_ratio),
                  int(len(real) / (1 - synthetic_ratio)))
    n_synthetic = int(n_total * synthetic_ratio)
    n_real = n_total - n_synthetic
    return random.sample(synthetic, n_synthetic) + random.sample(real, n_real)
2. Use Multiple Teacher Models: Avoid single-model bias by rotating among diverse generator models (a minimal sketch follows this list).
3. Quality Over Quantity: From research: "A systematic pipeline with a critic system can help filter only high-quality examples."
4. Iterative Refinement: Generate → Filter → Train → Evaluate → Repeat
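A minimal sketch of teacher rotation (strategy 2 above), assuming each generator exposes the same generate interface used throughout this article; the round-robin scheme and the optional name attribute are illustrative:
import itertools

def generate_with_multiple_teachers(instructions: list, teacher_models: list):
    """Round-robin instructions across several teachers to avoid single-model bias."""
    teachers = itertools.cycle(teacher_models)
    examples = []
    for inst, teacher in zip(instructions, teachers):
        examples.append({
            "instruction": inst,
            "response": teacher.generate(inst),
            "teacher": getattr(teacher, "name", "unknown"),  # keep provenance for later analysis
        })
    return examples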
Warning Signs
- Decreasing diversity in outputs
- Increasing repetition
- Degraded performance on held-out benchmarks
- Outputs converging to specific patterns
Production Pipeline
End-to-End Example
A production pipeline combines all the techniques above into a coherent system. The key insight is that synthetic data generation is not just generation—it's a multi-stage process where each stage filters and refines the data. Think of it as a funnel: generate more than you need, then progressively filter to keep only the best.
This pipeline follows the NVIDIA approach: generate multiple responses per instruction (rejection sampling), score with a reward model, verify with an LLM judge, and deduplicate. The redundancy is intentional—each filter catches different quality issues.
import random
from tqdm import tqdm

class SyntheticDataPipeline:
    def __init__(self, generator_model, reward_model, judge_model):
        self.generator = generator_model
        self.reward = reward_model
        self.judge = judge_model

    def generate_sft_data(self, seed_instructions: list, target_size: int):
        # Step 1: Expand instructions (e.g. via the self-instruct loop shown earlier)
        instructions = self.expand_instructions(seed_instructions, target_size * 2)

        # Step 2: Generate responses with rejection sampling
        examples = []
        for inst in tqdm(instructions):
            # Generate multiple candidates, keep the best by reward score
            responses = [self.generator.generate(inst) for _ in range(3)]
            scores = [self.reward.score(inst, r) for r in responses]
            best_response = responses[scores.index(max(scores))]
            examples.append({"instruction": inst, "response": best_response})

        # Step 3: Quality filter (reward model + LLM judge)
        examples = self.filter_quality(examples)

        # Step 4: Deduplicate
        examples = deduplicate_minhash(examples)

        # Step 5: Final sample down to the target size
        return random.sample(examples, min(target_size, len(examples)))

    def filter_quality(self, examples: list, threshold: float = 0.75):
        filtered = []
        for ex in examples:
            # Reward model score
            reward_score = self.reward.score(ex["instruction"], ex["response"])
            if reward_score < threshold:
                continue
            # LLM judge verification (get_judge_score wraps the LLM-as-judge prompt shown earlier)
            judge_score = self.get_judge_score(ex)
            if judge_score < 4:
                continue
            filtered.append(ex)
        return filtered
Understanding the pipeline stages:
- Step 1: Expand instructions: Start with more than you need. target_size * 2 means generate double the target, assuming roughly 50% will be filtered out. Adjust this multiplier based on your filtering pass rate.
- Step 2: Rejection sampling: For each instruction, generate three responses with temperature sampling, score all of them, and keep only the best. This is expensive (3x generation cost) but dramatically improves quality.
- Step 3: Quality filter: Two-stage filtering. First, the reward model provides a fast, consistent score; then the LLM judge does a more nuanced evaluation. Items must pass both: the reward model catches obviously bad responses, the judge catches subtle issues.
- Step 4: Deduplicate: Near-duplicate removal prevents your model from memorizing repeated patterns. MinHash LSH is fast enough for large datasets.
- Step 5: Final sample: Even after filtering, you may have more than needed. Random sampling ensures diversity in the final dataset.
Why target_size * 2 in Step 1? Empirically, aggressive quality filtering removes 40-60% of generated data. Generating 2x ensures you hit your target after filtering. If your filters are stricter, increase the multiplier.
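As a quick back-of-the-envelope check (the pass rate and target below are illustrative numbers):
# If filters keep roughly pass_rate of generated examples,
# oversample by 1 / pass_rate to land on target_size after filtering.
target_size = 10_000        # example target
pass_rate = 0.5             # illustrative: about half the examples survive filtering
n_to_generate = int(target_size / pass_rate)   # 20,000, i.e. target_size * 2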
Monitoring Quality
Quality monitoring catches problems before they corrupt your training. The key metrics are diversity (are outputs varied?) and length distribution (are responses the right size?). Sudden drops in diversity or spikes in repetition indicate model collapse beginning.
Run these checks after each batch, not just at the end. Early detection lets you adjust prompts, swap generators, or stop generation before wasting compute.
import logging
import numpy as np

logger = logging.getLogger(__name__)

def monitor_generation(examples: list):
    """Track quality metrics during generation."""
    metrics = {
        "total_generated": len(examples),
        "avg_length": np.mean([len(ex["response"]) for ex in examples]),
        "unique_trigrams": count_unique_trigrams(examples),
        "diversity_score": calculate_diversity(examples),
    }
    # Alert if diversity drops below an acceptable floor
    if metrics["diversity_score"] < 0.5:
        logger.warning("Low diversity detected in synthetic data")
    return metrics
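The helpers count_unique_trigrams and calculate_diversity are left undefined above. One plausible implementation, a minimal sketch that treats diversity as the ratio of distinct word trigrams to total trigrams (the exact metric is an assumption):
def _trigrams(text: str):
    """Yield word-level trigrams from a piece of text."""
    words = text.split()
    return zip(words, words[1:], words[2:])

def count_unique_trigrams(examples: list) -> int:
    """Number of distinct word trigrams across all responses."""
    return len({tri for ex in examples for tri in _trigrams(ex["response"])})

def calculate_diversity(examples: list) -> float:
    """Distinct-trigram ratio in [0, 1]: unique trigrams / total trigrams."""
    all_trigrams = [tri for ex in examples for tri in _trigrams(ex["response"])]
    if not all_trigrams:
        return 0.0
    return len(set(all_trigrams)) / len(all_trigrams)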
Interpreting the metrics:
- diversity_score < 0.5: Indicates significant repetition. Common causes: too few seed topics, the generator defaulting to safe patterns, or a deduplication threshold that is too loose. Action: broaden seed data, increase temperature, or regenerate.
- unique_trigrams decreasing: An early sign of convergence; the generator is producing similar phrasings. Action: rotate generators, refresh prompts, or add explicit diversity requirements.
- avg_length drift: If average length suddenly changes, the generator may be truncating or padding responses. Check for prompt format issues or model temperature changes.
Tools and Frameworks
NVIDIA NeMo Curator
From NVIDIA: "NeMo Curator offers prebuilt synthetic data generation pipelines for Supervised Fine-Tuning (SFT) and preference data."
HuggingFace Datasets
from datasets import Dataset
# Save synthetic data
dataset = Dataset.from_list(synthetic_examples)
dataset.push_to_hub("my-org/synthetic-instructions")
Argilla
Open-source data curation platform:
import argilla as rg

# Log synthetic data for review
for ex in synthetic_examples:
    rg.log(
        rg.TextClassificationRecord(
            text=ex["response"],
            metadata={"instruction": ex["instruction"]}
        ),
        name="synthetic-review"
    )
Conclusion
Synthetic data generation enables training data creation at unprecedented scale:
- Use quality filtering aggressively—quantity without quality causes model collapse
- Reward models are essential for scoring and rejection sampling
- Deduplicate to maintain diversity
- Mix with real data when possible
- Monitor continuously for signs of degradation
The NVIDIA Nemotron approach demonstrates that 98% synthetic data can produce state-of-the-art models when quality is prioritized.