Diffusion Models: The Complete Guide to Image and Video Generation
Comprehensive deep-dive into diffusion models for generative AI. Covers the mathematical foundations (DDPM, DDIM, score matching), architectures (U-Net, Latent Diffusion, DiT), major models (Stable Diffusion, DALL-E, Flux, Midjourney), controllability (ControlNet, LoRA, IP-Adapter), video generation (Sora, Runway, Kling), and production deployment.
Diffusion models have revolutionized generative AI, enabling the creation of photorealistic images, videos, and other media from text descriptions. From Stable Diffusion democratizing image generation to Sora producing cinematic video clips, diffusion-based systems represent the current state-of-the-art in generative modeling.
This guide provides a comprehensive understanding of diffusion models: the mathematical foundations that make them work, the architectural innovations that made them practical, the major model families and their differences, techniques for controlling and editing outputs, extensions to video generation, and production deployment considerations.
Part 1: Mathematical Foundations
Understanding diffusion models requires grasping the elegant mathematical framework underlying them. The core insight is surprisingly simple: learn to reverse a gradual noising process.
The Core Intuition
Imagine taking a photograph and gradually adding random noise to it, step by step, until it becomes pure static—indistinguishable from random noise. This is the forward diffusion process. Now imagine learning to reverse this process: given noisy static, gradually remove noise until a coherent image emerges. This is the reverse diffusion process, and it's what diffusion models learn to do.
The key insight is that while generating images from scratch is extraordinarily difficult, removing a small amount of noise from a slightly noisy image is much easier. By chaining many small denoising steps together, we can transform pure noise into coherent images.
Forward Diffusion Process
The forward process gradually adds Gaussian noise to data over T timesteps. Starting with a clean image x₀, we produce increasingly noisy versions x₁, x₂, ..., x_T until x_T is approximately pure Gaussian noise.
At each step, we add a small amount of noise according to a variance schedule β_t:
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) · x_{t-1}, β_t · I)
This says: to get x_t from x_{t-1}, scale down x_{t-1} slightly (by √(1-β_t)) and add Gaussian noise with variance β_t.
The reparameterization trick allows us to compute x_t directly from x_0 without iterating through all intermediate steps:
x_t = √(ᾱ_t) · x_0 + √(1-ᾱ_t) · ε
Where:
- ᾱ_t = ∏_{s=1}^{t} (1-β_s) is the cumulative product of (1-β) values
- ε ~ N(0, I) is standard Gaussian noise
This formula is crucial for training efficiency: we can jump directly to any noise level without sequential computation.
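As a minimal PyTorch sketch of this closed-form jump (using the linear β schedule described below; `linear_alpha_bar` and `q_sample` are illustrative names):

```python
import torch

def linear_alpha_bar(T: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    """Linear beta schedule and its cumulative product alpha_bar."""
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)             # shape [T]

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Jump straight to noise level t: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)                  # broadcast over [B, C, H, W]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps, eps
```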
Noise Schedules
The variance schedule β_t controls how quickly noise is added. The choice of schedule significantly impacts generation quality.
Linear schedule (original DDPM): β increases linearly from β_1 = 10⁻⁴ to β_T = 0.02 over T=1000 steps. Simple but adds noise too quickly in early steps.
Cosine schedule (Improved DDPM): Designed so that ᾱ_t follows a cosine curve, providing more gradual noise addition:
ᾱ_t = f(t) / f(0),   f(t) = cos²( ((t/T) + s) / (1 + s) · π/2 )
Where s is a small offset (typically 0.008) preventing ᾱ_T from reaching exactly zero.
The cosine schedule preserves more signal in early timesteps, improving generation of fine details. Most modern models use cosine or learned schedules.
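A sketch of computing the cosine ᾱ_t under the definition above (the function name is illustrative):

```python
import math
import torch

def cosine_alpha_bar(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    """alpha_bar_t = f(t)/f(0) with f(t) = cos^2(((t/T) + s) / (1 + s) * pi/2)."""
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    return (f / f[0]).clamp(min=1e-8)                    # keep alpha_bar_T strictly positive
```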
Shifted schedules for high-resolution: At higher resolutions, a given per-pixel noise level destroys less of the image's global structure, so standard schedules effectively leave too much signal at each timestep. Resolution-dependent schedule shifting moves the schedule toward higher noise levels as image size grows.
This keeps the effective signal-to-noise ratio consistent across resolutions.
Reverse Diffusion Process
The reverse process learns to denoise, transforming x_T (pure noise) back to x_0 (clean image). The true reverse distribution q(x_{t-1}|x_t) is intractable, so we train a neural network to approximate it:
p_θ(x_{t-1}|x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
The network predicts the mean μ_θ and optionally the variance Σ_θ of the reverse step distribution.
Noise prediction formulation: Rather than predicting μ directly, most implementations train the network to predict the noise ε that was added:
μ_θ(x_t, t) = (1/√(α_t)) · ( x_t − (β_t / √(1-ᾱ_t)) · ε_θ(x_t, t) ),   where α_t = 1 − β_t
The network ε_θ(x_t, t) takes a noisy image and timestep, outputting the predicted noise. This formulation works better in practice.
v-prediction formulation: An alternative predicts the "velocity" v = √(ᾱ_t)·ε - √(1-ᾱ_t)·x_0, which interpolates between noise and signal prediction. This improves training stability at high noise levels and is used in many modern models including Stable Diffusion 2.x+.
Training Objective
The training objective is remarkably simple: predict the noise that was added. The loss function is:
L = E_{x_0, ε, t} [ ||ε - ε_θ(x_t, t)||² ]
Training proceeds as:
- Sample a clean image x_0 from the dataset
- Sample a random timestep t ~ Uniform(1, T)
- Sample random noise ε ~ N(0, I)
- Compute noisy image: x_t = √(ᾱ_t)·x_0 + √(1-ᾱ_t)·ε
- Predict noise: ε̂ = ε_θ(x_t, t)
- Compute loss: L = ||ε - ε̂||²
- Update network parameters via gradient descent
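These steps map directly to a short PyTorch training function; a sketch, assuming `model(x_t, t)` is any noise-prediction network:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alpha_bar, T: int = 1000):
    """One denoising training step: sample t and eps, build x_t, regress the noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # t ~ Uniform(0, T-1)
    eps = torch.randn_like(x0)                                   # eps ~ N(0, I)
    ab = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                 # forward diffusion
    loss = F.mse_loss(model(x_t, t), eps)                        # ||eps - eps_hat||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```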
This is called the denoising score matching objective. The network learns to point toward the data distribution from any noise level.
Score Matching Perspective
An equivalent view comes from score matching. The score of a distribution is the gradient of its log-probability: ∇_x log p(x). The score points toward regions of higher probability.
Diffusion models learn the score function at each noise level:
s_θ(x_t, t) ≈ ∇_{x_t} log q(x_t)
The connection to noise prediction is:
s_θ(x_t, t) = −ε_θ(x_t, t) / √(1-ᾱ_t)
Predicting noise and predicting the score are mathematically equivalent, just scaled differently.
This score matching perspective connects diffusion models to a rich literature on energy-based models and provides theoretical grounding for why the approach works.
DDPM vs DDIM Sampling
DDPM (Denoising Diffusion Probabilistic Models) uses stochastic sampling—each reverse step adds a small amount of random noise:
x_{t-1} = μ_θ(x_t, t) + σ_t · z, where z ~ N(0, I)
This stochasticity means generating the same image twice (even from the same initial noise) produces different results. DDPM typically requires many steps (hundreds to thousands) for quality results.
DDIM (Denoising Diffusion Implicit Models) reformulates sampling as a deterministic process:
x_{t-1} = √(ᾱ_{t-1}) · x̂_0 + √(1-ᾱ_{t-1}) · ε_θ(x_t, t)
Where x̂_0 is the predicted clean image: x̂_0 = (x_t - √(1-ᾱ_t)·ε_θ(x_t,t)) / √(ᾱ_t)
DDIM advantages:
- Deterministic: Same noise → same image (useful for reproducibility)
- Fewer steps: Quality results in 20-50 steps vs 1000 for DDPM
- Interpolation: Can smoothly interpolate between images in latent space
DDIM also supports a "η" parameter controlling stochasticity: η=0 is fully deterministic, η=1 recovers DDPM.
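A sketch of one deterministic DDIM update (η = 0), again assuming an ε-prediction network `model(x_t, t)`:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t: int, t_prev: int, alpha_bar: torch.Tensor):
    """Deterministic DDIM step from timestep t down to t_prev (eta = 0)."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t_prev]
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device)
    eps = model(x_t, t_batch)
    x0_hat = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()       # predicted clean image
    return ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps  # move to the lower noise level
```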
Classifier-Free Guidance (CFG)
Classifier-free guidance dramatically improves generation quality and text adherence. The key idea: train both conditional and unconditional models, then extrapolate toward the conditional direction at inference time.
During training, randomly drop the conditioning (text prompt) some percentage of the time (typically 10-20%), replacing it with a null/empty condition. This trains the model to work both with and without conditioning.
At inference, compute both conditional and unconditional predictions, then extrapolate:
ε̃ = ε_θ(x_t, ∅) + w · ( ε_θ(x_t, c) − ε_θ(x_t, ∅) )
Where:
- ε_θ(x_t, c) is the conditional (text-guided) prediction
- ε_θ(x_t, ∅) is the unconditional prediction
- w is the guidance scale (typically 5-15)
Higher guidance scale means stronger adherence to the prompt but potentially less diversity and more artifacts. Lower scale means more diverse but potentially less prompt-faithful results.
Guidance scale effects:
- w = 1.0: Standard conditional generation (no extrapolation)
- w = 3-5: Light guidance, good diversity
- w = 7-9: Strong guidance, typical for most use cases
- w = 12-20: Very strong guidance, can cause oversaturation and artifacts
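In code, the two predictions are usually computed in one batched forward pass; a sketch, assuming the network takes (x, t, text-embedding) inputs:

```python
import torch

def guided_noise(model, x_t, t, cond_emb, uncond_emb, w: float = 7.5):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    x_in = torch.cat([x_t, x_t])                   # run both branches together
    t_in = torch.cat([t, t])
    ctx = torch.cat([uncond_emb, cond_emb])
    eps_uncond, eps_cond = model(x_in, t_in, ctx).chunk(2)
    return eps_uncond + w * (eps_cond - eps_uncond)
```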
Negative Prompts
Classifier-free guidance enables negative prompts—descriptions of what you don't want. The modified formula becomes:
ε̃ = ε_θ(x_t, c_neg) + w · ( ε_θ(x_t, c) − ε_θ(x_t, c_neg) )
The unconditional prediction is replaced with prediction conditioned on the negative prompt. The model extrapolates away from the negative direction toward the positive.
Common negative prompts include "blurry, low quality, distorted, ugly, deformed" to steer away from common failure modes.
Note on Flux: Flux models use flow matching without negative conditioning. You cannot use negative prompts with Flux—the model was trained without this capability. This simplifies generation but removes one control mechanism.
Samplers and Schedulers
Beyond DDPM and DDIM, many samplers have been developed, each with different speed-quality tradeoffs. Understanding samplers is crucial for practical use.
Euler and Euler Ancestral (Euler a):
- Euler: Simple, fast, deterministic. Good baseline sampler.
- Euler a: Adds stochasticity (ancestral sampling). More creative/varied outputs but doesn't converge—different step counts produce different images.
DPM-Solver Family: DPM-Solver++ uses higher-order ODE solvers for faster convergence:
- DPM++ 2M: Second-order multistep solver. Fast, converges well. Excellent default choice.
- DPM++ 2M Karras: Uses Karras noise schedule (smaller steps near the end). Often better quality at low step counts.
- DPM++ SDE: Stochastic variant. More varied outputs but doesn't converge.
- DPM++ SDE Karras: Stochastic with Karras schedule. Good for artistic variation.
- DPM++ 2S a: Second-order single-step ancestral. Creative outputs.
UniPC: Unified predictor-corrector framework. Fast convergence, good quality in 10-20 steps. Often matches DPM++ quality with fewer steps.
LMS (Linear Multi-Step): Classic numerical method. Stable but slower convergence than modern alternatives.
Karras Schedulers: Samplers labeled "Karras" use the noise schedule from Karras et al. The key insight: use smaller noise steps near the end of denoising, where fine details are resolved. This improves quality especially at lower step counts.
Sampler Selection Guide:
| Use Case | Recommended Sampler | Steps |
|---|---|---|
| Fast iteration/testing | DPM++ 2M, UniPC | 10-15 |
| Quality generation | DPM++ 2M Karras | 20-30 |
| Photorealistic | DPM++ SDE Karras | 25-35 |
| Reproducible results | Euler, DPM++ 2M | 20-30 |
| Creative/artistic | Euler a, DPM++ 2S a | 20-30 |
| SDXL specifically | DPM++ 2M Karras, DDIM | 25-40 |
Convergence: Convergent samplers (Euler, DPM++ 2M) produce increasingly similar images as step count increases. Non-convergent/ancestral samplers (Euler a, DPM++ SDE) produce different images at different step counts. Choose convergent samplers when reproducibility matters.
Part 2: Neural Network Architectures
The mathematical framework needs a powerful neural network to learn the denoising function. Architecture choices significantly impact generation quality, speed, and capabilities.
The U-Net Architecture
The U-Net is the original and still widely-used architecture for diffusion models. Originally developed for biomedical image segmentation, it's perfectly suited for denoising.
Structure:
Input (noisy image) → Encoder → Bottleneck → Decoder → Output (predicted noise)
                         │                      ↑
                         └── skip connections ──┘
Encoder path: Series of downsampling blocks that reduce spatial resolution while increasing channel depth. Each block typically contains:
- Convolutional layers
- Group normalization
- Activation (SiLU/Swish)
- Downsampling (strided conv or pooling)
Bottleneck: Processes the most compressed representation. Often includes attention layers for global context.
Decoder path: Mirror of encoder, upsampling back to original resolution. Each block contains:
- Upsampling (transposed conv or interpolation + conv)
- Convolution layers
- Group normalization
- Activation
Skip connections: Crucial for quality. Connect encoder blocks to corresponding decoder blocks at the same resolution. Allow fine spatial details to bypass the bottleneck.
Timestep conditioning: The network must know the current noise level. Timesteps are embedded using sinusoidal positional encodings (like transformers), then injected via:
- Addition to feature maps
- FiLM (Feature-wise Linear Modulation): scale and shift activations
- Adaptive Group Normalization: modulate normalization parameters
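A sketch of the sinusoidal timestep embedding (the embedding dimension is illustrative); the result is typically passed through a small MLP before injection:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 320) -> torch.Tensor:
    """Sinusoidal embedding of integer timesteps, as in transformer position encodings."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                    # [B, half]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # [B, dim]
```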
Attention layers: Self-attention layers in the U-Net capture long-range dependencies. Typically added at lower resolutions (16×16, 32×32) where the computational cost is manageable. Cross-attention layers enable text conditioning.
Cross-Attention for Text Conditioning
Text prompts guide generation through cross-attention layers. The process:
- Text encoding: The prompt is tokenized and processed by a text encoder (CLIP, T5, etc.), producing text embeddings of shape [sequence_length, embedding_dim]
- Cross-attention: In cross-attention layers:
  - Query (Q): derived from image features
  - Key (K) and Value (V): derived from text embeddings
  Attention(Q, K, V) = softmax(QK^T / √d_k) · V
- Information flow: Each spatial location in the image attends to all text tokens, incorporating relevant semantic information.
This mechanism allows precise spatial control—attention maps reveal which image regions attend to which words.
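A minimal PyTorch sketch of such a cross-attention layer (dimension names are illustrative):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Image tokens (queries) attend to text-encoder embeddings (keys/values)."""
    def __init__(self, img_dim: int, txt_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=img_dim, num_heads=n_heads,
                                          kdim=txt_dim, vdim=txt_dim, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: [B, H*W, img_dim], txt_tokens: [B, seq_len, txt_dim]
        out, attn_weights = self.attn(img_tokens, txt_tokens, txt_tokens)
        return out, attn_weights   # attn_weights shows which words each image region attends to
```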
Latent Diffusion Models (LDM)
Latent Diffusion Models (the architecture behind Stable Diffusion) perform diffusion in a compressed latent space rather than pixel space. This dramatically reduces computational cost.
Architecture:
Image (512×512×3) → VAE Encoder → Latent (64×64×4) → Diffusion → VAE Decoder → Image
VAE (Variational Autoencoder):
- Encoder: Compresses images to latent representations (typically 8× downsampling)
- Decoder: Reconstructs images from latents
- Trained separately on image reconstruction, then frozen during diffusion training
Benefits:
- 8×8 = 64× fewer pixels to process: 512×512 image → 64×64 latent
- Computational savings: Attention is O(n²) in the number of spatial tokens; 64× fewer tokens makes the attention layers up to 64² ≈ 4096× cheaper
- Semantic compression: Latent space captures semantic content, not just pixels
Latent channels: Typical latent spaces use 4 channels (not 3 like RGB). This provides sufficient capacity for reconstruction while being more compact than pixels.
Training: Diffusion model trains entirely in latent space:
- Encode training images to latents: z_0 = Encoder(x_0)
- Apply forward diffusion: z_t = √(ᾱ_t)·z_0 + √(1-ᾱ_t)·ε
- Train to predict noise: L = ||ε - ε_θ(z_t, t, c)||²
Inference:
- Sample initial noise: z_T ~ N(0, I)
- Iteratively denoise in latent space: z_T → z_{T-1} → ... → z_0
- Decode to pixels: x_0 = Decoder(z_0)
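In practice this whole loop is wrapped by libraries such as HuggingFace Diffusers; a minimal usage sketch (the model ID is one example checkpoint):

```python
import torch
from diffusers import StableDiffusionPipeline

# The pipeline bundles the VAE, U-Net, text encoder, and scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Internally: sample z_T, denoise in latent space, then decode with the VAE.
image = pipe("a watercolor painting of a lighthouse at dusk",
             num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("lighthouse.png")
```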
VAE Architecture Details
The VAE is crucial for latent diffusion quality. Poor VAE = blurry or artifact-prone images regardless of diffusion model quality.
Encoder architecture:
Input (H×W×3) → ConvBlocks ↓ → ConvBlocks ↓ → ConvBlocks ↓ → Conv → Mean, LogVar (H/8×W/8×4)
Multiple downsampling stages, typically using residual blocks. Final layer produces mean and log-variance for the variational posterior.
Reparameterization: z = mean + std × ε (where ε ~ N(0,1)) enables gradient flow.
Decoder architecture:
Latent (H/8×W/8×4) → ConvBlocks ↑ → ConvBlocks ↑ → ConvBlocks ↑ → Conv → Output (H×W×3)
Mirror of encoder with upsampling. Modern VAEs often use attention layers for better global coherence.
Training losses:
- Reconstruction loss: L1 or L2 between input and reconstruction
- Perceptual loss: Difference in VGG features (captures high-level structure)
- Adversarial loss: GAN discriminator for sharper outputs
- KL divergence: Regularizes latent distribution toward standard normal
KL weight: Typically small (e.g., 10⁻⁶) to prioritize reconstruction quality. Higher KL weight = smoother latent space but blurrier reconstructions.
Diffusion Transformers (DiT)
Diffusion Transformers replace the U-Net with a transformer architecture. This is the architecture behind DALL-E 3, Sora, Stable Diffusion 3, and Flux.
Why transformers?
- Scalability: Transformers scale predictably with compute (established scaling laws)
- Simplicity: Pure attention, no complex U-Net skip connections
- Flexibility: Handle variable sequence lengths naturally
- Proven: Massive investment in transformer optimization
DiT Architecture:
Latent patches → Linear embedding → [Transformer blocks] × N → Linear → Noise prediction
Patchification: Divide latent into patches (e.g., 2×2), flatten and linearly embed. A 64×64×4 latent with 2×2 patches = 1024 tokens of dimension d.
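A sketch of patchification with a strided convolution, using the example sizes above (the token dimension 768 is illustrative):

```python
import torch
import torch.nn as nn

# Turn a [B, 4, 64, 64] latent into 1024 tokens of dimension 768 via 2x2 patches.
patch_embed = nn.Conv2d(in_channels=4, out_channels=768, kernel_size=2, stride=2)

latent = torch.randn(1, 4, 64, 64)
tokens = patch_embed(latent)                 # [1, 768, 32, 32]
tokens = tokens.flatten(2).transpose(1, 2)   # [1, 1024, 768], a sequence for the transformer
print(tokens.shape)
```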
Transformer blocks: Standard transformer blocks with:
- Multi-head self-attention
- Feed-forward network (MLP)
- Layer normalization
Conditioning: Timestep and text conditioning via:
- AdaLN (Adaptive Layer Norm): Predict scale and shift parameters from conditioning
- AdaLN-Zero: Initialize conditioning to zero for stable training
- Cross-attention: For detailed text conditioning
Position embeddings:
- Learnable absolute positions
- 2D sinusoidal positions (row + column)
- RoPE (Rotary Position Embeddings) for resolution generalization
Scaling results (from DiT paper):
- DiT-XL/2 (675M params): FID 2.27 on ImageNet 256×256
- Larger models consistently improve
- Compute-optimal scaling similar to language models
Multi-Modal Diffusion Transformer (MMDiT)
MMDiT, used in Stable Diffusion 3, extends DiT for better text-image interaction:
Separate streams: Text and image have separate transformer streams that interact:
Text tokens  → [MMDiT blocks] → Text output
                    ↕ (joint attention: both streams attend to each other)
Image tokens → [MMDiT blocks] → Noise prediction
Bidirectional attention: Unlike standard cross-attention (image attends to text), MMDiT allows:
- Image tokens attend to text tokens
- Text tokens attend to image tokens
This bidirectional flow improves prompt adherence and semantic understanding.
T5 + CLIP dual encoding: SD3 uses both:
- CLIP text encoder: Good at visual concepts
- T5 encoder: Better at complex language understanding
The combined representation captures both visual-semantic alignment (CLIP) and nuanced text understanding (T5).
Rectified Flow and Flow Matching
Rectified Flow (used in Stable Diffusion 3 and Flux) reformulates diffusion as learning straight paths between noise and data:
Standard diffusion: Curved paths through data space, learned via score matching.
Rectified flow: Learn to transport mass from the noise distribution to the data distribution along straight lines:
x_t = (1 − t) · x_0 + t · ε,   with target velocity v = ε − x_0
The network learns to predict velocity v given x_t and t.
Benefits:
- Straighter paths: Fewer sampling steps needed
- Simpler training: Direct velocity prediction
- Better coupling: Noise and data are paired more sensibly
Sampling: Euler integration along learned velocity field:
x_{t-Δt} = x_t - Δt · v_θ(x_t, t)
Stable Diffusion 3 uses rectified flow, achieving good quality in 20-28 steps.
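A sketch of Euler sampling along a learned velocity field, under the x_t = (1-t)·x_0 + t·ε convention above (`v_model` is any velocity-prediction network):

```python
import torch

@torch.no_grad()
def flow_sample(v_model, shape, steps: int = 28, device: str = "cuda"):
    """Integrate dx/dt = v backwards from t=1 (pure noise) to t=0 (data) with Euler steps."""
    x = torch.randn(shape, device=device)                  # start at t = 1
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        v = v_model(x, t.expand(shape[0]))                 # predicted velocity (ε - x_0)
        x = x + (t_next - t) * v                           # Euler step toward the data
    return x
```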
Part 3: Major Model Families
The diffusion model landscape includes several distinct families, each with different architectures, training approaches, and capabilities.
Stable Diffusion Evolution
Stable Diffusion from Stability AI democratized image generation through open-source releases.
SD 1.x (2022):
- Latent diffusion with U-Net
- 860M parameters
- CLIP ViT-L/14 text encoder (OpenAI CLIP)
- Trained on LAION-5B subset
- 512×512 native resolution
- ε-prediction (noise prediction), scaled-linear noise schedule
SD 2.x (2022-2023):
- OpenCLIP ViT-H/14 (larger text encoder)
- Improved VAE
- 768×768 resolution variant
- Depth-to-image model
- Negative prompts work better
- Some controversy: filtered training data, different aesthetic
SDXL (2023):
- Significantly larger U-Net (2.6B parameters)
- Dual text encoders: CLIP ViT-L + OpenCLIP ViT-bigG
- 1024×1024 native resolution
- Two-stage generation: base model + refiner
- Micro-conditioning: crop coordinates, original size
- Better prompt following and image quality
SDXL Turbo / Lightning (2023-2024):
- Adversarial diffusion distillation
- 1-4 step generation (vs 20-50 for SDXL)
- Trades some quality for speed
- Enables real-time generation
SD 3 and SD 3.5 (2024):
- MMDiT architecture (transformer-based)
- Rectified flow training
- Triple text encoders: CLIP ViT-L + OpenCLIP ViT-bigG + T5-XXL
- Improved text rendering in images
- Better prompt adherence
- Multiple sizes: SD3 Medium (2B), SD3.5 Large (8B), SD3.5 Large Turbo
Key characteristics across versions:
- Open weights (enabling research and fine-tuning)
- Active community (LoRAs, ControlNets, extensions)
- Multiple deployment options (local, cloud, API)
DALL-E Series
DALL-E from OpenAI pioneered text-to-image generation.
DALL-E 1 (2021):
- Discrete VAE + autoregressive transformer
- Generates image tokens sequentially
- 12B parameter transformer
- Demonstrated text-to-image concept
DALL-E 2 (2022):
- CLIP image embeddings as intermediate representation
- Prior network: text → CLIP image embedding
- Decoder: CLIP embedding → image (diffusion-based)
- unCLIP approach: the CLIP image embedding bridges text and image
- 1024×1024 resolution
- Inpainting and variations
DALL-E 3 (2023):
- Complete redesign
- Diffusion Transformer architecture
- Training on highly detailed image captions (GPT-4 generated)
- Exceptional prompt following
- Native text rendering in images
- Integrated with ChatGPT for prompt rewriting
- Safety measures built into training and inference
Key innovations from DALL-E 3:
- Caption improvement: Training on detailed, accurate captions dramatically improves prompt adherence
- Prompt rewriting: ChatGPT expands and clarifies user prompts before generation
- Text rendering: Direct training on text in images enables readable text generation
Midjourney
Midjourney has become the aesthetic benchmark for AI image generation, though its architecture is proprietary.
Known characteristics:
- Likely diffusion-based
- Trained with strong aesthetic curation
- Discord-based interface (unique UX)
- Exceptional at artistic, stylized imagery
- Strong default style (the "Midjourney look")
- V1-V6 versions with increasing capability
- V6+ includes text rendering, better realism
Strengths:
- Artistic quality and coherence
- Strong default aesthetics
- Good at abstract concepts and artistic styles
- Large, engaged user community
Limitations:
- Closed source (no local deployment)
- Discord-only interface (API limited)
- Less control than open alternatives
- Monthly subscription required
Flux (Black Forest Labs)
Flux from Black Forest Labs (founded by key Stable Diffusion creators) represents the current state-of-the-art in open image generation.
Flux.1 variants (2024):
- Flux.1 [pro]: Best quality, API only, commercial use
- Flux.1 [dev]: Open weights, non-commercial license, guidance-distilled
- Flux.1 [schnell]: Open weights, Apache 2.0, 4-step generation
Flux.1.1 and Flux 2 (2024-2025):
- Flux.1.1 [pro]: Faster generation, improved quality, prompt adherence, and diversity
- Flux.1.1 [pro] Ultra: 4× higher resolution without speed penalty
- Flux.1.1 [pro] Raw: Hyper-realistic, candid-style images
- Flux 2: Major improvements in coherence, detail quality, and speed
- Flux Kontext: Context-aware editing and smarter scene understanding
Architecture:
- 12B parameter rectified flow transformer
- Dual text encoders: CLIP ViT-L + T5-XXL (T5 enables superior text understanding)
- Rotary position embeddings (resolution flexible)
- Dual-stream processing: Simultaneously analyzes global composition and local details
- Parallel attention blocks for efficiency
Key Innovations:
- Guidance distillation: Flux.1 [dev] trained to internalize CFG—no guidance scale needed at inference, no unconditional forward pass
- Flow matching training: Rectified flow for efficient, straight-path sampling
- No negative prompts: Flux wasn't trained with negative conditioning—you can't tell it what to avoid, but prompt adherence is better for what you do want
- Text rendering: T5 encoder enables excellent text generation in images
Why Flux excels:
- T5 text encoder understands complex, nuanced prompts
- Larger model (12B vs 2.6B for SDXL) captures more detail
- Flow matching produces cleaner outputs than score-based diffusion
- Training methodology emphasizes prompt fidelity
Flux vs SDXL comparison:
| Aspect | Flux | SDXL |
|---|---|---|
| Quality | Higher (photorealism, details) | Good |
| Text in images | Excellent | Poor |
| Prompt following | Superior | Good |
| Speed | Slower | Faster |
| VRAM | 16GB+ recommended | 8GB+ workable |
| ControlNet | Limited availability | Extensive ecosystem |
| LoRA ecosystem | Growing | Massive |
| Negative prompts | Not supported | Supported |
When to choose Flux: Photorealism, text rendering, complex prompts, maximum quality. When to choose SDXL: Speed, ControlNet workflows, extensive LoRA use, lower VRAM.
Imagen and Parti (Google)
Imagen:
- Text-to-image diffusion model
- T5-XXL text encoder (much larger than CLIP)
- Cascaded diffusion: 64×64 → 256×256 → 1024×1024
- Dynamic thresholding for improved CFG
- DrawBench benchmark for evaluation
Parti:
- Autoregressive approach (not diffusion)
- ViT-VQGAN for image tokenization
- 20B parameter transformer
- Demonstrated scaling improves quality
Imagen 2 / 3:
- Integrated into Google products
- Improved quality and capabilities
- Powers image generation in Gemini
Ideogram and Others
Ideogram:
- Exceptional text rendering (best in class)
- "Magic Prompt" for prompt enhancement
- Strong at typography and graphic design
- Free tier available
Playground AI:
- Playground v2.5: Open weights, aesthetic-focused
- Competitive with Midjourney on artistic images
- Commercial-friendly licensing
Leonardo AI:
- Platform with multiple fine-tuned models
- Strong game asset generation
- Built-in editing tools
- Active community model sharing
Part 4: Controllability and Editing
Raw text-to-image generation provides limited control. Advanced techniques enable precise spatial, stylistic, and semantic control.
ControlNet
ControlNet enables spatial control through conditioning images (edges, poses, depth maps, etc.).
Architecture:
Text prompt ──────────────────────────────────────────→ U-Net → Output
                                                           ↑
Control image → Copy of U-Net encoder → Zero convolutions ─┘
How it works:
- Clone the U-Net encoder weights (trainable copy)
- Process control image through the cloned encoder
- Add control features to main U-Net via zero-initialized convolutions
- Zero initialization ensures training starts from working diffusion model
Control types:
- Canny edges: Line art, outlines
- HED/PIDI soft edges: Softer edge detection
- Depth: MiDaS depth maps for spatial structure
- Normal maps: Surface orientation
- OpenPose: Human body pose keypoints
- Segmentation: Semantic region maps
- Scribbles: Rough user sketches
- Lineart: Clean line drawings
- QR codes: Generate images that are scannable QR codes
- Tile: For upscaling and detail enhancement
- IP2P: Instruction-based editing
Multi-ControlNet: Combine multiple control signals:
# Combine pose + depth control
controlnets = [pose_controlnet, depth_controlnet]
control_images = [pose_image, depth_image]
control_weights = [1.0, 0.8]
Control weights: Adjust influence per control type and per timestep.
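With the Diffusers library this maps to passing a list of ControlNets; a sketch (the Hub model IDs are examples, and `pose_image` / `depth_image` are preprocessed control images you supply):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

pose_cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
depth_cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[pose_cn, depth_cn],                      # Multi-ControlNet: a list of models
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a dancer on a rooftop at sunset",
    image=[pose_image, depth_image],                     # one control image per ControlNet
    controlnet_conditioning_scale=[1.0, 0.8],            # per-control weights
    num_inference_steps=30,
).images[0]
```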
IP-Adapter
IP-Adapter (Image Prompt Adapter) enables conditioning on reference images, transferring style or content.
Architecture:
Reference image → Image encoder (CLIP) → Projection → Cross-attention (added to existing)
Key insight: Instead of modifying the U-Net, add new cross-attention layers for image tokens alongside text cross-attention.
Variants:
- IP-Adapter: Basic image conditioning
- IP-Adapter Plus: Higher resolution CLIP features, better detail
- IP-Adapter Face: Specialized for face transfer (CLIP + face embedding)
- IP-Adapter Full Face: Even stronger face identity preservation
Use cases:
- Style transfer: Reference image defines artistic style
- Subject consistency: Same character across images
- Face transfer: Apply face identity to new contexts
- Composition reference: Use image as layout guide
Combination: IP-Adapter works with ControlNet—use IP-Adapter for style/subject, ControlNet for spatial layout.
Face Identity Preservation: InstantID, PuLID, and FaceID
Beyond general IP-Adapter, specialized methods exist for preserving facial identity—crucial for consistent character generation.
InstantID (2024): Zero-shot face identity transfer using a single reference image.
How it works:
- InsightFace extracts face embedding from reference image
- IP-Adapter injects the face embedding into cross-attention
- ControlNet uses facial landmarks (5 keypoints: eyes, nose, mouth corners) for spatial guidance
- Combined, these preserve identity while allowing pose/expression changes
Key advantages:
- Single image required (no training)
- ~82-86% facial recognition similarity to source
- Compatible with existing LoRAs and ControlNets
- Doesn't modify UNet weights
PuLID (Pure and Lightning ID, 2024): Next-generation identity preservation with higher fidelity.
Improvements over InstantID:
- More sophisticated face feature extraction
- Contrastive learning for better identity disentanglement
- Higher identity fidelity in testing
- Better detail preservation (skin texture, facial structure)
Tradeoffs:
- Higher VRAM usage
- Slower generation
- More restrictive on expression changes
IP-Adapter FaceID: Earlier approach using InsightFace embeddings with IP-Adapter.
- Variants: FaceID, FaceID Plus, FaceID Portrait
- More flexible expressions than InstantID/PuLID
- Lower identity fidelity
- Fastest of the three approaches
Comparison:
| Method | Identity Fidelity | Expression Flexibility | Speed | VRAM |
|---|---|---|---|---|
| PuLID | Highest | Most restrictive | Slowest | Highest |
| InstantID | High | Balanced | Medium | Medium |
| FaceID | Moderate | Most flexible | Fastest | Lowest |
Practical guidance:
- For maximum likeness: PuLID
- For balanced results: InstantID
- For creative flexibility: FaceID
- For production speed: FaceID or InstantID
Flux-based face methods: PuLID has been adapted for Flux models. EcomID combines InstantID and PuLID approaches. None achieve 100% face matching, but results continue improving.
T2I-Adapter
T2I-Adapter is a lightweight alternative to ControlNet for adding spatial control.
Architecture differences:
- ControlNet: Copies entire U-Net encoder (~1.4B params for SD 1.5)
- T2I-Adapter: Small adapter networks (~80M params)
How it works:
- Adapter processes control image through lightweight CNN
- Features are added to U-Net at multiple resolutions
- Much smaller memory footprint than ControlNet
Tradeoffs:
- Lighter weight, faster training
- Less precise control than ControlNet
- Fewer pre-trained adapters available
- Can be combined more easily (lower memory)
When to use: When memory is constrained, when combining many control signals, or when ControlNet precision isn't required.
LoRA Fine-Tuning
LoRA (Low-Rank Adaptation) enables efficient fine-tuning of diffusion models on custom concepts, styles, or subjects.
How LoRA works: Instead of fine-tuning all weights W, decompose the update as low-rank matrices:
W' = W + ΔW = W + BA
Where:
- W is the original weight matrix (frozen)
- B is a small matrix (d × r)
- A is a small matrix (r × d)
- r is the rank (typically 4-128, much smaller than d)
Parameter efficiency: For a 1024×1024 weight matrix:
- Full fine-tuning: 1M parameters
- LoRA rank 4: 8K parameters (125× reduction)
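A minimal sketch of a LoRA-wrapped linear layer (rank and scaling values are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B·A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                   # W stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))   # zero init: start exactly at W
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```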
LoRA for diffusion:
- Apply to cross-attention layers (most impact)
- Optionally apply to self-attention and FFN
- Train on small dataset (5-50 images typical)
- Can represent: styles, characters, objects, concepts
Training a LoRA:
# Typical LoRA training config
learning_rate: 1e-4
train_batch_size: 1
max_train_steps: 1000-3000
lora_rank: 4-32
target_modules: ["to_q", "to_k", "to_v", "to_out.0"] # Cross-attention
Composing LoRAs: Multiple LoRAs can be combined at inference:
# Load multiple LoRAs, then set per-adapter weights (Diffusers + PEFT)
pipe.load_lora_weights("style_lora", adapter_name="style")
pipe.load_lora_weights("character_lora", adapter_name="character")
pipe.set_adapters(["style", "character"], adapter_weights=[0.8, 0.6])
DreamBooth
DreamBooth fine-tunes the entire model (or LoRA) to learn specific subjects from a few images.
Key innovation: Use a rare token identifier (e.g., "sks") to represent the subject:
- Train on: "a photo of sks dog"
- Generate with: "sks dog wearing a hat"
Prior preservation: Generate images of the class ("dog") during training to prevent:
- Language drift (forgetting what "dog" means generally)
- Overfitting to training poses/contexts
Training:
- Collect 5-20 images of subject
- Caption each: "a [identifier] [class]" (e.g., "a sks dog")
- Generate prior images: "a dog", "a dog running", etc.
- Train on both subject images and prior images
- Regularization weight balances subject learning vs. prior preservation
DreamBooth + LoRA: Combine DreamBooth training with LoRA for efficient subject learning without full model fine-tuning.
Dataset Curation for Fine-Tuning
The quality of your training data determines the quality of your fine-tuned model. Poor data produces poor results regardless of training settings.
Dataset size guidelines:
- Subject/character LoRA: 5-20 high-quality images
- Style LoRA: 20-100 images representing the style
- DreamBooth: 5-15 images typically sufficient
- Full fine-tuning: Thousands to millions of images
Image quality requirements:
- Resolution: At least 1024×1024 for SDXL (512×512 for SD 1.5). Upscaled images introduce artifacts.
- Sharpness: Avoid blurry, motion-blurred, or out-of-focus images
- Format: PNG preferred for quality; JPEG acceptable
- Variety: Different angles, lighting, backgrounds, poses
Subject-specific guidance:
- Include only the subject—crop out distracting backgrounds
- Vary poses, expressions, and contexts
- Maintain consistent subject (same person, same object)
- Avoid other subjects in frame
Captioning/annotation:
- Every image needs a caption describing its content
- Use consistent trigger words: "a photo of sks person" or "in the style of xyz"
- Detailed captions improve quality: describe pose, lighting, background
- Tools: BLIP-2, Florence-2, or GPT-4V for automated captioning
- Manual review recommended for quality
Caption format examples:
# For subject learning
"a photo of sks man, wearing a blue shirt, standing outdoors, natural lighting"
"sks man smiling, close-up portrait, studio lighting, neutral background"
# For style learning
"a landscape painting in xyz style, mountains, sunset colors, impressionist brushstrokes"
Common mistakes:
- Too few images (model can't generalize)
- Too many similar images (overfits to one pose/angle)
- Inconsistent subject across images
- Poor captions or no captions
- Mixed subjects in single images
- Low resolution or heavily compressed images
Regularization/prior preservation images: For DreamBooth, generate or collect images of the general class to prevent forgetting:
- Training "sks dog" → include generic "dog" images
- Ratio typically 1:1 to 1:4 (subject : class images)
- Prevents the model from forgetting what "dog" means generally
Textual Inversion
Textual Inversion learns new "words" (token embeddings) for concepts rather than modifying model weights.
How it works:
- Add new token(s) to the vocabulary (e.g., "<new-concept>")
- Initialize the embedding randomly or from a related word
- Train only the new embedding(s), freeze everything else
- Use in prompts: "a painting of <new-concept> by Van Gogh"
Advantages:
- Very small file size (just the embedding vector)
- No model modification
- Combine unlimited embeddings
Disadvantages:
- Less expressive than LoRA (only modifies text representation)
- Harder to capture complex subjects
- May require more training images
Image Editing Techniques
Inpainting: Regenerate masked regions while preserving the rest.
How it works:
- User provides: original image + binary mask
- Encode original image to latent
- During denoising, at each step:
- Denoise the full latent
- Replace unmasked regions with original latent (at appropriate noise level)
- Continue denoising
Inpainting models: Some models are specifically trained for inpainting with the mask as additional input channel. These produce better results at mask boundaries.
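A usage sketch with a Diffusers inpainting pipeline (the model ID is one example; `init_image` and `mask_image` are PIL images you supply):

```python
import torch
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# White pixels in mask_image are regenerated; black pixels are preserved.
result = pipe(
    prompt="a red vintage armchair",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=30,
).images[0]
```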
Outpainting: Extend images beyond their original boundaries.
- Similar to inpainting but mask covers new regions
- Uncrop/extend in any direction
- Challenging: model must imagine consistent content
Img2img (Image-to-Image):
- Encode source image to latent
- Add noise to a specified level (strength parameter, e.g., 0.7)
- Denoise with text prompt guidance
- Lower strength = closer to original; higher = more creative freedom
SDEdit: Simpler img2img approach—add noise, denoise with new prompt. Works without specialized training.
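An img2img sketch with Diffusers (`source_image` is a PIL image you supply; strength controls how much noise is added to it):

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# strength=0.7: noise the source to ~70% of the schedule, then denoise toward the new prompt.
edited = pipe(
    prompt="the same scene in winter, covered in snow",
    image=source_image,
    strength=0.7,
    guidance_scale=7.5,
).images[0]
```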
InstructPix2Pix: Follow editing instructions:
- Input: image + instruction ("make it winter")
- Output: edited image
- Trained on synthetic edit triplets generated by GPT + Stable Diffusion
Prompt-to-Prompt: Control editing by manipulating cross-attention maps:
- Identify attention associated with source concept
- Replace/modify for target concept
- Enables: word swap, attention refinement, structural editing
ADetailer: Automatic Face and Hand Fixing
ADetailer (After Detailer) automatically detects and fixes common problem areas like faces and hands.
How it works:
- Generate initial image normally
- ADetailer detects faces/hands using YOLO or MediaPipe models
- Creates masks around detected regions
- Runs inpainting on each detected region at higher detail
- Composites fixed regions back into original image
Detection models:
- face_yolov8n.pt: Fast face detection (recommended default)
- face_yolov8s.pt: More accurate face detection
- hand_yolov8n.pt: Hand detection
- person_yolov8n-seg.pt: Full body detection
- mediapipe_face: Alternative face detection
Key settings:
- Detection confidence: Threshold for what counts as a face (0.3-0.5 typical)
- Mask dilation: Expand mask beyond detected region
- Inpainting denoising strength: How much to change (0.3-0.5 for faces)
- Separate prompts: Different prompt for face region
When to use:
- Small faces in group shots
- Distant subjects where faces lack detail
- Hand correction (with hand model)
- Any time faces come out distorted
Limitations:
- Adds generation time (runs inpainting per detected region)
- Can over-smooth faces if denoising strength too high
- May not detect stylized/anime faces well (use appropriate model)
Regional Prompting and Multi-Subject Composition
Generating multiple distinct subjects (e.g., "a man and a woman") often causes attribute mixing—both subjects share characteristics. Regional prompting solves this.
Regional Prompter (A1111 extension):
Divides the image into regions, each with its own prompt:
# Divide the image into left and right halves: one ADDCOL between the two region prompts
a woman with red hair, wearing blue dress
ADDCOL
a man with black hair, wearing green suit
Division modes:
- Columns (ADDCOL): Divide horizontally
- Rows (ADDROW): Divide vertically
- 2D regions: Combine rows and columns for grid
- Custom ratios: 1:2 for uneven divisions
Generation modes:
- Attention mode: Modifies cross-attention (default, usually best)
- Latent mode: Separates latent space regions (better for distinct subjects)
Latent Couple (alternative approach):
- Define arbitrary mask regions (not just rectangles)
- Paint zones where each prompt takes effect
- More flexible but more complex setup
Tips for multi-character compositions:
- Use regional prompting for basic separation
- Add ControlNet pose for precise positioning
- Use ADetailer with [SEP] to fix faces separately
- Lower LoRA weights to reduce attribute bleeding
- Consider generating characters separately and compositing
Common issues:
- Color bleeding: Reduce region overlap, use latent mode
- Style inconsistency: Add style keywords to all regions
- Boundary artifacts: Use gradient/feathered regions
SDXL Refiner Model
SDXL introduced a two-stage pipeline with a specialized refiner model for final denoising steps.
How it works:
- Base model generates image from pure noise (steps 0-80%)
- Refiner model takes over for final steps (steps 80-100%)
- Refiner specializes in high-frequency details and texture
Why a separate refiner?
- Different noise levels require different skills
- Early steps: composition, structure, major elements
- Late steps: fine details, textures, sharpening
- Specialized models can optimize for each phase
Usage patterns:
# Base denoises to 80% of the schedule, refiner finishes the remaining 20%
base_latents = base_pipe(prompt=prompt, output_type="latent", denoising_end=0.8).images
refined = refiner_pipe(prompt=prompt, image=base_latents, denoising_start=0.8).images[0]
When to use refiner:
- Photorealistic images (improves skin, hair, texture)
- Detailed scenes with fine elements
- When base output looks "soft" or lacking detail
When to skip refiner:
- Stylized/artistic images (may over-sharpen)
- Speed is priority (adds ~40% generation time)
- Using certain LoRAs that conflict with refiner
Practical tips:
- Handoff point (0.8) can be adjusted—lower = refiner does more
- Refiner uses same prompt as base
- Some LoRAs only work with base, not refiner
- Many users skip refiner and use upscalers instead
Upscaling and Super-Resolution
Generated images often need upscaling for print or high-resolution display. Several approaches exist.
AI Upscalers (trained super-resolution models):
Real-ESRGAN: Most popular general-purpose upscaler.
- 4× upscaling with good detail preservation
- Variants: RealESRGAN_x4plus (general), RealESRGAN_x4plus_anime (anime)
- Fast inference, widely supported
4x-UltraSharp: Community-trained upscaler.
- Often preferred for photorealistic content
- Better texture preservation than default ESRGAN
- Available in ComfyUI and A1111
SwinIR: Transformer-based upscaler.
- Higher quality than ESRGAN in some cases
- Slower inference
- Good for maximum quality when speed doesn't matter
Latent Upscaling (using diffusion model):
Tiled upscale with ControlNet Tile:
- Upscale image with basic interpolation (bilinear/lanczos)
- Process through diffusion with ControlNet Tile
- Tile model adds detail consistent with original
- Best quality but slowest
Ultimate SD Upscale: A1111 extension combining:
- AI upscaler for initial upscale
- Tiled img2img for detail enhancement
- Seam fixing between tiles
Choosing an approach:
| Method | Quality | Speed | Use Case |
|---|---|---|---|
| Real-ESRGAN | Good | Fast | Quick upscaling, general use |
| 4x-UltraSharp | Better | Fast | Photorealistic content |
| Tiled + ControlNet | Best | Slow | Maximum quality, print |
| SDXL Refiner | Good | Medium | Built-in detail enhancement |
Upscaling workflow:
- Generate at native resolution (1024×1024 for SDXL)
- Select best generation
- Upscale 2-4× with AI upscaler
- Optionally: tiled img2img for added detail
- Final result: 2048×2048 to 4096×4096
Part 5: Video Generation
Extending diffusion models to video generation introduces temporal consistency challenges but enables remarkable capabilities.
Temporal Consistency Challenge
Video generation must solve:
- Frame-to-frame consistency: Objects shouldn't flicker or change appearance
- Motion coherence: Movement should be smooth and physically plausible
- Temporal logic: Events should follow logical sequences
- Longer context: Videos have many more frames than a single image
Video Diffusion Architectures
Approaches to temporal modeling:
3D U-Net: Extend 2D convolutions to 3D (height × width × time):
2D Conv: [H, W, C] → [H, W, C']
3D Conv: [T, H, W, C] → [T, H, W, C']
Temporal attention: Interleave spatial attention with temporal attention:
Spatial attention: Each frame attends to itself
Temporal attention: Each position attends across time
Factorized attention: Process spatial and temporal dimensions separately for efficiency:
Input [T, H, W, C]
→ Reshape to [T, H×W, C], spatial attention within each frame
→ Reshape to [H×W, T, C], temporal attention across frames
Sora (OpenAI)
Sora represents the frontier of video generation (as of late 2024).
Architecture insights (from technical report):
- Spacetime patches: Video treated as 3D patches (like ViT but spatiotemporal)
- DiT backbone: Diffusion Transformer, scaled up significantly
- Variable resolution/duration: Trains on native aspect ratios and lengths
- Emergent capabilities: 3D consistency, long-range coherence, physics simulation
Capabilities:
- Up to 60 seconds of video
- 1080p resolution
- Text-to-video generation
- Image-to-video (animate images)
- Video-to-video (style transfer, editing)
- Video extension (forward and backward)
- Multiple shots and camera movements
Spacetime latent patches:
Video [T, H, W, 3] → VAE → Latents [T', H', W', C] → Patches [N, D]
Videos are compressed spatiotemporally before patchification, making long videos tractable.
Training scale:
- Trained on massive video dataset
- Native resolution training (no fixed size)
- Long-context training (minutes of video)
- Emergent physics understanding (not explicitly programmed)
Veo (Google)
Veo 2 (2024) from Google DeepMind:
- 4K resolution support
- Up to 2 minutes of video
- Native audio generation (introduced with Veo 3)
- Strong prompt following
- Physical realism and lighting
Technical approach:
- Flow matching / rectified flow base
- Cascaded generation for resolution
- Audio-video joint modeling
Runway Gen-3 and Gen-4
Runway pioneered commercial video generation:
Gen-1 (2023): Video-to-video style transfer
Gen-2 (2023): Text-to-video, image-to-video
Gen-3 Alpha (2024):
- 10 seconds of video
- Strong motion control
- Better temporal consistency
- "Motion Brush" for local motion control
Gen-4 (anticipated):
- Longer duration
- Higher resolution
- Better coherence
- Extended editing capabilities
Kling (Kuaishou)
Kling from Chinese company Kuaishou:
- Up to 5 minutes of video generation
- Strong physical simulation
- Character consistency across long videos
- Motion interpolation capabilities
Pika Labs
Pika 1.0:
- 3-second clips
- Good motion quality
- Lip sync capabilities
- "Modify region" editing
Video Editing and Control
Video ControlNet: Extend image ControlNet to video:
- Apply control to each frame
- Temporal smoothing of control signals
- Types: pose sequences, depth video, edge video
Motion transfer: Extract motion from source video, apply to different content.
Character consistency: Maintain character appearance across scenes:
- IP-Adapter for video
- Training on character-specific data
- Reference frame attention
Animating images:
- Single image → video (Stable Video Diffusion, Runway)
- Specify camera motion (zoom, pan, rotate)
- I2V (Image-to-Video) models
Stable Video Diffusion (SVD)
SVD from Stability AI:
- Open weights video generation
- Image-to-video (animate single image)
- 14-25 frames at ~576×1024
- Fine-tunable for specific domains
Architecture:
- Based on SD 2.1 image model
- Added temporal attention layers
- Fine-tuned on video datasets
- Motion bucket conditioning (control motion amount)
Part 6: Production Deployment
Deploying diffusion models in production requires optimizing for speed, cost, scale, and reliability.
Inference Optimization
Fewer sampling steps:
- DDIM: 20-50 steps (vs 1000 for DDPM)
- DPM-Solver++: 15-25 steps with good quality
- UniPC: 10-20 steps
- LCM (Latent Consistency Models): 4-8 steps
- Turbo/Lightning distillation: 1-4 steps
Step reduction through distillation:
Consistency distillation: Train student to map any point on trajectory directly to clean image.
Progressive distillation: Iteratively halve the number of steps:
- Train 2-step model to match 4-step teacher
- Train 1-step model to match 2-step teacher
Adversarial distillation: Use discriminator to ensure single-step outputs are realistic.
Classifier-Free Guidance distillation: Bake guidance into the model:
- Train on guided outputs
- Inference without computing unconditional prediction
- 2× speedup (one forward pass instead of two)
Memory Optimization
Attention optimization:
- xFormers memory-efficient attention
- Flash Attention v2
- Attention slicing (process attention in chunks)
VAE optimization:
- Tiled VAE decoding (process latent in tiles)
- VAE slicing (batch dimension)
- FP16/BF16 VAE
Model offloading:
- Sequential CPU offload: Move unused components to CPU
- Model CPU offload: Keep model on CPU, copy layers to GPU as needed
- Disk offload: For extremely limited VRAM
Precision reduction:
- FP16: Standard for inference, ~2× memory reduction
- BF16: Better dynamic range, preferred for training
- INT8/INT4 quantization: Further reduction with some quality loss
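A sketch combining several of these levers in Diffusers (method availability depends on the installed version and backends):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16   # FP16 weights
)
pipe.enable_model_cpu_offload()    # keep idle components (VAE, text encoders) off the GPU
pipe.enable_vae_tiling()           # decode large latents in tiles
pipe.enable_attention_slicing()    # chunked attention for lower peak memory

image = pipe("a macro photo of a dragonfly wing", num_inference_steps=30).images[0]
```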
Quantization for Diffusion Models
Weight quantization: Reduce model size and memory:
NF4 (4-bit NormalFloat): Information-theoretically optimal for normal distributions:
12B parameter model:
- FP16: 24GB VRAM
- NF4: 6GB VRAM
GGUF quantization (for llama.cpp-style inference):
- Q4_K_M, Q5_K_M, Q8 variants
- Tradeoff between size and quality
Activation quantization: Harder than weights due to outliers:
- SmoothQuant: Migrate quantization difficulty from activations to weights
- Per-tensor vs per-channel scaling
Deployment Platforms
Local deployment:
ComfyUI:
- Node-based visual workflow builder
- Maximum flexibility and customization
- Efficient inference with smart caching
- Large extension ecosystem
Automatic1111 / Forge:
- User-friendly web interface
- Extension system for additional features
- Forge: Performance-optimized fork
Diffusers (HuggingFace):
- Python library for programmatic use
- Production-ready inference
- Easy model swapping
Cloud deployment:
Replicate:
- Pay-per-prediction model
- Pre-built models ready to use
- Custom model deployment via Cog
fal.ai:
- Fast inference focus
- Flux, SDXL, SD3 hosted
- Queue-based API
Modal:
- Serverless GPU compute
- Custom container deployment
- Scale to zero when idle
Baseten / Banana:
- ML model serving platforms
- Auto-scaling
- Custom model support
Scaling Considerations
Batching: Process multiple images simultaneously:
# Batched generation
images = pipe(
prompt=["cat", "dog", "bird"],
num_images_per_prompt=1,
).images
Throughput vs latency: Trade-offs in production:
- Larger batches = higher throughput, higher latency
- Smaller batches = lower latency, lower throughput
- Choose based on use case (interactive vs batch)
Queue management:
- Request queuing for load handling
- Priority queues for different tiers
- Timeout handling for long generations
Caching:
- Text encoder outputs (same prompt = same embeddings)
- VAE encoder outputs (for img2img with same source)
- ControlNet preprocessor outputs
Safety and Content Moderation
Safety classifiers:
- Input prompt classification
- Output image classification
- NSFW detection models
Watermarking:
- Invisible watermarks in generated images
- Provenance tracking
- C2PA standard for content credentials
Prompt filtering:
- Blocklist-based filtering
- Classifier-based filtering
- Human review for edge cases
Rate limiting:
- Per-user limits
- Cost-based limits
- Abuse detection
Cost Optimization
Compute costs: GPU time is the primary cost driver:
- A100 40GB: ~$1-2/hour (cloud)
- 4090: ~$0.40/hour (consumer cloud)
- Steps reduction: Biggest cost lever
Cost per image (approximate, cloud pricing):
- SD 1.5, 20 steps: $0.01-0.02
- SDXL, 30 steps: $0.02-0.05
- Flux [dev], 30 steps: $0.05-0.10
- DALL-E 3 API: $0.04-0.12
Optimization strategies:
- Step reduction via distillation
- Batch processing during low-demand periods
- Caching repeated operations
- Model quantization
- Appropriate model selection (don't use Flux for simple tasks)
Part 7: Advanced Topics and Future Directions
Consistency Models
Consistency Models enable single-step generation by learning to map any point on the diffusion trajectory directly to the final output.
Key insight: If we could directly predict x₀ from any x_t, we wouldn't need iterative denoising.
Training approaches:
- Consistency distillation: Train from pre-trained diffusion model
- Consistency training: Train from scratch
Results: 1-2 step generation with quality approaching multi-step diffusion.
Latent Consistency Models (LCM): Apply consistency training in latent space:
- 4-8 step generation
- Compatible with LoRAs
- LCM-LoRA: Add consistency property via LoRA
Rectified Flow and Flow Matching
Beyond diffusion, flow matching learns optimal transport paths:
Optimal transport: Find minimum-cost mapping from noise to data distribution.
Straight paths: Flow matching can learn straighter trajectories:
- Fewer discretization errors
- Fewer steps needed
- Better interpolation properties
Reflow: Iteratively straighten learned flows for even faster sampling.
3D Generation
Extending diffusion to 3D content:
Multi-view generation: Generate consistent images from multiple viewpoints:
- Zero-1-to-3: Image → rotated views
- MVDream: Text → multi-view images
- SyncDreamer: Synchronized multi-view generation
3D reconstruction: Lift 2D generations to 3D:
- NeRF fitting to generated views
- 3D Gaussian Splatting from diffusion outputs
- Score Distillation Sampling (DreamFusion)
Native 3D diffusion:
- Point cloud diffusion
- Mesh diffusion
- Triplane representations
Audio Generation
Diffusion for audio:
AudioLDM / AudioLDM 2: Text-to-audio via latent diffusion:
- VAE for audio spectrograms
- Diffusion in latent space
- CLAP (audio-text alignment) for conditioning
Stable Audio: Stability AI's audio generation:
- Music and sound effects
- Timing control
- Long-form generation
Riffusion: Fine-tuned image diffusion on spectrograms—generate music as images.
Architectural Innovations
Hourglass DiT: Hierarchical DiT with different resolutions:
- Process low-res globally
- Upsample for local details
- Efficiency improvements
Streaming generation: Generate images progressively:
- Early exit for preview
- Refinement on demand
- Better user experience
Mixture of Experts for diffusion: Route different timesteps or regions to specialized experts.
Prompt Engineering for Diffusion Models
Effective prompting is crucial for quality results. Unlike LLMs, diffusion model prompts benefit from specific techniques.
Prompt structure (general pattern):
[subject] [action/pose] [environment] [lighting] [style] [quality modifiers]
Example:
"A young woman with red hair reading a book in a cozy library,
warm afternoon sunlight through tall windows,
oil painting style, highly detailed, 8k resolution"
Subject description:
- Be specific: "golden retriever puppy" not just "dog"
- Include relevant details: age, clothing, expression, pose
- Describe from general to specific
Environment/setting:
- Location: "ancient forest", "modern kitchen", "busy Tokyo street"
- Time: "sunset", "blue hour", "overcast day"
- Weather: "foggy morning", "rain-soaked streets"
Lighting (crucial for photorealism):
- "soft diffused lighting", "dramatic rim lighting"
- "golden hour sunlight", "neon city lights"
- "studio lighting with softbox", "natural window light"
Style keywords:
- Art styles: "impressionist", "art nouveau", "cyberpunk", "studio ghibli"
- Photography: "35mm film", "portrait lens", "wide angle", "macro"
- Rendering: "unreal engine", "octane render", "ray tracing"
Quality boosters (use sparingly):
- "highly detailed", "8k", "masterpiece", "best quality"
- "sharp focus", "intricate details", "professional photography"
- Effect varies by model—test which work for your model
CLIP token limit: Most SD models use CLIP with 77 token limit. Longer prompts get truncated. SDXL and Flux handle longer prompts better via T5.
Prompt weighting (A1111/ComfyUI syntax):
(important subject:1.3)    # Increase weight
(unwanted element:0.7)     # Decrease weight
[prompt A:prompt B:10]     # Prompt editing: switch from A to B at step 10
[prompt A|prompt B]        # Alternate between the prompts every step
Model-specific tips:
SDXL:
- Benefits from longer, more detailed prompts
- Use micro-conditioning for aspect ratio hints
- Negative prompts very effective
Flux:
- No negative prompts available
- Responds well to natural language
- T5 understands complex descriptions
- Less need for "quality" keywords
Midjourney:
- Briefer prompts often work better
- Style references via --sref
- Aspect ratio via --ar 16:9
Evaluation Metrics
Understanding how diffusion models are evaluated helps interpret benchmarks and assess model quality.
Fréchet Inception Distance (FID): The standard metric for image generation quality.
How it works:
- Extract features from real images using Inception-v3 network
- Extract features from generated images
- Compute Fréchet distance between the two feature distributions
Interpretation:
- Lower FID = better (distributions more similar)
- FID ~1-5: Excellent quality
- FID ~10-20: Good quality
- FID ~50+: Noticeable quality issues
Limitations:
- Assumes Gaussian feature distributions
- Sensitive to sample size (need ~10K+ images)
- Doesn't capture prompt adherence
- Can be gamed by mode collapse
CLIP Score: Measures text-image alignment (prompt adherence).
How it works:
- Encode prompt with CLIP text encoder
- Encode generated image with CLIP image encoder
- Compute cosine similarity between embeddings
Interpretation:
- Higher = better prompt adherence
- Measures semantic alignment, not pixel quality
- Correlates well with human judgment of prompt following
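A sketch of computing a per-image CLIP score with the transformers library (the model ID is one common choice):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the CLIP image embedding and text embedding."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```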
Inception Score (IS): Measures quality and diversity.
How it works:
- Quality: Classifier should be confident about generated images
- Diversity: Overall class distribution should be uniform
Limitations: Less used now; FID generally preferred.
CMMD (CLIP Maximum Mean Discrepancy): Improved alternative to FID using CLIP embeddings instead of Inception.
- Uses richer CLIP features
- No Gaussian assumption
- Unbiased estimator
Human Preference Metrics:
- Elo ratings: From head-to-head comparisons (used in leaderboards)
- Win rate: Percentage of comparisons won against baseline
- Aesthetic scores: Trained predictors of human aesthetic preference
Practical benchmarks:
- DrawBench: Text-to-image prompt adherence
- PartiPrompts: Complex compositional prompts
- COCO-30K: Standard image generation benchmark
- GenEval: Compositional generation evaluation
Seeds and Reproducibility
Understanding seeds is crucial for reproducible and controllable generation.
What is a seed? The seed initializes the random number generator that creates the initial noise. Same seed + same settings = same initial noise = same image.
Seed behavior:
import torch

# Same seed, same image (Diffusers pipelines take a torch.Generator rather than a seed kwarg)
pipe(prompt="a cat", generator=torch.Generator().manual_seed(42))  # Image A
pipe(prompt="a cat", generator=torch.Generator().manual_seed(42))  # Image A (identical)
# Different seed, different image
pipe(prompt="a cat", generator=torch.Generator().manual_seed(43))  # Image B (different)
When seeds produce identical results:
- Same model/checkpoint
- Same prompt (including negative)
- Same sampler and step count
- Same CFG scale
- Same resolution
- Same seed
- Deterministic sampler (not ancestral)
When seeds produce different results (even with same seed):
- Ancestral samplers (Euler a, DPM++ SDE) add randomness at each step
- Different hardware (floating-point variations)
- Different software versions
- Different batch sizes (affects batched operations)
Practical uses of seeds:
- Reproducibility: Share exact generation settings
- Iteration: Find good seed, then refine prompt
- Variations: Small seed changes for similar-but-different images
- Debugging: Isolate what changed between generations
Seed exploration:
- Generate batch with random seeds
- Find promising composition
- Lock seed, iterate on prompt/settings
- Use X/Y/Z plot to explore parameter space
-1 seed: Convention for "random seed"—system generates new random seed each time.
Research Frontiers
Better understanding:
- Theoretical foundations still developing
- Connection to optimal transport
- Convergence guarantees
Scaling laws:
- How does quality scale with compute?
- Optimal model size vs training data
- DiT scaling appears similar to LLMs
Efficiency:
- Approaching real-time high-resolution generation
- Mobile deployment
- Edge inference
Control:
- More precise spatial control
- Better preservation of reference styles
- Composition of multiple concepts
Evaluation directions:
- Beyond FID: human preference modeling
- Better diversity metrics
- Measuring prompt adherence at scale
- Benchmark standardization