
Text Generation & Decoding Strategies: A Complete Guide

A comprehensive guide to how LLMs actually generate text—from greedy decoding to beam search, temperature scaling, nucleus sampling, speculative decoding, and structured generation. Master the techniques that control LLM output quality, creativity, and speed.

16 min read

The Generation Problem

You've trained a transformer that outputs probability distributions over tokens. Now what? How do you actually convert those probabilities into coherent text?

2025: Speculative decoding goes mainstream. Google now uses speculative decoding in AI Overviews and other products, achieving remarkable speedups while maintaining output quality. NVIDIA reports 2-3x inference speedups by running draft and target models in parallel. IBM deployed speculators in production with 2x speedup on Llama models and 3x on code models.

Advanced 2025 methods: EAGLE-3 uses a lightweight autoregressive head attached to the target model's internal layers, eliminating the need for a separate draft model. SWIFT (ICLR 2025) achieves 1.3-1.6x speedup with layer-skipping—a plug-and-play solution requiring no auxiliary models or additional training.

This is the decoding problem: given a model that predicts P(next_token | context), how do we select tokens to generate a complete response?

The answer isn't obvious, and getting it right has profound implications for the quality of LLM outputs. The same underlying model can produce wildly different text depending on how you convert its probability distributions into actual tokens. A model that seems creative and engaging with one decoding strategy might appear repetitive and boring with another. A model that produces coherent, factual responses with one approach might generate nonsensical gibberish with another.

Understanding decoding strategies is essential for anyone working with LLMs because these strategies are the bridge between the model's learned knowledge (encoded in probability distributions) and the actual text users see. When you adjust the "temperature" slider in ChatGPT or choose between "creative" and "precise" modes, you're changing decoding parameters. When a coding assistant produces deterministic, reproducible completions versus varied suggestions, that's decoding strategy at work. When a story generator creates surprising plot twists versus predictable narratives, decoding choices drive that difference.

The fundamental tension in decoding is between determinism and diversity. Should we always pick the most likely token (deterministic but potentially boring)? Should we sample randomly from the full distribution (diverse but potentially nonsensical)? Or should we find some middle ground that balances coherence with creativity?

Different tasks demand different answers:

  • Math and logic: Want determinism—there's one correct answer, don't introduce randomness
  • Creative writing: Want diversity—predictable stories are boring
  • Code generation: Want mostly deterministic with occasional exploration of valid alternatives
  • Conversation: Want a balance—coherent and natural but not robotic

This post covers every major decoding strategy, explaining not just how each works mechanically but why it produces the outputs it does and when each is appropriate. By the end, you'll understand why LLMs sometimes repeat themselves, why "temperature" affects creativity, why the same prompt can produce different responses, and how to choose parameters for your specific use case.

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE GENERATION PROBLEM                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INPUT: "The capital of France is"                                      │
│                                                                          │
│  MODEL OUTPUT (probability distribution over vocabulary):               │
│                                                                          │
│  Token        Probability                                               │
│  ──────────────────────────                                             │
│  "Paris"      0.85   ████████████████████████████████████               │
│  "Lyon"       0.03   ██                                                 │
│  "the"        0.02   █                                                  │
│  "a"          0.02   █                                                  │
│  "known"      0.01   █                                                  │
│  "one"        0.01   █                                                  │
│  ...          ...                                                       │
│  (32000 tokens total, sum to 1.0)                                      │
│                                                                          │
│  QUESTION: Which token do we select?                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OPTION 1: GREEDY                                                        │
│  Always pick highest probability → "Paris"                             │
│  ✓ Deterministic  ✓ Fast                                               │
│  ✗ Can be boring  ✗ Can get stuck in loops                            │
│                                                                          │
│  OPTION 2: SAMPLE                                                        │
│  Sample from distribution → might get "Lyon" (3% chance)               │
│  ✓ Creative  ✓ Diverse outputs                                         │
│  ✗ Can be nonsensical  ✗ Non-deterministic                            │
│                                                                          │
│  OPTION 3: SOMETHING IN BETWEEN                                          │
│  Temperature, top-k, top-p → balanced approaches                       │
│  Most production systems use carefully tuned combinations              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The following sections work through these strategies from the simplest to the most sophisticated, with practical guidance on when to use each.


Part I: Autoregressive Generation Basics

How LLMs Generate Text

Before diving into decoding strategies, we need to understand the fundamental process by which large language models generate text. This understanding is crucial because every decoding strategy operates within this same basic framework—they all differ only in how they select the next token from a probability distribution.

LLMs generate text autoregressively, meaning they produce one token at a time, each new token conditioned on all previous tokens. This sequential, token-by-token process is very different from how humans might think about writing. We often conceptualize entire sentences or paragraphs before putting them down; LLMs literally construct text one piece at a time, with each piece influencing what comes next.

The autoregressive process works as follows: Given an input prompt like "The capital of France is", the model processes all these tokens and outputs a probability distribution over its entire vocabulary (often 32,000 to 100,000+ tokens). This distribution represents the model's beliefs about what token should come next—perhaps P("Paris") = 0.85, P("Lyon") = 0.03, P("the") = 0.02, and so on across every token in the vocabulary.

A decoding strategy then selects one token from this distribution. Let's say "Paris" is selected. Now the model processes the extended sequence "The capital of France is Paris" and outputs a new probability distribution for what comes after "Paris". Perhaps P(".") = 0.60, P(",") = 0.15, P("and") = 0.05. Again, a token is selected—say "."—and the process continues.

This loop repeats until either a special end-of-sequence token is generated, a maximum length is reached, or some other stopping condition is met. The entire generated response is built up token by token through this iterative process.

A critical insight is that the model itself doesn't "decide" what to generate—it only produces probability distributions. The decoding strategy is what converts those distributions into actual tokens. This separation is powerful: the same model can behave very differently depending on how tokens are selected. A greedy strategy that always picks the highest-probability token will produce deterministic, focused output. A sampling strategy that randomly selects from the distribution will produce varied, creative output. The model's underlying knowledge is the same; only the selection mechanism changes.

This also explains why LLM outputs are often non-deterministic: if the decoding strategy involves any randomness (sampling), the same prompt can produce different responses each time. Deterministic strategies like greedy decoding always produce the same output for the same input, which is valuable for reproducibility but limits diversity.

┌─────────────────────────────────────────────────────────────────────────┐
│                    AUTOREGRESSIVE GENERATION                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROCESS:                                                            │
│  ────────────                                                            │
│  Generate one token at a time, feeding each token back as input.       │
│                                                                          │
│  Step 1: "The capital of France is" → model → P(next)                  │
│          Select token: "Paris"                                         │
│                                                                          │
│  Step 2: "The capital of France is Paris" → model → P(next)            │
│          Select token: "."                                             │
│                                                                          │
│  Step 3: "The capital of France is Paris." → model → P(next)           │
│          Select token: <EOS> (end of sequence)                         │
│                                                                          │
│  DONE!                                                                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE GENERATION LOOP:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  def generate(prompt, max_tokens, decode_fn):                          │
│      tokens = tokenize(prompt)                                          │
│                                                                          │
│      for _ in range(max_tokens):                                        │
│          # Get probability distribution over next token                │
│          logits = model(tokens)[-1]  # Last position only              │
│          probs = softmax(logits)                                        │
│                                                                          │
│          # DECODE: Select next token (this is what varies!)            │
│          next_token = decode_fn(probs)                                  │
│                                                                          │
│          # Append and continue                                          │
│          tokens.append(next_token)                                      │
│                                                                          │
│          # Stop if end token                                            │
│          if next_token == EOS:                                          │
│              break                                                      │
│                                                                          │
│      return detokenize(tokens)                                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY INSIGHT:                                                            │
│  ────────────                                                            │
│  The model itself doesn't "generate"—it only predicts distributions.   │
│  The DECODING STRATEGY decides what to do with those distributions.    │
│                                                                          │
│  Same model + different decoding = very different outputs!             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
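To make the loop concrete, here is a minimal runnable version of the pseudocode above. It assumes PyTorch and Hugging Face transformers, with GPT-2 standing in for the model purely as an illustration; any causal language model would work the same way. The decoding strategy is injected as decode_fn, exactly as in the diagram (greedy argmax by default).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def generate(prompt, max_tokens=30, decode_fn=lambda probs: torch.argmax(probs)):
    # Encode the prompt into token ids: shape (1, seq_len)
    tokens = tokenizer(prompt, return_tensors="pt").input_ids

    for _ in range(max_tokens):
        with torch.no_grad():
            logits = model(tokens).logits[0, -1]   # logits for the next token only
        probs = torch.softmax(logits, dim=-1)

        next_token = decode_fn(probs)              # the decoding strategy lives here
        tokens = torch.cat([tokens, next_token.view(1, 1)], dim=1)

        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(tokens[0])

print(generate("The capital of France is"))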

The Logits to Probabilities Pipeline

Understanding the pipeline from model output to token selection:

┌─────────────────────────────────────────────────────────────────────────┐
│                    LOGITS → PROBABILITIES → TOKEN                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STEP 1: MODEL OUTPUTS LOGITS                                           │
│  ─────────────────────────────                                           │
│  Raw, unnormalized scores for each vocabulary token.                   │
│                                                                          │
│  logits = [2.5, -1.2, 0.3, 4.1, -0.8, ...]  (vocab_size values)       │
│            "the" "cat" "dog" "Paris" "is"                              │
│                                                                          │
│  Logits can be any real number (positive or negative).                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STEP 2: APPLY TEMPERATURE (optional)                                   │
│  ─────────────────────────────────────                                   │
│  Temperature controls the "sharpness" of the distribution.             │
│                                                                          │
│  scaled_logits = logits / temperature                                  │
│                                                                          │
│  • temperature = 1.0: No change                                        │
│  • temperature < 1.0: Sharper (more confident)                         │
│  • temperature > 1.0: Flatter (more random)                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STEP 3: APPLY SOFTMAX                                                  │
│  ─────────────────────                                                   │
│  Convert logits to probabilities that sum to 1.                        │
│                                                                          │
│  probs[i] = exp(logits[i]) / Σⱼ exp(logits[j])                        │
│                                                                          │
│  Example:                                                               │
│  logits = [2.5, -1.2, 0.3, 4.1, -0.8]                                 │
│  probs  = [0.16, 0.004, 0.018, 0.81, 0.006]                           │
│                                                                          │
│  "Paris" has highest logit (4.1) → highest probability (0.91)         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STEP 4: FILTERING (optional)                                           │
│  ───────────────────────────                                             │
│  Remove low-probability tokens (top-k, top-p, min-p).                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STEP 5: SELECT TOKEN                                                   │
│  ────────────────────                                                    │
│  • Greedy: argmax(probs)                                               │
│  • Sampling: sample from probs                                         │
│  • Beam: maintain multiple candidates                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
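As a quick sanity check, the short PyTorch snippet below pushes the example logits through temperature scaling and softmax at three temperatures. The token labels are just for illustration; only the numbers matter.

import torch

tokens = ["the", "cat", "dog", "Paris", "is"]
logits = torch.tensor([2.5, -1.2, 0.3, 4.1, -0.8])

for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    top = tokens[int(torch.argmax(probs))]
    rounded = [round(p, 3) for p in probs.tolist()]
    print(f"T={temperature}: probs={rounded} -> argmax '{top}'")

At T=1.0 this prints roughly [0.16, 0.004, 0.018, 0.81, 0.006], matching the table above; lower temperatures sharpen the distribution around "Paris", higher ones flatten it.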

Part II: Basic Decoding Strategies

Greedy Decoding

Greedy decoding is the simplest possible strategy: at each step, select the token with the highest probability. No randomness, no exploration of alternatives—just pick the winner every time. Despite its simplicity, greedy decoding is remarkably effective for many tasks and serves as an important baseline for understanding more sophisticated methods.

The appeal of greedy decoding is clear. It's fast—no need to sample or maintain multiple candidates. It's deterministic—the same prompt always produces the same output, which is valuable for testing, debugging, and applications requiring reproducibility. It's simple to implement—literally one line of code. And for tasks with clear "correct" answers, like factual questions or mathematical computations, greedy decoding often produces good results because the correct answer typically has the highest probability.

However, greedy decoding has fundamental limitations that become apparent in longer generations. The most famous is the repetition problem: greedy decoding tends to get stuck in loops. Once the model produces a phrase like "I think", the next most likely token might be "that", followed by "I" (the model might be generating "I think that I think that..."). Each repetition reinforces the pattern because the model was trained on text where repeated phrases do occasionally occur. Greedy decoding follows this path relentlessly, producing degenerate outputs like "I think I think I think I think..."

Another limitation is that greedy decoding is locally optimal but globally suboptimal. It picks the best token at each step, but the best sequence might not consist of the locally best tokens. Suppose "good" is the single most likely next token while "very" ranks slightly lower; if "very" opens the way to a continuation like "very good indeed" whose overall probability beats anything starting with "good", greedy never finds it, because it committed to "good" at the first step.

Despite these limitations, greedy decoding remains valuable for specific use cases: short generations where repetition isn't a risk, tasks with clear correct answers, situations requiring determinism, and as a fast baseline for comparison.

┌─────────────────────────────────────────────────────────────────────────┐
│                    GREEDY DECODING                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ALGORITHM:                                                              │
│  ──────────                                                              │
│  At each step, select token with highest probability:                  │
│                                                                          │
│  next_token = argmax(probs)                                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def greedy_decode(logits):                                             │
│      return torch.argmax(logits, dim=-1)                               │
│                                                                          │
│  # That's it! One line.                                                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  Input: "The best programming language is"                             │
│                                                                          │
│  Step 1: probs = {"Python": 0.35, "Java": 0.20, "C++": 0.15, ...}     │
│          Select: "Python" (highest)                                    │
│                                                                          │
│  Step 2: probs = {"because": 0.40, "for": 0.25, ".": 0.10, ...}       │
│          Select: "because" (highest)                                   │
│                                                                          │
│  Step 3: probs = {"it": 0.45, "of": 0.20, "Python": 0.08, ...}        │
│          Select: "it" (highest)                                        │
│                                                                          │
│  Output: "The best programming language is Python because it..."       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ADVANTAGES:                                                             │
│  ───────────                                                             │
│  ✓ Deterministic (same input → same output)                           │
│  ✓ Fast (no sampling overhead)                                         │
│  ✓ Simple to implement                                                 │
│  ✓ Good for factual queries with clear answers                        │
│                                                                          │
│  DISADVANTAGES:                                                          │
│  ──────────────                                                          │
│  ✗ Boring, repetitive outputs                                          │
│  ✗ Can get stuck in loops: "I think I think I think..."               │
│  ✗ Misses good sequences that don't start with highest-prob token     │
│  ✗ No diversity—always same response to same prompt                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE REPETITION PROBLEM:                                                 │
│  ───────────────────────                                                 │
│  Greedy decoding often produces repetitive text:                       │
│                                                                          │
│  "The movie was good. The movie was good. The movie was good..."       │
│                                                                          │
│  Why? Once in a repetitive pattern, that pattern has high probability │
│  (the model learned from text that often repeats), so greedy keeps    │
│  selecting it.                                                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHEN TO USE:                                                            │
│  ────────────                                                            │
│  • Math problems (one correct answer)                                  │
│  • Factual extraction                                                  │
│  • Code completion where determinism matters                           │
│  • Short generations where loops aren't a risk                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
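In practice you rarely write the argmax loop yourself; most inference libraries expose greedy decoding as a flag. Here is a sketch using Hugging Face transformers' generate() (the parameter names below are transformers-specific): with sampling disabled and a single beam, generation is exactly argmax at every step, and no_repeat_ngram_size is one common mitigation for the repetition problem described above.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The best programming language is", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,          # greedy: pick the argmax token at each step
    num_beams=1,
    no_repeat_ngram_size=3,   # optional: block exact 3-gram repeats
)
print(tokenizer.decode(output[0], skip_special_tokens=True))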

Pure Random Sampling

The opposite extreme: sample randomly from the full distribution.

┌─────────────────────────────────────────────────────────────────────────┐
│                    PURE RANDOM SAMPLING                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ALGORITHM:                                                              │
│  ──────────                                                              │
│  Sample from the probability distribution:                             │
│                                                                          │
│  next_token = sample(probs)                                            │
│                                                                          │
│  Each token i is selected with probability probs[i].                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def sample_decode(logits):                                             │
│      probs = F.softmax(logits, dim=-1)                                 │
│      return torch.multinomial(probs, num_samples=1)                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  probs = {"Paris": 0.85, "Lyon": 0.03, "the": 0.02, "a": 0.02, ...}   │
│                                                                          │
│  Run 1: Sample → "Paris" (85% chance)                                  │
│  Run 2: Sample → "Paris"                                               │
│  Run 3: Sample → "Lyon" (got the 3%!)                                  │
│  Run 4: Sample → "the" (uh oh, 2% happened)                            │
│                                                                          │
│  Output with "the": "The capital of France is the city of..."         │
│  (grammatically fine but not what we wanted)                           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE PROBLEM: SAMPLING LOW-PROBABILITY TOKENS                           │
│  ─────────────────────────────────────────────                          │
│                                                                          │
│  Distribution: {"good": 0.4, "great": 0.3, "nice": 0.1,               │
│                 "asdfgh": 0.0001, "!!!": 0.0001, ...}                  │
│                                                                          │
│  There are 32,000 tokens. Even if each bad token has only 0.0001      │
│  probability, with thousands of them, there's a reasonable chance     │
│  of sampling something nonsensical.                                    │
│                                                                          │
│  After many sampling steps, errors compound:                           │
│  "The capital of France is a very the good !!!"                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ADVANTAGES:                                                             │
│  ───────────                                                             │
│  ✓ Maximum diversity and creativity                                    │
│  ✓ Can explore rare but valid completions                             │
│  ✓ Avoids repetition loops                                             │
│                                                                          │
│  DISADVANTAGES:                                                          │
│  ──────────────                                                          │
│  ✗ Can select nonsensical tokens                                       │
│  ✗ Incoherent for long generations                                    │
│  ✗ Too random for most practical uses                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHEN TO USE:                                                            │
│  ────────────                                                            │
│  Almost never in production! Pure sampling is too chaotic.             │
│  It's useful as a baseline or for understanding other methods.        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
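A back-of-the-envelope calculation shows how the errors compound. Suppose, purely for illustration, that each sampling step has a 2% chance of landing in the junk tail of the distribution. The probability of at least one bad token grows quickly with generation length:

p_bad_per_step = 0.02  # illustrative, not measured from any particular model

for steps in (10, 50, 200):
    p_at_least_one_bad = 1 - (1 - p_bad_per_step) ** steps
    print(f"{steps} steps: {p_at_least_one_bad:.0%} chance of at least one bad token")

# 10 steps:  18%
# 50 steps:  64%
# 200 steps: 98%

This is why pure sampling can look fine for a sentence or two and then fall apart over a paragraph.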

Part III: Temperature Scaling

Temperature is the single most important and widely-used parameter for controlling LLM generation behavior. It appears in virtually every LLM API, from OpenAI to Anthropic to open-source frameworks. When someone says they "turned down the temperature" to get more focused outputs or "cranked up the temperature" for more creative responses, they're using this fundamental control.

Understanding temperature deeply is crucial because it affects every other aspect of generation. Temperature interacts with all other decoding parameters and fundamentally shapes the character of model outputs. A model at temperature 0.2 feels precise, focused, and sometimes stilted; the same model at temperature 1.2 feels creative, surprising, and sometimes chaotic. Neither is inherently better—the right setting depends entirely on your task.

Temperature derives its name from statistical mechanics, where it describes the energy distribution of particles. In that domain, low temperature means particles cluster in low-energy states; high temperature means energy is spread more uniformly. The analogy to text generation is apt: low temperature concentrates probability on high-likelihood tokens; high temperature spreads probability more uniformly across the vocabulary.

How Temperature Works

Temperature operates by scaling the logits (raw model outputs) before the softmax that converts them to probabilities. The formula is simple: divide all logits by the temperature value, then apply softmax. But this simple operation has profound effects on the resulting probability distribution.

To understand why, consider what softmax does: it exponentiates each logit and normalizes. When logits are divided by a small temperature (say, 0.5), their differences are amplified. A logit of 4 becomes 8; a logit of 2 becomes 4. After exponentiating, e^8 ≈ 2981 while e^4 ≈ 55—the ratio between them has grown enormously. The high-probability token dominates even more strongly.

Conversely, when logits are divided by a large temperature (say, 2.0), their differences are compressed. A logit of 4 becomes 2; a logit of 2 becomes 1. After exponentiating, e^2 ≈ 7.4 while e^1 ≈ 2.7—the ratio is much smaller. Probabilities become more uniform; lower-probability tokens get a better chance.

In the limit, as temperature approaches 0, the distribution becomes a one-hot vector where all probability concentrates on the highest-logit token—equivalent to greedy decoding. As temperature approaches infinity, the distribution becomes uniform over all tokens—equivalent to pure random sampling.
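A few lines of Python make those numbers easy to verify. Because softmax normalizes every token by the same denominator, the probability ratio between any two tokens is simply exp of their logit gap divided by T, regardless of the rest of the vocabulary:

import math

l1, l2 = 4.0, 2.0  # the two logits from the example above

for T in (0.5, 1.0, 2.0):
    ratio = math.exp((l1 - l2) / T)
    print(f"T={T}: P(token1)/P(token2) = exp({(l1 - l2) / T:.1f}) = {ratio:.1f}")

# T=0.5: exp(4.0) = 54.6   (differences amplified)
# T=1.0: exp(2.0) = 7.4
# T=2.0: exp(1.0) = 2.7    (differences compressed)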

This mathematical relationship has intuitive consequences:

Low temperature (0.1-0.3): The model becomes highly confident, almost always picking the most likely tokens. Outputs are nearly deterministic, focused, and predictable. Good for tasks requiring precision and consistency. Risk: can become boring, repetitive, or overly conservative.

Medium temperature (0.5-0.8): A balance between focus and exploration. Most likely tokens are favored but alternatives have meaningful probability. This is where most production chat applications operate. Good for general assistance, where you want coherent responses that aren't robotic.

High temperature (1.0-1.5): Probability is spread more broadly. Lower-ranked tokens have real chances of selection, creating more varied and surprising outputs. Good for creative tasks, brainstorming, generating diverse options. Risk: outputs can become incoherent or include low-quality tokens.

Very high temperature (>1.5): Probability approaches uniform. Token selection becomes nearly random. Rarely useful except for specific creative experiments. High risk of nonsensical output.

┌─────────────────────────────────────────────────────────────────────────┐
│                    TEMPERATURE SCALING                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE FORMULA:                                                            │
│  ────────────                                                            │
│  scaled_logits = logits / temperature                                  │
│  probs = softmax(scaled_logits)                                        │
│                                                                          │
│  Temperature T controls the "sharpness" of the probability             │
│  distribution BEFORE sampling.                                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  VISUALIZATION:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  Original logits: [4.0, 2.0, 1.0, 0.5]                                 │
│  Original probs:  [0.72, 0.10, 0.04, 0.02, ...]                       │
│                                                                          │
│                                                                          │
│  T = 0.5 (Lower = Sharper)                                             │
│  ─────────────────────────                                               │
│  scaled: [8.0, 4.0, 2.0, 1.0]                                          │
│  probs:  [0.95, 0.017, 0.002, 0.001, ...]                             │
│                                                                          │
│  ████████████████████████████████████████  "Paris" (95%)              │
│  █                                          "Lyon" (2%)               │
│  █                                          others                     │
│                                                                          │
│  Higher differences → more peaked → more deterministic                 │
│                                                                          │
│                                                                          │
│  T = 1.0 (Neutral)                                                      │
│  ─────────────────                                                       │
│  scaled: [4.0, 2.0, 1.0, 0.5]                                          │
│  probs:  [0.72, 0.10, 0.04, 0.02, ...]                                │
│                                                                          │
│  ████████████████████████████               "Paris" (72%)              │
│  ████                                       "Lyon" (10%)              │
│  ██                                         others                     │
│                                                                          │
│  Original distribution unchanged                                        │
│                                                                          │
│                                                                          │
│  T = 2.0 (Higher = Flatter)                                            │
│  ──────────────────────────                                              │
│  scaled: [2.0, 1.0, 0.5, 0.25]                                         │
│  probs:  [0.45, 0.17, 0.10, 0.08, ...]                                │
│                                                                          │
│  ██████████████████                         "Paris" (45%)              │
│  ███████                                    "Lyon" (17%)              │
│  ████                                       others spread out          │
│                                                                          │
│  Lower differences → flatter → more random                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MATHEMATICAL INTUITION:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  softmax(x/T)_i = exp(x_i/T) / Σⱼ exp(x_j/T)                          │
│                                                                          │
│  As T → 0:  Distribution → one-hot (greedy)                           │
│  As T → ∞:  Distribution → uniform (random)                           │
│                                                                          │
│  Low temperature amplifies differences between logits.                 │
│  High temperature dampens differences.                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def temperature_sample(logits, temperature=1.0):                      │
│      if temperature == 0:                                               │
│          return torch.argmax(logits, dim=-1)  # Greedy                │
│      scaled_logits = logits / temperature                              │
│      probs = F.softmax(scaled_logits, dim=-1)                          │
│      return torch.multinomial(probs, num_samples=1)                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Temperature Guidelines

┌─────────────────────────────────────────────────────────────────────────┐
│                    TEMPERATURE SELECTION GUIDE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TEMPERATURE RANGES AND USE CASES:                                      │
│  ─────────────────────────────────                                       │
│                                                                          │
│  T = 0 (or very low, 0.01)                                             │
│  ─────────────────────────                                               │
│  • Greedy decoding                                                     │
│  • Math, logic, factual questions                                      │
│  • Code generation (deterministic)                                     │
│  • When reproducibility matters                                        │
│                                                                          │
│  T = 0.1 - 0.3 (Low)                                                   │
│  ─────────────────────                                                   │
│  • Mostly deterministic with slight variation                          │
│  • Technical writing, documentation                                    │
│  • Structured outputs (JSON)                                           │
│  • When you want consistency but not complete rigidity                │
│                                                                          │
│  T = 0.5 - 0.7 (Medium-Low)                                            │
│  ───────────────────────────                                             │
│  • Balanced factual/creative                                           │
│  • General assistant responses                                         │
│  • Code with some flexibility                                          │
│  • DEFAULT for many production systems                                 │
│                                                                          │
│  T = 0.8 - 1.0 (Medium)                                                │
│  ─────────────────────                                                   │
│  • Creative writing                                                    │
│  • Brainstorming                                                       │
│  • Chat conversations                                                  │
│  • When diversity is valuable                                          │
│                                                                          │
│  T = 1.2 - 1.5 (High)                                                  │
│  ─────────────────────                                                   │
│  • Very creative, experimental                                         │
│  • Poetry, fiction                                                     │
│  • Generating diverse options                                          │
│  • Risk of incoherence increases                                       │
│                                                                          │
│  T > 1.5 (Very High)                                                   │
│  ────────────────────                                                    │
│  • Rarely useful in practice                                           │
│  • Outputs become random/nonsensical                                  │
│  • Only for specific creative experiments                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMMON DEFAULTS:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  OpenAI API:          Default T = 1.0                                  │
│  Anthropic API:       Default T = 1.0                                  │
│  Llama.cpp:           Default T = 0.8                                  │
│  vLLM:                Default T = 1.0                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PRO TIP: Temperature interacts with other parameters!                 │
│  ─────────────────────────────────────────────────────                   │
│  High temperature + top_p filtering = controlled creativity           │
│  Low temperature + no filtering = nearly greedy                        │
│                                                                          │
│  Don't set temperature in isolation—consider the full pipeline.       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
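To see how these settings are typically wired together, here is a sketch using Hugging Face transformers' generate() (the parameter names are transformers-specific; other APIs expose similar controls, and top_p itself is covered in Part V). The prompt and settings are illustrative, not recommendations for any particular model.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Write a tagline for a coffee shop:", return_tensors="pt")

# "Precise" settings: low temperature, tight nucleus
precise = model.generate(**inputs, do_sample=True, temperature=0.3, top_p=0.9,
                         max_new_tokens=20)

# "Creative" settings: higher temperature, wider nucleus
creative = model.generate(**inputs, do_sample=True, temperature=1.1, top_p=0.95,
                          max_new_tokens=20)

print(tokenizer.decode(precise[0], skip_special_tokens=True))
print(tokenizer.decode(creative[0], skip_special_tokens=True))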

Part IV: Top-K Sampling

Top-k limits sampling to the k most probable tokens.

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOP-K SAMPLING                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE IDEA:                                                               │
│  ─────────                                                               │
│  Only consider the top k tokens. Set all others to probability 0.     │
│  Then renormalize and sample.                                          │
│                                                                          │
│  This prevents sampling from the "long tail" of unlikely tokens.       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE WITH K = 3:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Original distribution:                                                 │
│  {"Paris": 0.70, "Lyon": 0.10, "city": 0.05, "the": 0.03,             │
│   "a": 0.02, "one": 0.02, "!!!": 0.0001, ...}                         │
│                                                                          │
│  Step 1: Keep only top 3 tokens                                        │
│  {"Paris": 0.70, "Lyon": 0.10, "city": 0.05}                          │
│                                                                          │
│  Step 2: Renormalize to sum to 1                                       │
│  {"Paris": 0.82, "Lyon": 0.12, "city": 0.06}                          │
│       (0.70/0.85)  (0.10/0.85)  (0.05/0.85)                           │
│                                                                          │
│  Step 3: Sample from filtered distribution                             │
│  Result: "Paris" (82%), "Lyon" (12%), or "city" (6%)                  │
│                                                                          │
│  No chance of sampling "!!!" or other garbage!                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def top_k_sample(logits, k=50, temperature=1.0):                      │
│      # Apply temperature                                                │
│      scaled_logits = logits / temperature                              │
│                                                                          │
│      # Get top k logits and indices                                    │
│      top_k_logits, top_k_indices = torch.topk(scaled_logits, k)       │
│                                                                          │
│      # Convert to probabilities                                        │
│      top_k_probs = F.softmax(top_k_logits, dim=-1)                    │
│                                                                          │
│      # Sample from top k                                               │
│      sample_idx = torch.multinomial(top_k_probs, num_samples=1)       │
│      return top_k_indices[sample_idx]                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CHOOSING K:                                                             │
│  ───────────                                                             │
│                                                                          │
│  k = 1:     Equivalent to greedy decoding                              │
│  k = 10:    Very focused, low diversity                                │
│  k = 40:    Common default, balanced                                   │
│  k = 100:   Higher diversity                                           │
│  k = vocab: Equivalent to pure sampling                                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE PROBLEM WITH TOP-K:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  Fixed k doesn't adapt to the distribution shape!                      │
│                                                                          │
│  Case 1: Peaked distribution                                           │
│  {"Paris": 0.95, "Lyon": 0.02, "city": 0.01, ...}                     │
│  With k=40, we're including 37 nearly-zero-probability tokens.        │
│  Wasteful and adds noise.                                              │
│                                                                          │
│  Case 2: Flat distribution                                             │
│  {"good": 0.1, "great": 0.1, "nice": 0.1, "fine": 0.08, ...}         │
│  With k=40, we might miss many reasonable options ranked 41+.         │
│  Too restrictive.                                                      │
│                                                                          │
│  Same k, very different effects depending on the distribution!        │
│  This motivates top-p (nucleus) sampling.                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
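A toy experiment makes the problem concrete. Both distributions below are invented for illustration: a peaked one where a single token holds 95% of the mass, and a flat one where 20 tokens are roughly equally plausible. The same k=10 behaves very differently on each.

import torch

def top_k_filter(probs, k):
    values, indices = torch.topk(probs, k)
    kept = values / values.sum()  # renormalize the surviving tokens
    return kept, indices

# Peaked: one dominant token plus a long tail, over a 1,000-token toy vocab
peaked = torch.full((1000,), 0.05 / 999)
peaked[0] = 0.95

# Flat: 20 roughly equally plausible tokens, then a tail
flat = torch.full((1000,), 0.02 / 980)
flat[:20] = 0.98 / 20

for name, dist in [("peaked", peaked), ("flat", flat)]:
    kept, _ = top_k_filter(dist, k=10)
    meaningful = int((kept > 0.01).sum())  # tokens with more than 1% of the kept mass
    print(f"{name}: top-10 keeps {meaningful} meaningful tokens")

# peaked: 1 meaningful token (the other 9 slots are wasted on the tail)
# flat:   all 10 are meaningful, but 10 equally good tokens were dropped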

Part V: Top-P (Nucleus) Sampling

Top-p sampling, also known as nucleus sampling, represents a significant conceptual advance over top-k. It was introduced in the 2019 paper "The Curious Case of Neural Text Degeneration" by Holtzman et al., which systematically analyzed why language model outputs often degrade into repetitive, generic text and proposed nucleus sampling as a solution.

The key insight behind top-p is that a fixed number k of candidate tokens doesn't make sense because probability distributions vary enormously in shape. Sometimes the model is confident and one token dominates with 95% probability; including k=50 tokens would add 49 nearly-zero-probability distractors. Other times the model is uncertain and probability is spread across many reasonable alternatives; limiting to k=50 might exclude valid options ranked 51+.

Top-p solves this by defining the candidate set in terms of cumulative probability rather than count. Instead of saying "consider the top 50 tokens," it says "consider the minimum set of tokens whose cumulative probability reaches p." This automatically adapts to the distribution shape:

When the model is confident (peaked distribution with one dominant token), few tokens are needed to reach the probability threshold. With p=0.9 and a distribution where one token has 85% probability, we might only include 2-3 tokens.

When the model is uncertain (flat distribution with many reasonable options), more tokens are needed. With p=0.9 and a flat distribution where the top token has only 10% probability, we might include 15-20 tokens before reaching the threshold.

This adaptive behavior aligns with intuition about what the model "thinks is reasonable." If the model strongly prefers one token, why consider many alternatives? If the model is genuinely uncertain among many options, why artificially restrict to a fixed few?

The "nucleus" terminology comes from thinking of the probability distribution as having a concentrated "nucleus" of reasonable tokens surrounded by a long tail of unlikely tokens. Top-p sampling keeps the nucleus and discards the tail. The nucleus size varies naturally with the distribution's confidence.

Top-p has become the dominant sampling method in production systems. OpenAI, Anthropic, and most other providers use it either as the default or as the primary recommended tuning parameter alongside temperature. The typical recommended value is around 0.9-0.95, which retains most of the probability mass while filtering out the lowest-probability tokens that could cause incoherent outputs.

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOP-P (NUCLEUS) SAMPLING                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE IDEA:                                                               │
│  ─────────                                                               │
│  Include the smallest set of tokens whose cumulative probability       │
│  exceeds p. This adapts to the distribution shape.                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE WITH P = 0.9:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  CASE 1: Peaked distribution                                           │
│  ─────────────────────────                                               │
│  Token      Prob     Cumulative                                        │
│  ──────────────────────────────                                         │
│  "Paris"    0.85     0.85         ← stop! 0.85 < 0.9                  │
│  "Lyon"     0.05     0.90         ← include, now 0.90 ≥ 0.9           │
│  "city"     0.03     0.93                                              │
│  ...                                                                    │
│                                                                          │
│  Nucleus = {"Paris", "Lyon"} (only 2 tokens!)                         │
│  Renormalize: {"Paris": 0.94, "Lyon": 0.06}                           │
│                                                                          │
│                                                                          │
│  CASE 2: Flat distribution                                             │
│  ─────────────────────────                                               │
│  Token      Prob     Cumulative                                        │
│  ──────────────────────────────                                         │
│  "good"     0.12     0.12                                              │
│  "great"    0.11     0.23                                              │
│  "nice"     0.10     0.33                                              │
│  "fine"     0.09     0.42                                              │
│  "okay"     0.08     0.50                                              │
│  "well"     0.07     0.57                                              │
│  "fair"     0.06     0.63                                              │
│  "decent"   0.05     0.68                                              │
│  "solid"    0.05     0.73                                              │
│  "alright"  0.04     0.77                                              │
│  "sweet"    0.04     0.81                                              │
│  "cool"     0.03     0.84                                              │
│  "neat"     0.03     0.87                                              │
│  "rad"      0.02     0.89         ← 0.89 < 0.9, keep going            │
│  "super"    0.02     0.91         ← 0.91 ≥ 0.9, include and stop      │
│                                                                          │
│  Nucleus = 15 tokens (adapts to flat distribution!)                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THIS IS BETTER THAN TOP-K:                                         │
│  ───────────────────────────────                                         │
│                                                                          │
│  Top-p automatically:                                                   │
│  • Uses few tokens when distribution is peaked (confident model)      │
│  • Uses many tokens when distribution is flat (uncertain model)       │
│                                                                          │
│  This matches intuition: include tokens the model thinks are          │
│  reasonable, exclude tokens the model thinks are wrong.               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def top_p_sample(logits, p=0.9, temperature=1.0):                     │
│      # Apply temperature                                                │
│      scaled_logits = logits / temperature                              │
│      probs = F.softmax(scaled_logits, dim=-1)                          │
│                                                                          │
│      # Sort probabilities descending                                   │
│      sorted_probs, sorted_indices = torch.sort(probs, descending=True) │
│                                                                          │
│      # Compute cumulative probabilities                                │
│      cumsum_probs = torch.cumsum(sorted_probs, dim=-1)                 │
│                                                                          │
│      # Find cutoff index (first position where cumsum >= p)           │
│      cutoff_mask = cumsum_probs > p                                    │
│      # Shift mask right by 1 to include the token that crosses p      │
│      cutoff_mask[..., 1:] = cutoff_mask[..., :-1].clone()             │
│      cutoff_mask[..., 0] = False                                       │
│                                                                          │
│      # Zero out tokens beyond cutoff                                   │
│      sorted_probs[cutoff_mask] = 0.0                                   │
│                                                                          │
│      # Renormalize                                                      │
│      sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
│                                                                          │
│      # Sample                                                           │
│      sample_idx = torch.multinomial(sorted_probs, num_samples=1)       │
│      return sorted_indices.gather(-1, sample_idx)                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CHOOSING P:                                                             │
│  ───────────                                                             │
│                                                                          │
│  p = 0.1:   Very focused (like low k)                                  │
│  p = 0.5:   Moderate focus                                             │
│  p = 0.9:   Common default, good balance                               │
│  p = 0.95:  Slightly more diverse                                      │
│  p = 1.0:   No filtering (pure sampling)                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMMON DEFAULTS:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  OpenAI:        top_p = 1.0 (disabled by default)                      │
│  Anthropic:     top_p = 0.999 (very permissive)                        │
│  Llama.cpp:     top_p = 0.9                                            │
│  Most papers:   top_p = 0.9 - 0.95                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
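
In practice you rarely implement nucleus sampling by hand; most inference libraries expose it as a generation parameter. As a minimal sketch, assuming a Hugging Face transformers causal LM (the checkpoint name and prompt below are placeholders, not from this article):

# Minimal sketch: nucleus sampling via transformers.generate
# (checkpoint name and prompt are placeholders)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # enable sampling (otherwise generation is greedy)
    top_p=0.9,         # nucleus threshold
    temperature=1.0,   # applied before the top-p filter
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))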

Combining Temperature with Top-K and Top-P

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    COMBINING SAMPLING PARAMETERS                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ORDER OF OPERATIONS:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  1. Get logits from model                                              │
│  2. Apply temperature scaling (logits / T)                             │
│  3. Convert to probabilities (softmax)                                 │
│  4. Apply top-k filtering (keep only top k)                           │
│  5. Apply top-p filtering (keep cumulative p)                         │
│  6. Renormalize remaining probabilities                                │
│  7. Sample from filtered distribution                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def sample_with_params(logits, temperature=1.0, top_k=0, top_p=1.0): │
│      # Step 1: Temperature                                             │
│      if temperature != 1.0:                                            │
│          logits = logits / temperature                                 │
│                                                                          │
│      # Step 2: Top-k filtering                                         │
│      if top_k > 0:                                                     │
│          top_k_values, _ = torch.topk(logits, top_k)                  │
│          min_top_k = top_k_values[..., -1, None]                      │
│          logits = torch.where(logits < min_top_k,                     │
│                               float('-inf'), logits)                  │
│                                                                          │
│      # Step 3: Top-p filtering                                         │
│      if top_p < 1.0:                                                   │
│          sorted_logits, sorted_idx = torch.sort(logits, descending=True)
│          probs = F.softmax(sorted_logits, dim=-1)                     │
│          cumsum = torch.cumsum(probs, dim=-1)                         │
│          mask = cumsum - probs > top_p                                │
│          sorted_logits[mask] = float('-inf')                          │
│          # Unsort                                                      │
│          logits = sorted_logits.scatter(-1, sorted_idx, sorted_logits)│
│                                                                          │
│      # Step 4: Sample                                                  │
│      probs = F.softmax(logits, dim=-1)                                 │
│      return torch.multinomial(probs, num_samples=1)                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RECOMMENDED COMBINATIONS:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  FACTUAL / DETERMINISTIC:                                               │
│  temperature=0.0   (greedy)                                            │
│  OR temperature=0.1, top_p=0.9                                         │
│                                                                          │
│  GENERAL ASSISTANT:                                                      │
│  temperature=0.7, top_p=0.9                                            │
│                                                                          │
│  CREATIVE WRITING:                                                       │
│  temperature=0.9, top_p=0.95                                           │
│                                                                          │
│  CODE GENERATION:                                                        │
│  temperature=0.2, top_p=0.95                                           │
│  (low temp for correctness, high top_p for rare but valid tokens)     │
│                                                                          │
│  BRAINSTORMING:                                                          │
│  temperature=1.0, top_p=0.9, top_k=100                                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WARNING: Don't use both top_k AND top_p aggressively!                 │
│  ───────────────────────────────────────────────────────                 │
│                                                                          │
│  top_k=10, top_p=0.5 might leave you with only 2-3 tokens.            │
│  This can hurt quality. Pick one as your main filter.                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
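
Before tuning both knobs at once, it can help to check how many tokens actually survive the stacked filters. A small sketch with a made-up distribution (the probability values are illustrative only):

# Sketch: how many tokens survive stacked top-k and top-p filters?
# The probability values are made up for illustration.
import torch

probs = torch.tensor([0.30, 0.22, 0.15, 0.10, 0.08, 0.06, 0.05, 0.04])
top_k, top_p = 10, 0.5

k_survivors = min(top_k, probs.numel())                   # top-k keeps at most k
sorted_probs, _ = probs.sort(descending=True)
cumsum = sorted_probs.cumsum(dim=-1)
p_survivors = int((cumsum - sorted_probs < top_p).sum())  # nucleus size

print(min(k_survivors, p_survivors))  # only 2 tokens left after both filters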

Part VI: Min-P Sampling

A newer alternative that filters based on minimum probability relative to the top token.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    MIN-P SAMPLING                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE IDEA:                                                               │
│  ─────────                                                               │
│  Keep tokens with probability ≥ (min_p × max_probability).             │
│  This scales the threshold relative to the top token.                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  HOW IT WORKS:                                                           │
│  ──────────────                                                          │
│                                                                          │
│  1. Find max_prob = max(probs)                                         │
│  2. threshold = min_p × max_prob                                       │
│  3. Keep only tokens with prob ≥ threshold                            │
│                                                                          │
│  Example with min_p = 0.1:                                             │
│                                                                          │
│  probs = {"Paris": 0.70, "Lyon": 0.10, "city": 0.05, "the": 0.03}     │
│  max_prob = 0.70                                                       │
│  threshold = 0.1 × 0.70 = 0.07                                        │
│                                                                          │
│  Keep tokens with prob ≥ 0.07:                                        │
│  {"Paris": 0.70, "Lyon": 0.10}  ✓ (both ≥ 0.07)                      │
│  {"city": 0.05}  ✗ (0.05 < 0.07)                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY MIN-P IS USEFUL:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Peaked distribution (max=0.95):                                       │
│  threshold = 0.1 × 0.95 = 0.095                                       │
│  Only tokens with prob ≥ 9.5% included (very few)                     │
│                                                                          │
│  Flat distribution (max=0.15):                                         │
│  threshold = 0.1 × 0.15 = 0.015                                       │
│  Tokens with prob ≥ 1.5% included (many more)                         │
│                                                                          │
│  Like top-p, it adapts to distribution shape!                         │
│  But the semantics are more intuitive: "keep tokens at least X% as   │
│  likely as the best option."                                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def min_p_sample(logits, min_p=0.1, temperature=1.0):                 │
│      scaled_logits = logits / temperature                              │
│      probs = F.softmax(scaled_logits, dim=-1)                          │
│                                                                          │
│      # Compute threshold                                               │
│      max_prob = probs.max(dim=-1, keepdim=True).values                │
│      threshold = min_p * max_prob                                      │
│                                                                          │
│      # Filter                                                          │
│      mask = probs < threshold                                          │
│      probs[mask] = 0.0                                                 │
│                                                                          │
│      # Renormalize and sample                                          │
│      probs = probs / probs.sum(dim=-1, keepdim=True)                  │
│      return torch.multinomial(probs, num_samples=1)                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RECOMMENDED VALUES:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  min_p = 0.05:  Permissive (include 5%+ relative probability)         │
│  min_p = 0.1:   Balanced (common default)                             │
│  min_p = 0.2:   Restrictive (only strong alternatives)                │
│                                                                          │
│  Min-p is gaining popularity as a simpler alternative to top-p.       │
│  Supported in: llama.cpp, vLLM, text-generation-inference             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
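
As a quick check of the example in the box, here is the same min-p filter as a few lines of PyTorch (the probabilities are the illustrative ones above and do not sum to 1):

# Sketch: min-p filtering on the toy distribution from the box above.
import torch

probs = torch.tensor([0.70, 0.10, 0.05, 0.03])   # "Paris", "Lyon", "city", "the"
min_p = 0.1

threshold = min_p * probs.max()                  # 0.1 * 0.70 = 0.07
kept = probs >= threshold                        # [True, True, False, False]
filtered = torch.where(kept, probs, torch.zeros_like(probs))
filtered = filtered / filtered.sum()             # renormalize over survivors

print(filtered)   # tensor([0.8750, 0.1250, 0.0000, 0.0000])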

Part VII: Beam Search

Beam search represents a fundamentally different approach to decoding compared to the sampling methods we've discussed so far. While sampling methods select one token at a time, accepting whatever randomness produces, beam search maintains multiple candidate sequences (called "beams") simultaneously and explores different possible continuations in parallel. It's a search algorithm rather than a sampling algorithm.

The motivation for beam search comes from the observation that greedy decoding makes locally optimal choices that may not be globally optimal. Consider generating "The best way to learn programming is". Greedy might select "practice" (the single most likely next token) and continue with "practice and dedication", even though the slightly less likely "through" would have led to the better overall continuation "through practice". By the time greedy has committed to "practice", that opportunity is lost.

Beam search addresses this by hedging: instead of committing to a single token, it maintains B candidate sequences (the "beams"). At each step, it expands all beams by all possible next tokens, scores the resulting sequences, and keeps the top B. This explores multiple paths through the generation space simultaneously.

For example, with beam width B=3, beam search might maintain:

  • Beam 1: "...is through practice" (score: 0.42)
  • Beam 2: "...is by coding" (score: 0.38)
  • Beam 3: "...is practice and" (score: 0.35)

At the next step, each beam is extended by the top candidate tokens, producing many candidates, from which the best 3 are kept. This continues until all beams reach an end token.

Beam search has a long history in machine translation and other sequence-to-sequence tasks, where finding the highest-probability translation is more important than diversity. It remains the standard method for neural machine translation, speech recognition transcription, and other applications with clear "correct" outputs.

However, beam search has significant limitations for open-ended text generation. It tends to produce repetitive, generic, and short outputs—a phenomenon known as "neural text degeneration." The core issue is that beam search optimizes for probability, and high-probability text is often boring. "I don't know" is more probable than "The quantum fluctuations in the early universe led to..." because the former is common and the latter is rare and specific. Beam search gravitates toward common, safe text.

For chat applications and creative writing, sampling methods with temperature and top-p almost always outperform beam search. The diversity and interestingness that come from controlled randomness produce much more engaging outputs than the probability-maximizing but often dull text from beam search.

Understanding when to use beam search versus sampling is crucial: use beam search for tasks with correct answers (translation, transcription); use sampling for tasks valuing creativity, engagement, or diversity (chat, stories, brainstorming).

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    BEAM SEARCH                                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM WITH GREEDY:                                                │
│  ─────────────────────────                                               │
│  Greedy picks the best token at each step, but the locally best       │
│  choice may not lead to the globally best sequence.                   │
│                                                                          │
│  Example:                                                               │
│  Step 1: "good" (0.4) vs "very" (0.35)                                │
│          Greedy picks "good"                                           │
│                                                                          │
│  Step 2 after "good": best next token is "movie" (0.3)                 │
│           → "good movie"  total: 0.40 × 0.30 = 0.12                    │
│  Step 2 after "very": best next token is "good" (0.5)                  │
│           → "very good"   total: 0.35 × 0.50 = 0.175                   │
│                                                                          │
│  "very good" (0.175) > "good movie" (0.12)                             │
│  Greedy missed the higher-probability sequence!                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE BEAM SEARCH IDEA:                                                   │
│  ─────────────────────                                                   │
│  Keep track of B "beams" (candidate sequences) at each step.          │
│  Expand each beam by all possible next tokens.                        │
│  Keep only the top B expanded beams.                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE WITH B = 2 (beam width = 2):                                  │
│  ─────────────────────────────────────                                   │
│                                                                          │
│  Start: "The movie was"                                                │
│                                                                          │
│  Step 1:                                                                │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │ Beam 1: "The movie was good"     score: 0.40                    │  │
│  │ Beam 2: "The movie was very"     score: 0.35                    │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  Step 2: Expand each beam by top tokens                                │
│  From Beam 1: "good" + {".": 0.3, "and": 0.2, ...}                   │
│  From Beam 2: "very" + {"good": 0.5, "bad": 0.2, ...}                │
│                                                                          │
│  All candidates:                                                        │
│  • "good." → 0.40 × 0.30 = 0.120                                      │
│  • "good and" → 0.40 × 0.20 = 0.080                                   │
│  • "very good" → 0.35 × 0.50 = 0.175  ← Best!                        │
│  • "very bad" → 0.35 × 0.20 = 0.070                                   │
│                                                                          │
│  Keep top 2:                                                            │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │ Beam 1: "The movie was very good"   score: 0.175                │  │
│  │ Beam 2: "The movie was good."       score: 0.120                │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  Continue until all beams reach <EOS> or max length.                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def beam_search(model, prompt_ids, beam_width=5, max_length=50):      │
│      # Initialize beams: (sequence, score)                             │
│      beams = [(prompt_ids, 0.0)]                                       │
│      completed = []                                                     │
│                                                                          │
│      for _ in range(max_length):                                        │
│          all_candidates = []                                            │
│                                                                          │
│          for seq, score in beams:                                       │
│              if seq[-1] == EOS_TOKEN:                                  │
│                  completed.append((seq, score))                        │
│                  continue                                              │
│                                                                          │
│              # Get next token probabilities                            │
│              logits = model(seq)[-1]                                   │
│              log_probs = F.log_softmax(logits, dim=-1)                │
│                                                                          │
│              # Get top-k candidates for this beam                      │
│              top_log_probs, top_indices = torch.topk(log_probs,       │
│                                                       beam_width)     │
│                                                                          │
│              for log_p, idx in zip(top_log_probs, top_indices):       │
│                  new_seq = seq + [idx.item()]                         │
│                  new_score = score + log_p.item()                     │
│                  all_candidates.append((new_seq, new_score))          │
│                                                                          │
│          # Keep top beam_width candidates                              │
│          all_candidates.sort(key=lambda x: x[1], reverse=True)        │
│          beams = all_candidates[:beam_width]                           │
│                                                                          │
│          if len(beams) == 0:                                           │
│              break                                                      │
│                                                                          │
│      # Return best completed sequence                                   │
│      completed.extend(beams)                                            │
│      return max(completed, key=lambda x: x[1])[0]                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ADVANTAGES:                                                             │
│  ───────────                                                             │
│  ✓ Finds higher-probability sequences than greedy                     │
│  ✓ Deterministic (reproducible)                                       │
│  ✓ Good for structured outputs (translation, summarization)          │
│                                                                          │
│  DISADVANTAGES:                                                          │
│  ──────────────                                                          │
│  ✗ B× more computation than greedy                                    │
│  ✗ Tends toward short, generic, repetitive outputs                    │
│  ✗ Not good for open-ended generation                                 │
│  ✗ Poor fit for tasks needing creativity                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE DEGENERATION PROBLEM:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Beam search often produces text like:                                 │
│  "I don't know. I don't know. I don't know."                          │
│                                                                          │
│  Why? Repetition is high probability! Once a pattern starts,          │
│  continuing it has high probability, so beam search reinforces it.   │
│                                                                          │
│  Mitigations:                                                           │
│  • Length normalization (divide score by length)                      │
│  • Repetition penalty                                                  │
│  • n-gram blocking (prevent repeating n-grams)                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHEN TO USE:                                                            │
│  ────────────                                                            │
│                                                                          │
│  ✓ Machine translation                                                 │
│  ✓ Summarization                                                       │
│  ✓ Speech recognition transcription                                   │
│  ✓ Any task with clear "correct" outputs                              │
│                                                                          │
│  ✗ Chatbots, creative writing                                         │
│  ✗ Open-ended generation                                              │
│  ✗ Anything needing diversity                                         │
│                                                                          │
│  For chat and creative tasks, sampling with temperature/top-p         │
│  is almost always better than beam search.                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
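
When beam search is the right tool, the mitigations above are usually available as flags rather than something you implement yourself. A minimal sketch, assuming a Hugging Face seq2seq model (the checkpoint and input text are placeholders):

# Sketch: beam search with length penalty and n-gram blocking via
# transformers.generate (checkpoint and input text are placeholders)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("summarize: The movie was long but rewarding ...",
                   return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,              # beam width B
    length_penalty=1.0,       # >1.0 favors longer outputs, <1.0 shorter
    no_repeat_ngram_size=3,   # block repeated trigrams
    early_stopping=True,      # stop once all beams have finished
    max_new_tokens=60,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))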

Part VIII: Repetition Penalties

Preventing models from repeating themselves.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    REPETITION PENALTIES                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM:                                                            │
│  ────────────                                                            │
│  LLMs love to repeat themselves, especially with greedy/beam search:  │
│                                                                          │
│  "The best way to learn is to practice. The best way to learn is      │
│   to practice. The best way to learn is to practice."                 │
│                                                                          │
│  Why? Once a phrase appears, it has high probability of appearing     │
│  again (the model learned from text that often repeats for emphasis). │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION 1: REPETITION PENALTY (Multiplicative)                       │
│  ───────────────────────────────────────────────                        │
│                                                                          │
│  For tokens that have appeared in the context, divide their logits    │
│  by a penalty factor:                                                  │
│                                                                          │
│  for token in context:                                                 │
│      if logits[token] > 0:                                            │
│          logits[token] /= repetition_penalty                          │
│      else:                                                             │
│          logits[token] *= repetition_penalty                          │
│                                                                          │
│  penalty = 1.0: No change                                              │
│  penalty = 1.1: Mild discouragement of repetition                     │
│  penalty = 1.5: Strong discouragement                                  │
│  penalty = 2.0: Very strong                                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION 2: PRESENCE PENALTY (Additive, OpenAI-style)                 │
│  ──────────────────────────────────────────────────────                 │
│                                                                          │
│  Subtract a constant from logits of tokens that appeared:             │
│                                                                          │
│  for token in context:                                                 │
│      logits[token] -= presence_penalty                                │
│                                                                          │
│  Flat penalty regardless of frequency.                                 │
│  Encourages talking about new topics.                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION 3: FREQUENCY PENALTY (Additive, scales with count)          │
│  ────────────────────────────────────────────────────────               │
│                                                                          │
│  Penalty proportional to how many times token appeared:               │
│                                                                          │
│  for token in context:                                                 │
│      count = context.count(token)                                     │
│      logits[token] -= frequency_penalty * count                       │
│                                                                          │
│  More repetitions → stronger penalty.                                  │
│  Better for preventing specific word overuse.                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION 4: N-GRAM BLOCKING                                            │
│  ───────────────────────────                                             │
│                                                                          │
│  Prevent exact repetition of n-grams:                                  │
│                                                                          │
│  If appending a candidate token to the last (n-1) tokens would         │
│  recreate an n-gram already in the context, set that token's           │
│  probability to 0.                                                     │
│                                                                          │
│  Example with n=3 (trigram blocking):                                  │
│  Context: "I think that I think"                                       │
│  Last 2 tokens: "I think"                                              │
│  "that" would create "I think that" (already seen!)                   │
│  → Set P("that") = 0                                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  def apply_repetition_penalty(logits, context_ids, penalty=1.1):       │
│      # Get unique tokens in context                                    │
│      unique_tokens = set(context_ids.tolist())                        │
│                                                                          │
│      for token_id in unique_tokens:                                    │
│          if logits[token_id] > 0:                                     │
│              logits[token_id] /= penalty                              │
│          else:                                                         │
│              logits[token_id] *= penalty                              │
│                                                                          │
│      return logits                                                      │
│                                                                          │
│  def apply_frequency_penalty(logits, context_ids, penalty=0.5):        │
│      from collections import Counter                                   │
│      counts = Counter(context_ids.tolist())                           │
│                                                                          │
│      for token_id, count in counts.items():                           │
│          logits[token_id] -= penalty * count                          │
│                                                                          │
│      return logits                                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RECOMMENDED VALUES:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  Repetition penalty: 1.0-1.2 (subtle), 1.3+ (aggressive)              │
│  Presence penalty:   0.0-0.5 (OpenAI default: 0)                      │
│  Frequency penalty:  0.0-0.5 (OpenAI default: 0)                      │
│                                                                          │
│  Start low and increase if you see repetition.                        │
│  Too high → unnatural, forced topic changes.                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
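
The box implements the multiplicative and frequency penalties; n-gram blocking (Solution 4) can be sketched in the same style. A minimal version, assuming context_ids is a flat sequence of token ids and logits is a 1-D tensor:

# Sketch: n-gram blocking (Solution 4 above), for 1-D logits and a flat
# list/tensor of context token ids.
def block_repeated_ngrams(logits, context_ids, n=3):
    ids = list(context_ids)
    if len(ids) < n - 1:
        return logits

    prefix = tuple(ids[-(n - 1):])        # the last n-1 generated tokens
    banned = set()
    # A token is banned if appending it to the prefix recreates an n-gram
    # that already occurs in the context.
    for i in range(len(ids) - n + 1):
        if tuple(ids[i:i + n - 1]) == prefix:
            banned.add(ids[i + n - 1])

    for token_id in banned:
        logits[token_id] = float('-inf')  # impossible to sample
    return logits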

Part IX: Speculative Decoding

Speculative decoding is one of the most important advances in LLM inference optimization, offering significant speedups without any change to output quality. Unlike other optimization techniques that trade quality for speed (like using smaller models or aggressive quantization), speculative decoding produces mathematically identical outputs to standard decoding while generating tokens 2-3× faster.

The technique addresses a fundamental inefficiency in autoregressive generation. When generating text, we make one forward pass through the model for each token generated. A 100-token response requires 100 forward passes, each waiting for the previous one to complete. But here's the inefficiency: modern GPUs are massively parallel, designed to process thousands of operations simultaneously. Processing 1 token versus processing 10 tokens takes nearly the same wall-clock time because the GPU's parallel capacity is vastly underutilized for single-token processing.

This underutilization is inherent to autoregressive generation—we can't process position N until we know what token was generated at position N-1. The sequential dependency seems unavoidable.

Speculative decoding cleverly circumvents this limitation using a draft-then-verify approach. The key insight is that we can use a fast, small "draft" model to guess multiple tokens ahead, then verify all those guesses with the large "target" model in a single parallel forward pass. If the guesses are good (which they often are, since most text is fairly predictable), we accept multiple tokens for the cost of one target model forward pass. If some guesses are wrong, we reject them and regenerate—but we're never worse off than standard decoding.

The mathematics behind speculative decoding are elegant and guarantee that the output distribution is identical to what you'd get from standard decoding with the target model alone. This is crucial: speculative decoding is not an approximation. It's a lossless acceleration technique. The draft model's role is purely to propose candidates efficiently; the target model's distribution determines what actually gets generated.

The speedup depends on how well the draft model predicts what the target model would generate—the "acceptance rate." For code generation and factual text, where outputs are highly predictable, acceptance rates of 70-90% are common, yielding 2-4× speedups. For creative or unpredictable content, acceptance rates drop, reducing the benefit.

Speculative decoding has rapidly moved from research to production. It's implemented in vLLM, Hugging Face's text-generation-inference, and various proprietary systems. For latency-sensitive applications like real-time chat, the 2-3× improvement can be transformative, reducing response times from several seconds to under a second.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    SPECULATIVE DECODING                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE BOTTLENECK:                                                         │
│  ───────────────                                                         │
│  LLM inference is sequential—each token requires a full forward pass. │
│  For 100 tokens, that's 100 forward passes, each waiting for the last. │
│                                                                          │
│  But GPUs are parallel! Processing 1 token vs 10 tokens takes nearly  │
│  the same time. We're not using the GPU's parallelism.               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE IDEA:                                                               │
│  ─────────                                                               │
│  Use a small, fast "draft" model to guess multiple tokens.            │
│  Then verify all guesses with the large "target" model in parallel.   │
│                                                                          │
│  If guesses are good, we accept them → multiple tokens per forward!  │
│  If guesses are wrong, we reject and regenerate → no worse than normal│
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  HOW IT WORKS:                                                           │
│  ──────────────                                                          │
│                                                                          │
│  1. DRAFT: Small model generates K candidate tokens                    │
│                                                                          │
│     Draft model (1B):                                                  │
│     "The capital of France is" → "Paris" → "." → "It" → "is"          │
│     (4 tokens generated quickly)                                       │
│                                                                          │
│  2. VERIFY: Large model checks all K tokens in ONE forward pass       │
│                                                                          │
│     Target model (70B):                                                │
│     Input: "The capital of France is Paris . It is"                   │
│     Output: Probabilities for each position                           │
│                                                                          │
│     Position 1: P("Paris") = 0.95 ✓ (draft said "Paris", accept!)    │
│     Position 2: P(".") = 0.80 ✓ (draft said ".", accept!)            │
│     Position 3: P("It") = 0.15 ✗ (draft said "It", target prefers    │
│                                    "The", reject!)                    │
│                                                                          │
│  3. ACCEPT/REJECT: Keep verified tokens, resample from rejection      │
│                                                                          │
│     Accepted: "Paris", "."                                             │
│     Rejected: "It" → sample from target's distribution → "The"        │
│     Result: "The capital of France is Paris. The"                     │
│                                                                          │
│     We got 3 tokens for 1 forward pass of the big model!              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE ACCEPTANCE CRITERION:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  We can't just accept if target gives high probability.               │
│  We need to maintain the EXACT distribution of the target model.      │
│                                                                          │
│  Acceptance probability:                                               │
│  P(accept) = min(1, P_target(token) / P_draft(token))                 │
│                                                                          │
│  If target likes token more than draft → always accept               │
│  If target likes less → accept with probability proportional to ratio│
│                                                                          │
│  This ensures output distribution matches target model exactly!       │
│  Speculative decoding is LOSSLESS—same outputs, just faster.         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION (Simplified):                                           │
│  ─────────────────────────────                                           │
│                                                                          │
│  def speculative_decode(target, draft, prompt, K=4):                   │
│      tokens = prompt                                                    │
│                                                                          │
│      while not done:                                                   │
│          # Draft K tokens                                              │
│          draft_tokens = []                                             │
│          draft_dists = []    # full draft distribution at each step    │
│          context = tokens                                              │
│          for _ in range(K):                                            │
│              q = draft.get_probs(context)                              │
│              t = sample(q)                                             │
│              draft_tokens.append(t)                                    │
│              draft_dists.append(q)                                     │
│              context = context + [t]                                   │
│                                                                          │
│          # Verify with target (single forward pass!)                   │
│          # target_dists[j] = target's distribution over the token      │
│          # at position j of full_context                               │
│          full_context = tokens + draft_tokens                          │
│          target_dists = target.get_all_probs(full_context)             │
│                                                                          │
│          # Accept/reject each draft token                              │
│          accepted = []                                                 │
│          for i, t in enumerate(draft_tokens):                          │
│              p = target_dists[len(tokens) + i]                         │
│              q = draft_dists[i]                                        │
│              if random() < min(1, p[t] / q[t]):                        │
│                  accepted.append(t)                                    │
│              else:                                                     │
│                  # Rejection: resample from the residual               │
│                  # distribution max(0, p - q), elementwise over vocab  │
│                  residual = maximum(p - q, 0)                          │
│                  accepted.append(sample(residual / residual.sum()))    │
│                  break  # Stop at first rejection                      │
│          # (If all K are accepted, a full implementation also          │
│          #  samples one bonus token from the target's distribution.)   │
│                                                                          │
│          tokens = tokens + accepted                                    │
│                                                                          │
│      return tokens                                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SPEEDUP:                                                                │
│  ────────                                                                │
│                                                                          │
│  Speedup ≈ K × acceptance_rate / (1 + K × draft_cost/target_cost)    │
│                                                                          │
│  With good draft model (high acceptance ~70-90%):                     │
│  • K=4, 80% acceptance → ~2.5-3x speedup                              │
│  • K=8, 70% acceptance → ~3-4x speedup                                │
│                                                                          │
│  Real-world results:                                                    │
│  • 2-3x speedup typical for code generation                           │
│  • 1.5-2x for creative text (less predictable)                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DRAFT MODEL CHOICES:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  1. SMALLER VERSION OF SAME FAMILY                                      │
│     Target: Llama 70B → Draft: Llama 7B                               │
│     Good alignment, ~80% acceptance                                   │
│                                                                          │
│  2. DISTILLED MODEL                                                      │
│     Target's knowledge distilled into tiny model                      │
│     Can achieve 90%+ acceptance                                        │
│                                                                          │
│  3. N-GRAM MODEL                                                         │
│     Simple statistical model from target's outputs                    │
│     Very fast but lower acceptance                                    │
│                                                                          │
│  4. SAME MODEL, EARLY EXIT                                              │
│     Use early layers of target as draft                               │
│     (Medusa, EAGLE approaches)                                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHEN TO USE:                                                            │
│  ────────────                                                            │
│                                                                          │
│  ✓ Latency-sensitive applications (chat, real-time)                   │
│  ✓ When you have a good draft model available                         │
│  ✓ Tasks where draft can predict well (code, factual)                │
│                                                                          │
│  ✗ When draft acceptance is low (creative, unpredictable)            │
│  ✗ When draft model overhead is significant                          │
│  ✗ Batch processing (throughput matters more than latency)           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
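As a rough sanity check, the approximation in the box can be turned into a tiny helper. The cost ratio of 0.05 used below is an assumed value for a draft model roughly twenty times cheaper than the target; real speedups also depend on batch size, hardware, and how verification is batched.

def estimated_speedup(k: int, acceptance: float, cost_ratio: float) -> float:
    """Back-of-the-envelope speedup from the approximation above.

    k          -- draft tokens proposed per verification step
    acceptance -- fraction of draft tokens the target model accepts
    cost_ratio -- cost of one draft forward pass relative to the target
    """
    return (k * acceptance) / (1 + k * cost_ratio)

print(estimated_speedup(4, 0.80, 0.05))  # ~2.7x, in line with the table above
print(estimated_speedup(8, 0.70, 0.05))  # ~4.0x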

Part X: Structured and Constrained Generation

One of the most practically important developments in LLM generation is structured or constrained generation—techniques that guarantee outputs conform to specific formats like JSON, SQL, or any other grammar. This addresses a fundamental challenge in production LLM applications: you often need the model's output to be programmatically usable, not just readable by humans.

Consider building an application that uses an LLM to extract structured data from unstructured text. You want the model to output JSON like {"name": "John", "age": 30, "city": "Paris"}. With standard prompting, you might ask the model to output JSON, and it usually will—but sometimes it won't. It might add explanatory text before or after the JSON. It might produce invalid JSON (missing a closing brace, using single quotes instead of double quotes). It might hallucinate fields not in your schema. For production applications, "usually works" isn't good enough.

Structured generation solves this by constraining which tokens the model can generate at each step. If the output so far is {"name": , the next token must be a double quote (to start the string value). If the output is {"name": "John", the next token must be either a comma (to introduce another key) or a closing brace. The model can only choose among valid continuations; invalid tokens are masked out before selection.

The concept is powerful because it separates concerns: the model provides semantic intelligence (what values make sense), while the grammar constraints ensure syntactic correctness. The model might "want" to generate invalid JSON, but the constraints prevent it. You get the model's knowledge shaped into a guaranteed-valid format.
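To make the masking step concrete, here is a minimal sketch of constrained sampling. The grammar engine (not shown) supplies the set of token ids that are valid in the current state; everything else is masked to negative infinity before sampling. The vocabulary size and token ids below are placeholders; libraries like Outlines compile a schema or regex into this allowed-set computation over the real tokenizer vocabulary.

import torch
import torch.nn.functional as F

def sample_constrained(logits: torch.Tensor, allowed_ids: list[int],
                       temperature: float = 1.0) -> int:
    """Sample the next token, but only from ids the grammar currently allows."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0                      # allowed tokens keep their logits
    probs = F.softmax((logits + mask) / temperature, dim=-1)
    return torch.multinomial(probs, 1).item()

# Toy usage: at the top level the JSON grammar allows only an object or an
# array, so only '{' or '[' may be emitted (token ids here are made up).
logits = torch.randn(32_000)                     # one step of model output
OPEN_BRACE_ID, OPEN_BRACKET_ID = 90, 60          # placeholder vocabulary ids
next_id = sample_constrained(logits, [OPEN_BRACE_ID, OPEN_BRACKET_ID])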

This approach extends beyond JSON. You can constrain generation to any context-free grammar—SQL queries, XML, Python code, regular expressions, or custom formats. Some systems even support more complex constraints like type-checking or semantic validation.

The practical impact is significant. Without structured generation, LLM applications need extensive error handling, retries, and fallbacks for parsing failures. With structured generation, parsing never fails because the output is guaranteed valid. This simplifies application logic, improves reliability, and often reduces latency (no retries needed).

Several mature tools exist for structured generation: Outlines (grammar-based, integrates with HuggingFace), Guidance (template-based with constraints), and Instructor (Pydantic-based for API models). Major API providers have also added native support: OpenAI's JSON mode and function calling, Anthropic's tool use—these use similar techniques internally to guarantee structured outputs.

The tradeoff is computational overhead: tracking grammar state and computing valid tokens at each step adds latency. But for applications requiring reliable structured output, this overhead is usually worthwhile. The alternative—parsing unreliable free-form text—is worse in every dimension.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    STRUCTURED GENERATION                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM:                                                            │
│  ────────────                                                            │
│  LLMs output free-form text, but we often need structured data:       │
│  • JSON for APIs                                                       │
│  • SQL for databases                                                   │
│  • Code that compiles                                                  │
│  • Specific formats (dates, phone numbers)                            │
│                                                                          │
│  Prompting helps but doesn't guarantee valid output.                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION: CONSTRAINED DECODING                                         │
│  ───────────────────────────────                                         │
│                                                                          │
│  At each step, mask out tokens that would make output invalid.        │
│  Only allow tokens that keep the output on a valid path.             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE: JSON GENERATION                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Schema: {"name": string, "age": integer}                              │
│                                                                          │
│  Step 1: Output so far: ""                                             │
│          Valid next tokens: {"{"}                                      │
│          Model picks: "{"                                              │
│                                                                          │
│  Step 2: Output so far: "{"                                            │
│          Valid next tokens: {'"name"', '"age"'}                        │
│          Model picks: '"name"'                                         │
│                                                                          │
│  Step 3: Output so far: '{"name"'                                      │
│          Valid next tokens: {':'}                                       │
│          Model picks: ':'                                              │
│                                                                          │
│  Step 4: Output so far: '{"name":'                                     │
│          Valid next tokens: {'"'} (must start string)                  │
│          Model picks: '"'                                              │
│                                                                          │
│  Step 5: Output so far: '{"name":"'                                    │
│          Valid next tokens: {any string chars, '"' to end}            │
│          Model picks: 'J', 'o', 'h', 'n', '"'                         │
│                                                                          │
│  ...and so on, always following valid JSON paths.                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  GRAMMAR-BASED GENERATION:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Define a grammar (like BNF) and generate only valid strings:         │
│                                                                          │
│  json := object | array                                                │
│  object := "{" (pair ("," pair)*)? "}"                                │
│  pair := string ":" value                                              │
│  value := string | number | object | array | "true" | "false" | "null"│
│  ...                                                                    │
│                                                                          │
│  At each position, compute which tokens are valid according to        │
│  the grammar state, and mask everything else.                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOOLS FOR STRUCTURED GENERATION:                                       │
│  ─────────────────────────────────                                       │
│                                                                          │
│  OUTLINES (dottxt-ai/outlines):                                        │
│  Grammar-based, integrates with HuggingFace                           │
│  from outlines import models, generate                                 │
│  model = models.transformers("mistralai/Mistral-7B-v0.1")             │
│  generator = generate.json(model, schema)                             │
│  result = generator("Extract info from: John, 25 years old")          │
│                                                                          │
│  GUIDANCE (guidance-ai/guidance):                                       │
│  Template-based with constraints                                       │
│  from guidance import models, gen                                      │
│  lm = models.LlamaCpp(model_path)                                      │
│  lm += '{"name": "' + gen(stop='"') + '", "age": ' + gen(regex='\d+')│
│                                                                          │
│  INSTRUCTOR:                                                            │
│  Pydantic-based for OpenAI/Anthropic APIs                             │
│  import openai                                                         │
│  from instructor import patch                                          │
│  from pydantic import BaseModel                                        │
│  class User(BaseModel):                                                │
│      name: str                                                         │
│      age: int                                                          │
│  client = patch(openai.OpenAI())                                       │
│  user = client.chat.completions.create(                               │
│      response_model=User, ...                                          │
│  )                                                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  API-LEVEL JSON MODE:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  OpenAI offers a built-in JSON mode; Anthropic guarantees structured   │
│  output through tool use (the tool's input_schema acts as the schema):│
│                                                                          │
│  # OpenAI                                                               │
│  response = client.chat.completions.create(                            │
│      model="gpt-4",                                                    │
│      response_format={"type": "json_object"},                         │
│      messages=[...]                                                    │
│  )                                                                      │
│                                                                          │
│  # Anthropic                                                           │
│  response = client.messages.create(                                    │
│      model="claude-3-opus",                                            │
│      tool_choice={"type": "tool", "name": "extract_data"},           │
│      tools=[{                                                          │
│          "name": "extract_data",                                       │
│          "input_schema": {...json schema...}                          │
│      }],                                                               │
│      messages=[...]                                                    │
│  )                                                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PERFORMANCE CONSIDERATIONS:                                             │
│  ───────────────────────────                                             │
│                                                                          │
│  Grammar-constrained decoding can be slower:                           │
│  • Must track grammar state at each step                              │
│  • Must compute valid tokens (can be expensive for complex grammars) │
│  • Some tokens may be forced (no sampling needed, slightly faster)    │
│                                                                          │
│  But output is GUARANTEED valid—no parsing errors, no retries.        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part XI: Contrastive Decoding

Contrastive decoding improves output quality by playing a large "expert" model against a smaller "amateur" model. Generic or repetitive continuations tend to score highly under both models, while genuinely informative continuations score highly mainly under the large one, so subtracting a scaled copy of the small model's scores amplifies what the large model uniquely knows.

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    CONTRASTIVE DECODING                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE INSIGHT:                                                            │
│  ────────────                                                            │
│  Bad patterns (repetition, generic text) are often high probability   │
│  in BOTH large and small models. Good patterns are high probability   │
│  mainly in large models.                                               │
│                                                                          │
│  By subtracting small model logits from large model logits,           │
│  we emphasize what the large model "uniquely knows."                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  THE ALGORITHM:                                                          │
│  ───────────────                                                         │
│                                                                          │
│  contrastive_logits = large_logits - α × small_logits                 │
│                                                                          │
│  Token         Large   Small   Contrast (α=0.5)                        │
│  ──────────────────────────────────────────────                         │
│  "the"         4.0     3.8     4.0 - 1.9 = 2.1  (common, reduced)     │
│  "Paris"       3.5     1.0     3.5 - 0.5 = 3.0  (knowledge, boosted)  │
│  "I think"     3.0     2.9     3.0 - 1.45 = 1.55 (generic, reduced)   │
│  "fascinating" 2.0     0.3     2.0 - 0.15 = 1.85 (unique, boosted)   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  # sketch; assumes torch and torch.nn.functional as F are imported     │
│  def contrastive_decode(large_model, small_model, prompt_tokens,       │
│                         alpha=0.5, temperature=1.0,                    │
│                         max_new_tokens=256, eos_id=None):              │
│      tokens = list(prompt_tokens)                                      │
│                                                                          │
│      for _ in range(max_new_tokens):                                   │
│          large_logits = large_model(tokens)[-1]                        │
│          small_logits = small_model(tokens)[-1]                        │
│                                                                          │
│          # Contrastive logits: keep what the large model uniquely     │
│          # prefers, damp what both models find easy                   │
│          logits = large_logits - alpha * small_logits                  │
│                                                                          │
│          # Apply temperature and sample                                │
│          probs = F.softmax(logits / temperature, dim=-1)               │
│          next_token = torch.multinomial(probs, 1).item()               │
│                                                                          │
│          tokens.append(next_token)                                     │
│          if next_token == eos_id:                                      │
│              break                                                     │
│                                                                          │
│      return tokens                                                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BENEFITS:                                                               │
│  ─────────                                                               │
│  ✓ Reduces repetition and generic outputs                             │
│  ✓ Improves factual accuracy (knowledge is "large model unique")     │
│  ✓ Makes outputs more interesting/specific                            │
│                                                                          │
│  DRAWBACKS:                                                              │
│  ──────────                                                              │
│  ✗ Requires running two models (2× compute)                           │
│  ✗ α tuning needed per task                                           │
│  ✗ Can produce overconfident/unusual outputs if α too high           │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  VARIANTS:                                                               │
│  ─────────                                                               │
│                                                                          │
│  DoLa (Decoding by Contrasting Layers):                                │
│  Use early layers vs late layers of SAME model                        │
│  No need for second model!                                             │
│  contrast = late_layer_logits - α × early_layer_logits               │
│                                                                          │
│  CONTEXT DISTILLATION CONTRAST:                                         │
│  Contrast model with vs without instructions                          │
│  Emphasizes instruction-following behavior                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
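For completeness, here is a concrete, runnable counterpart to the sketch in the box, using two GPT-2 sizes as an illustrative large/small pair (the model names and prompt are just examples; any pair that shares a tokenizer works).

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
large = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()
small = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The Eiffel Tower is located in", return_tensors="pt").input_ids
alpha, temperature = 0.5, 0.7

for _ in range(30):
    with torch.no_grad():
        large_logits = large(ids).logits[0, -1]   # last-position logits
        small_logits = small(ids).logits[0, -1]
    logits = large_logits - alpha * small_logits   # contrastive combination
    probs = F.softmax(logits / temperature, dim=-1)
    next_id = torch.multinomial(probs, 1)
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    if next_id.item() == tok.eos_token_id:
        break

print(tok.decode(ids[0], skip_special_tokens=True))

Note that the original contrastive decoding paper also restricts candidates to tokens the large model itself considers plausible, which keeps the subtraction from promoting tokens the large model assigns near-zero probability; that guard is omitted here for brevity.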

Part XII: Practical Recommendations

Choosing the Right Strategy

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    DECODING STRATEGY DECISION GUIDE                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TASK TYPE → RECOMMENDED APPROACH:                                       │
│  ─────────────────────────────────                                       │
│                                                                          │
│  FACTUAL Q&A / MATH / LOGIC:                                            │
│  ────────────────────────────                                            │
│  temperature=0 (greedy)                                                │
│  OR temperature=0.1, top_p=0.9                                         │
│  Why: One correct answer, don't want randomness                       │
│                                                                          │
│  CODE GENERATION:                                                        │
│  ────────────────                                                        │
│  temperature=0.2, top_p=0.95                                           │
│  Why: Low temp for correctness, high top_p for rare but valid tokens  │
│  Consider: Speculative decoding for speed                             │
│                                                                          │
│  STRUCTURED OUTPUT (JSON, SQL):                                         │
│  ──────────────────────────────                                          │
│  temperature=0, constrained decoding (Outlines/Guidance)              │
│  Why: Must be syntactically valid                                      │
│                                                                          │
│  GENERAL ASSISTANT / CHAT:                                               │
│  ─────────────────────────                                               │
│  temperature=0.7, top_p=0.9                                            │
│  repetition_penalty=1.1                                                │
│  Why: Balanced creativity and coherence                                │
│                                                                          │
│  CREATIVE WRITING:                                                       │
│  ─────────────────                                                       │
│  temperature=0.9-1.0, top_p=0.95                                       │
│  Why: Want diversity and surprises                                     │
│                                                                          │
│  BRAINSTORMING / IDEA GENERATION:                                       │
│  ────────────────────────────────                                        │
│  temperature=1.0+, top_p=0.9, top_k=100                               │
│  presence_penalty=0.5 (encourage new topics)                          │
│  Why: Maximize diversity of ideas                                      │
│                                                                          │
│  TRANSLATION / SUMMARIZATION:                                            │
│  ────────────────────────────                                            │
│  beam_search with beam_width=4-5                                       │
│  OR temperature=0.3, top_p=0.9                                         │
│  Why: Clear "best" output, less need for creativity                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PRODUCTION CHECKLIST:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  □ Set max_tokens to reasonable limit (prevent runaway generation)    │
│  □ Add stop sequences if needed (e.g., "\n\nUser:" for chat)         │
│  □ Test with real inputs (parameters interact with content)          │
│  □ Monitor for repetition loops in production                         │
│  □ Consider caching for repeated prompts                              │
│  □ Log parameters with responses for debugging                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ANTI-PATTERNS:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  ✗ temperature=0 + top_p=0.5                                          │
│    (top_p does nothing when temp=0)                                   │
│                                                                          │
│  ✗ Very high temperature + no filtering                               │
│    (outputs become nonsense)                                           │
│                                                                          │
│  ✗ Aggressive top_k AND top_p together                                │
│    (over-constrains, hurts quality)                                   │
│                                                                          │
│  ✗ High repetition_penalty for factual tasks                          │
│    (prevents saying correct things twice)                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
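If you prefer these recommendations in code form, the sketch below packages them as presets. The parameter names deliberately mix HuggingFace-style (top_k, repetition_penalty) and OpenAI-style (presence_penalty, max_tokens, stop) conventions, so map them onto whichever client you actually use; the values come from the guide above and are starting points, not laws.

# Decoding presets mirroring the decision guide above (values are starting points).
DECODING_PRESETS = {
    "factual_qa":    {"temperature": 0.0},
    "code":          {"temperature": 0.2, "top_p": 0.95},
    "structured":    {"temperature": 0.0},   # pair with constrained decoding
    "chat":          {"temperature": 0.7, "top_p": 0.9, "repetition_penalty": 1.1},
    "creative":      {"temperature": 0.95, "top_p": 0.95},
    "brainstorm":    {"temperature": 1.1, "top_p": 0.9, "top_k": 100,
                      "presence_penalty": 0.5},
    "summarization": {"temperature": 0.3, "top_p": 0.9},
}

def generation_kwargs(task: str, max_tokens: int = 512, stop=None) -> dict:
    """Return decoding parameters for a task, plus production guardrails."""
    params = dict(DECODING_PRESETS[task])
    params["max_tokens"] = max_tokens      # cap length to prevent runaway generation
    if stop:
        params["stop"] = stop              # e.g. ["\n\nUser:"] for chat
    return params

# Example: kwargs for a chat completion call
print(generation_kwargs("chat", max_tokens=800, stop=["\n\nUser:"]))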

Summary

Code
┌─────────────────────────────────────────────────────────────────────────┐
│                    KEY TAKEAWAYS                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CORE DECODING STRATEGIES:                                               │
│  ─────────────────────────                                               │
│  • Greedy: Always pick max probability (deterministic, can loop)      │
│  • Beam Search: Track multiple candidates (good for translation)      │
│  • Sampling: Random selection from distribution (creative but noisy)  │
│                                                                          │
│  CONTROLLING RANDOMNESS:                                                 │
│  ───────────────────────                                                 │
│  • Temperature: Scale logits (low=focused, high=random)               │
│  • Top-k: Keep only k most likely tokens                              │
│  • Top-p: Keep tokens summing to probability p (adaptive!)            │
│  • Min-p: Keep tokens ≥ min_p × max_prob                             │
│                                                                          │
│  PREVENTING PROBLEMS:                                                    │
│  ────────────────────                                                    │
│  • Repetition penalty: Discourage repeated tokens                     │
│  • Frequency penalty: Scale penalty by occurrence count               │
│  • N-gram blocking: Prevent exact phrase repetition                   │
│                                                                          │
│  ADVANCED TECHNIQUES:                                                    │
│  ────────────────────                                                    │
│  • Speculative decoding: Draft+verify for 2-3× speedup                │
│  • Constrained generation: Grammar/schema enforcement                  │
│  • Contrastive decoding: Subtract small model to reduce generic text  │
│                                                                          │
│  PRACTICAL DEFAULTS:                                                     │
│  ───────────────────                                                     │
│  • General use: temperature=0.7, top_p=0.9                            │
│  • Factual: temperature=0-0.2                                         │
│  • Creative: temperature=0.9-1.0, top_p=0.95                          │
│  • Code: temperature=0.2, top_p=0.95                                  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
