How many tokens is my text?

Use the model's tokenizer to count. For OpenAI models, use tiktoken. For HuggingFace models, use the model's tokenizer. Rule of thumb: ~4 characters per token for English, ~1.5 characters per token for Chinese/Japanese, ~2 characters for other non-English.

Why not just use characters?

Character-level models need very long sequences (5× more tokens) and struggle to learn word-level patterns. A 2000-word document becomes 10,000 characters, far exceeding context limits. Subword tokenization compresses this to ~2,500 tokens while preserving semantic units.

Can I train my own tokenizer?

Yes, using SentencePiece or HuggingFace tokenizers. Train on your specific domain for better efficiency. But you'll need to train a new model—you can't use a custom tokenizer with existing pre-trained models because token IDs won't match.

Why do models have different vocabulary sizes?

Tradeoff between efficiency (larger vocab = shorter sequences = lower latency/cost) and embedding table size. Modern models trend toward larger vocabularies (100K-256K) because embedding tables are tiny compared to transformer layers. A 128K vocab with 4096-dim embeddings is only 2GB—small for a 70B model.

Does tokenization affect model quality?

Yes, significantly. Poor tokenization can: Fragment words, losing semantic relationships ("unhappiness" → "un", "hap", "pin", "ess"), Inflate token counts for non-English (2-5× more expensive), Split code awkwardly (variable names, indentation), Create inconsistent number representations.

Why does my Chinese/Arabic text use so many tokens?

Most tokenizers are trained on English-heavy data, so English patterns get merged aggressively while non-English characters often remain as individual bytes. Use larger vocabulary models (GPT-4o with o200k) or language-specific models (Qwen for Chinese) for better efficiency.

How do I handle special tokens in prompts?

Special tokens like ` ` or `[SEP]` have special meaning to models. By default, tiktoken encodes them as regular text. Use `allowed_special` to enable them:

What's the difference between cl100k_base and o200k_base?

cl100k_base (100K vocab) is used by GPT-4 and GPT-3.5-turbo. o200k_base (200K vocab) is used by GPT-4o and is ~15% more efficient, especially for non-English and code. GPT-4o costs less partially because of this tokenizer efficiency.

Why do emojis use so many tokens?

Emojis are encoded as UTF-8 bytes, and complex emojis are actually multiple Unicode code points: Avoid heavy emoji use in prompts if cost is a concern. Simple emoji (😀): 1 token, Emoji with skin tone (👋🏽): 3 tokens (base + modifier), Family emoji (👨👩👧): 7+ tokens (multiple people joined with ZWJ).

Can I estimate cost before making an API call?

Yes, count tokens locally first:

How do I debug weird tokenization?

Use the decode method to see individual tokens: This helps identify invisible characters, unexpected splits, and encoding issues.

Should I strip whitespace from prompts?

Be consistent. Leading/trailing whitespace affects tokenization. `" Hello"` and `"Hello"` tokenize differently. Most APIs handle this automatically, but for token counting and caching, normalize your inputs.

What's Fill-in-the-Middle (FIM) tokenization?

FIM uses special tokens (` `, ` `, ` `) to enable code completion in the middle of existing code. Instead of just predicting the next token, the model can fill in gaps. Used by StarCoder, CodeLlama, and DeepSeek-Coder.

How do vision models tokenize images?

Images are split into patches (e.g., 16×16 pixels), each patch becomes a "visual token" via linear projection. A 224×224 image becomes 196 visual tokens. Higher resolution = more tokens (quadratic scaling). Some models use discrete visual vocabularies (VQGAN) for image generation.

What's the future of tokenization?

Trends include: --- **Larger vocabularies** (256K+) for better multilingual/code efficiency, **Byte-level models** that skip tokenization entirely (like ByT5), **Multimodal tokenizers** that handle text, images, audio, and video, **Domain-specific tokenizers** for code, science, and other verticals.

Back to Blog

LLMs NLP Deep Learning Tutorials

Tokenization Deep Dive: BPE, WordPiece, and SentencePiece

Detailed walkthrough of tokenization—how LLMs convert text to numbers. Understand BPE, WordPiece, Unigram, and SentencePiece, and why tokenization matters for model performance.

December 27, 20257 min read

Why Tokenization Matters

Before a language model can process text, it must convert characters into numbers. This conversion—tokenization—is more important than it might seem. Tokenization determines:

Vocabulary size: How many unique tokens the model knows
Sequence length: How many tokens a text becomes (affects compute cost)
Out-of-vocabulary handling: What happens with unknown words
Multilingual ability: How well non-English text is represented
Efficiency: Tokens per character ratio affects cost

2025 vocabulary size trends: The field has seen dramatic vocabulary expansion. Qwen uses 151,646 tokens optimized for both English and Chinese (24,953 Chinese tokens). Llama-3 jumped to 128,256 tokens, up from Llama-2's 32K. Gemma-2 peaks at 256K tokens, while DeepSeek-V3 uses 128K tokens tuned for 128K context windows.

ByteLevel BPE dominates: According to recent analysis, models like GPT-2, GPT-3, Llama-3, Falcon, and Qwen use ByteLevel BPE, which operates at the byte level to ensure any input—regardless of language or special characters—can be tokenized and perfectly reversed. SentencePiece BPE remains popular in Mistral, Llama-2, and Yi.

Why vocabulary size is expanding: Research shows that larger vocabularies improve multilingual capability and reduce sequence length (fewer tokens per text). The tradeoff is embedding table size—151K vocab at 4096 dimensions = 620M parameters just for embeddings.

Poor tokenization can cripple an otherwise good model. Understanding tokenization helps you choose the right tokenizer, understand model limitations, and debug unexpected behaviors.

Part I: Tokenization Fundamentals

What is Tokenization?

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOKENIZATION OVERVIEW                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROCESS:                                                            │
│  ────────────                                                            │
│                                                                          │
│  Text: "Hello, world!"                                                  │
│                                                                          │
│        │                                                                │
│        ▼ Tokenization                                                   │
│                                                                          │
│  Tokens: ["Hello", ",", " world", "!"]                                 │
│                                                                          │
│        │                                                                │
│        ▼ Token → ID mapping                                            │
│                                                                          │
│  IDs: [15496, 11, 995, 0]                                              │
│                                                                          │
│  These IDs are what the model actually processes.                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DIFFERENT GRANULARITIES:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  CHARACTER-LEVEL:                                                        │
│  "Hello" → ["H", "e", "l", "l", "o"] (5 tokens)                       │
│  + Small vocabulary (~100-300 characters)                              │
│  + Handles any text                                                    │
│  - Very long sequences                                                 │
│  - Hard for model to learn word meanings                              │
│                                                                          │
│  WORD-LEVEL:                                                            │
│  "Hello" → ["Hello"] (1 token)                                        │
│  + Semantic units preserved                                            │
│  - Huge vocabulary (100K+ words)                                      │
│  - Can't handle unknown words ("OOV problem")                         │
│                                                                          │
│  SUBWORD-LEVEL (Modern approach):                                       │
│  "unhappiness" → ["un", "happiness"] (2 tokens)                       │
│  + Moderate vocabulary (30K-100K)                                      │
│  + Handles unknown words via subword composition                      │
│  + Captures morphology (prefixes, suffixes)                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

The Out-of-Vocabulary Problem

Word-level tokenization has a fatal flaw:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE OOV PROBLEM                                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WORD-LEVEL TOKENIZATION:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  Training vocabulary: {"the", "cat", "sat", "on", "mat", ...}         │
│  (100,000 most common words)                                           │
│                                                                          │
│  Input: "The ChatGPT model is transformative"                          │
│                                                                          │
│  "ChatGPT" → NOT IN VOCABULARY → [UNK] token                          │
│  "transformative" → NOT IN VOCABULARY → [UNK]                         │
│                                                                          │
│  Result: "The [UNK] model is [UNK]"                                    │
│                                                                          │
│  All meaning from new/rare words is LOST!                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THIS IS SEVERE:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  • Technical terms: "PyTorch", "Kubernetes", "GraphQL"                │
│  • Names: "Enrico", "Anthropic", "OpenAI"                             │
│  • New words: "COVID-19", "NFT", "ChatGPT"                            │
│  • Misspellings: "teh", "recieve"                                     │
│  • Compound words: "unprecedented", "unbelievable"                    │
│  • Other languages: Any non-English word                              │
│                                                                          │
│  Real text is FULL of rare/new words!                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SUBWORD SOLUTION:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  "ChatGPT" → ["Chat", "G", "PT"]                                      │
│  "transformative" → ["transform", "ative"]                            │
│  "unprecedented" → ["un", "pre", "ced", "ented"]                      │
│                                                                          │
│  Unknown words decompose into known subwords.                         │
│  Meaning is partially preserved!                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part II: Byte Pair Encoding (BPE)

The Algorithm

BPE is the most widely used subword tokenization algorithm (GPT-2, GPT-3, GPT-4, LLaMA, etc.):

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    BYTE PAIR ENCODING (BPE)                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRAINING ALGORITHM:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  1. Start with character vocabulary                                    │
│  2. Count all adjacent pairs in corpus                                │
│  3. Merge most frequent pair into new token                           │
│  4. Repeat until desired vocabulary size                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  Training corpus: "low lower lowest"                                   │
│                                                                          │
│  Initial vocabulary: {l, o, w, e, r, s, t, ' '} (characters)          │
│                                                                          │
│  Step 0: Split words into characters                                   │
│  "low" → [l, o, w]                                                     │
│  "lower" → [l, o, w, e, r]                                            │
│  "lowest" → [l, o, w, e, s, t]                                        │
│                                                                          │
│  Step 1: Count pairs                                                    │
│  (l, o): 3    ← MOST FREQUENT                                         │
│  (o, w): 3                                                             │
│  (w, e): 2                                                             │
│  (e, r): 1                                                             │
│  (e, s): 1                                                             │
│  (s, t): 1                                                             │
│                                                                          │
│  Merge (l, o) → "lo"                                                   │
│  "low" → [lo, w]                                                       │
│  "lower" → [lo, w, e, r]                                              │
│  "lowest" → [lo, w, e, s, t]                                          │
│                                                                          │
│  Step 2: Count pairs again                                             │
│  (lo, w): 3   ← MOST FREQUENT                                         │
│  (w, e): 2                                                             │
│  (e, r): 1                                                             │
│  (e, s): 1                                                             │
│  (s, t): 1                                                             │
│                                                                          │
│  Merge (lo, w) → "low"                                                 │
│  "low" → [low]                                                         │
│  "lower" → [low, e, r]                                                │
│  "lowest" → [low, e, s, t]                                            │
│                                                                          │
│  Step 3: Count pairs                                                    │
│  (low, e): 2  ← MOST FREQUENT                                         │
│  (e, r): 1                                                             │
│  (e, s): 1                                                             │
│  (s, t): 1                                                             │
│                                                                          │
│  Merge (low, e) → "lowe"                                               │
│  "low" → [low]                                                         │
│  "lower" → [lowe, r]                                                  │
│  "lowest" → [lowe, s, t]                                              │
│                                                                          │
│  Continue until vocabulary size reached...                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FINAL VOCABULARY (after more merges):                                  │
│  {l, o, w, e, r, s, t, lo, low, lowe, lower, lowest, ...}            │
│                                                                          │
│  Each merge is recorded: [(l,o)→lo, (lo,w)→low, ...]                  │
│  These merges are applied in order during tokenization.               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

BPE Tokenization (Inference)

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    BPE TOKENIZATION                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TOKENIZATION ALGORITHM:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  1. Split input into characters (or bytes)                            │
│  2. Apply learned merges in order                                     │
│  3. Stop when no more merges apply                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  Input: "lowest"                                                        │
│  Merges: [(l,o)→lo, (lo,w)→low, (low,e)→lowe, ...]                   │
│                                                                          │
│  Step 0: [l, o, w, e, s, t]                                           │
│  Apply (l,o)→lo: [lo, w, e, s, t]                                     │
│  Apply (lo,w)→low: [low, e, s, t]                                     │
│  Apply (low,e)→lowe: [lowe, s, t]                                     │
│  No more merges apply to remaining pairs                              │
│                                                                          │
│  Result: ["lowe", "s", "t"]                                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  HANDLING UNKNOWN SUBSTRINGS:                                            │
│  ────────────────────────────                                            │
│                                                                          │
│  Input: "xyzzy" (rare word)                                            │
│                                                                          │
│  If "x", "y", "z" are in vocabulary as individual characters:         │
│  [x, y, z, z, y]                                                       │
│                                                                          │
│  If some characters unknown: Use byte fallback                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BYTE-LEVEL BPE (GPT-2 and later):                                      │
│  ─────────────────────────────────                                       │
│                                                                          │
│  Start with 256 byte tokens instead of characters.                    │
│  Guarantees any input can be tokenized!                               │
│                                                                          │
│  Base vocabulary: {0x00, 0x01, ..., 0xFF}                             │
│  + Learned merges: {Ġthe, Ġand, ing, ...}                             │
│                                                                          │
│  "Ġ" represents a space preceding the token (common in GPT-2)        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

BPE Implementation

Python

from collections import Counter, defaultdict
import re

class BPETokenizer:
    """Simplified BPE tokenizer implementation."""

    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.merges = []  # List of (pair, merged) tuples
        self.vocab = {}   # Token to ID mapping

    def train(self, corpus: str):
        """Train BPE on a corpus."""
        # Tokenize corpus into words, then characters
        words = corpus.split()
        word_freqs = Counter(words)

        # Split words into characters with end-of-word marker
        splits = {
            word: list(word) + ['</w>']
            for word in word_freqs
        }

        # Start with character vocabulary
        vocab = set()
        for word in splits:
            vocab.update(splits[word])

        # Merge until desired vocab size
        while len(vocab) < self.vocab_size:
            # Count all pairs
            pair_freqs = defaultdict(int)
            for word, freq in word_freqs.items():
                symbols = splits[word]
                for i in range(len(symbols) - 1):
                    pair = (symbols[i], symbols[i + 1])
                    pair_freqs[pair] += freq

            if not pair_freqs:
                break

            # Find most frequent pair
            best_pair = max(pair_freqs, key=pair_freqs.get)
            merged = best_pair[0] + best_pair[1]

            # Record merge
            self.merges.append((best_pair, merged))
            vocab.add(merged)

            # Apply merge to all words
            for word in splits:
                symbols = splits[word]
                new_symbols = []
                i = 0
                while i < len(symbols):
                    if (i < len(symbols) - 1 and
                        symbols[i] == best_pair[0] and
                        symbols[i + 1] == best_pair[1]):
                        new_symbols.append(merged)
                        i += 2
                    else:
                        new_symbols.append(symbols[i])
                        i += 1
                splits[word] = new_symbols

        # Create vocab mapping
        self.vocab = {token: i for i, token in enumerate(sorted(vocab))}

    def tokenize(self, text: str) -> list:
        """Tokenize text using learned merges."""
        words = text.split()
        tokens = []

        for word in words:
            # Start with characters
            symbols = list(word) + ['</w>']

            # Apply merges in order
            for (pair, merged) in self.merges:
                new_symbols = []
                i = 0
                while i < len(symbols):
                    if (i < len(symbols) - 1 and
                        symbols[i] == pair[0] and
                        symbols[i + 1] == pair[1]):
                        new_symbols.append(merged)
                        i += 2
                    else:
                        new_symbols.append(symbols[i])
                        i += 1
                symbols = new_symbols

            tokens.extend(symbols)

        return tokens

    def encode(self, text: str) -> list:
        """Convert text to token IDs."""
        tokens = self.tokenize(text)
        return [self.vocab.get(t, self.vocab.get('<unk>', 0)) for t in tokens]

    def decode(self, ids: list) -> str:
        """Convert token IDs back to text."""
        id_to_token = {v: k for k, v in self.vocab.items()}
        tokens = [id_to_token.get(i, '<unk>') for i in ids]
        text = ''.join(tokens)
        text = text.replace('</w>', ' ')
        return text.strip()

Part III: WordPiece

The BERT Tokenizer

WordPiece is used by BERT and similar models. It's similar to BPE but uses a different merge criterion:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    WORDPIECE                                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  KEY DIFFERENCE FROM BPE:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  BPE: Merge most FREQUENT pair                                         │
│  WordPiece: Merge pair that maximizes LIKELIHOOD                       │
│                                                                          │
│  Likelihood score for merging (a, b) → ab:                            │
│                                                                          │
│  score(a, b) = freq(ab) / (freq(a) × freq(b))                         │
│                                                                          │
│  This prefers merges where ab appears more often than                 │
│  expected by chance given individual frequencies.                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  freq("th") = 100                                                       │
│  freq("t") = 500                                                        │
│  freq("h") = 300                                                        │
│  score("t", "h") = 100 / (500 × 300) = 0.00067                        │
│                                                                          │
│  freq("un") = 80                                                        │
│  freq("u") = 100                                                        │
│  freq("n") = 200                                                        │
│  score("u", "n") = 80 / (100 × 200) = 0.004                           │
│                                                                          │
│  "un" has higher score despite lower frequency!                       │
│  It's more "surprising" to see u and n together.                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WORDPIECE TOKENIZATION:                                                 │
│  ───────────────────────                                                 │
│                                                                          │
│  Input: "unhappiness"                                                   │
│  Output: ["un", "##happi", "##ness"]                                  │
│                                                                          │
│  The "##" prefix indicates continuation of a word.                    │
│  First subword has no prefix.                                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOKENIZATION ALGORITHM (greedy):                                        │
│  ─────────────────────────────────                                       │
│                                                                          │
│  1. For each word, find longest prefix in vocabulary                  │
│  2. Add token, continue with remainder                                 │
│  3. If single character not in vocab, use [UNK]                       │
│                                                                          │
│  "unhappiness"                                                          │
│  → "un" is in vocab, remainder "happiness"                            │
│  → "happiness" → longest match "##happi" (maybe), remainder "ness"   │
│  → "##ness" is in vocab                                                │
│  → Result: ["un", "##happi", "##ness"]                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

BPE vs WordPiece

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    BPE VS WORDPIECE                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  MERGE CRITERION:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  BPE: Frequency                                                         │
│  • Merge pairs that appear most often                                  │
│  • Simple and fast                                                      │
│  • May create arbitrary subwords                                       │
│                                                                          │
│  WordPiece: Likelihood                                                  │
│  • Merge pairs that are "surprising" together                         │
│  • Tends to create more linguistically meaningful units               │
│  • Slightly more complex                                               │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOKENIZATION:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  BPE: Apply merges in learned order                                    │
│  • Deterministic: same merges → same tokenization                     │
│  • Can be slow for long sequences                                      │
│                                                                          │
│  WordPiece: Greedy longest-match                                       │
│  • Faster at inference                                                 │
│  • May produce different segmentations                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  USAGE:                                                                  │
│  ──────                                                                  │
│                                                                          │
│  BPE: GPT-2, GPT-3, GPT-4, LLaMA, most modern LLMs                    │
│  WordPiece: BERT, DistilBERT, ELECTRA                                  │
│                                                                          │
│  In practice, both work well. BPE is more common now.                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part IV: Unigram Language Model

A Different Approach

Unigram (used by SentencePiece) takes a fundamentally different approach:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    UNIGRAM LANGUAGE MODEL                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  KEY IDEA:                                                               │
│  ─────────                                                               │
│                                                                          │
│  BPE/WordPiece: Build vocabulary bottom-up (merge characters)          │
│  Unigram: Start with large vocabulary, PRUNE it down                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TRAINING:                                                               │
│  ─────────                                                               │
│                                                                          │
│  1. Start with large vocabulary (all substrings up to length N)       │
│  2. For each token, compute how much removing it hurts likelihood     │
│  3. Remove tokens that hurt least (keep most useful)                  │
│  4. Repeat until desired vocabulary size                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PROBABILISTIC MODEL:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  Each token has a probability P(token).                               │
│  Tokenization probability: P(x₁) × P(x₂) × ... × P(xₙ)               │
│                                                                          │
│  Given text "hello":                                                    │
│  Segmentation 1: ["hello"] → P("hello")                               │
│  Segmentation 2: ["hel", "lo"] → P("hel") × P("lo")                  │
│  Segmentation 3: ["h", "e", "l", "l", "o"] → P("h")×P("e")×...       │
│                                                                          │
│  Choose segmentation with highest probability!                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOKENIZATION (Viterbi algorithm):                                       │
│  ──────────────────────────────────                                      │
│                                                                          │
│  Find most probable segmentation using dynamic programming.           │
│                                                                          │
│  For each position, store best path to reach it.                      │
│  Time complexity: O(n × max_token_length)                             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ADVANTAGES:                                                             │
│  ───────────                                                             │
│                                                                          │
│  • Probabilistic framework (can sample different tokenizations)       │
│  • Often produces more linguistically meaningful tokens               │
│  • Used by: T5, mT5, ALBERT, XLNet (via SentencePiece)               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part V: SentencePiece

Language-Agnostic Tokenization

SentencePiece is a library that implements both BPE and Unigram, with a key innovation: it treats text as a raw byte stream, making it language-agnostic:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    SENTENCEPIECE                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  KEY INNOVATIONS:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  1. LANGUAGE-AGNOSTIC                                                   │
│     No pre-tokenization (no word splitting)                           │
│     Treats text as raw character/byte sequence                        │
│     Works for any language, including Chinese, Japanese               │
│                                                                          │
│  2. WHITESPACE AS TOKEN                                                 │
│     Whitespace is just another character                              │
│     Can be part of tokens: "▁the" (▁ = space)                        │
│     Enables lossless reconstruction                                    │
│                                                                          │
│  3. SUPPORTS MULTIPLE ALGORITHMS                                        │
│     BPE mode                                                           │
│     Unigram mode (default, often better)                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOKENIZATION EXAMPLE:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Input: "Hello World"                                                   │
│                                                                          │
│  Traditional (GPT-2 style):                                            │
│  Pre-tokenize on spaces: ["Hello", "World"]                           │
│  Then BPE: ["Hello", "ĠWorld"]  (Ġ indicates preceding space)        │
│                                                                          │
│  SentencePiece:                                                         │
│  Raw input: "Hello World"                                              │
│  Tokenize: ["▁Hello", "▁World"]  (▁ is the space character)          │
│                                                                          │
│  Key difference: SentencePiece puts space at START of word.           │
│  This works better for languages without spaces (Chinese, Japanese).  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  USAGE:                                                                  │
│  ──────                                                                  │
│                                                                          │
│  import sentencepiece as spm                                           │
│                                                                          │
│  # Train                                                                │
│  spm.SentencePieceTrainer.train(                                       │
│      input='corpus.txt',                                               │
│      model_prefix='mymodel',                                           │
│      vocab_size=32000,                                                 │
│      model_type='unigram',  # or 'bpe'                                │
│  )                                                                      │
│                                                                          │
│  # Load and use                                                        │
│  sp = spm.SentencePieceProcessor()                                    │
│  sp.load('mymodel.model')                                              │
│                                                                          │
│  tokens = sp.encode_as_pieces('Hello World')                          │
│  # ['▁Hello', '▁World']                                               │
│                                                                          │
│  ids = sp.encode_as_ids('Hello World')                                │
│  # [1234, 5678]                                                        │
│                                                                          │
│  text = sp.decode_ids([1234, 5678])                                   │
│  # 'Hello World'                                                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Which Tokenizer Do Models Use?

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOKENIZERS BY MODEL                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  MODEL               TOKENIZER         VOCAB SIZE                       │
│  ─────────────────────────────────────────────────────────────────────  │
│  GPT-2               BPE (byte-level)  50,257                           │
│  GPT-3               BPE (byte-level)  50,257                           │
│  GPT-4               BPE (cl100k)      100,256                          │
│                                                                          │
│  BERT                WordPiece         30,522                           │
│  RoBERTa             BPE               50,265                           │
│                                                                          │
│  T5                  SentencePiece     32,000                           │
│  LLaMA               SentencePiece     32,000                           │
│  LLaMA 2             SentencePiece     32,000                           │
│  LLaMA 3/4           BPE (tiktoken)    128,256                          │
│                                                                          │
│  Mistral             SentencePiece     32,000                           │
│  Mixtral             SentencePiece     32,000                           │
│                                                                          │
│  Qwen3 (2025)        BBPE              151,669  (119 languages)         │
│  Kimi K2 (2025)      BPE               ~128,000                         │
│  Claude              BPE variant       ~100,000                         │
│  Gemini 2            SentencePiece     ~256,000                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  VOCABULARY SIZE TRENDS (2025):                                          │
│  ─────────────────────────────                                           │
│                                                                          │
│  2020: 32K-50K typical (LLaMA, Mistral)                               │
│  2024: 100K-128K becoming standard (Llama 3, GPT-4)                   │
│  2025: 150K-256K for multilingual (Qwen3, Gemini 2)                   │
│                                                                          │
│  Larger vocabulary:                                                     │
│  + More tokens as single units (15% fewer tokens in Llama 3 vs 2)    │
│  + Shorter sequences (faster, cheaper inference)                      │
│  + Better multilingual (Qwen3: 119 languages)                        │
│  - Larger embedding matrix (but tiny vs model weights)               │
│                                                                          │
│  Qwen3 note: Large vocab can cause slower generation in some         │
│  languages (Hindi, Italian) due to less efficient tokenization.      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part VI: Tokenization Quirks and Issues

Common Problems

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOKENIZATION ISSUES                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. INCONSISTENT NUMBERS:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  "123" → ["123"]     (1 token)                                        │
│  "1234" → ["12", "34"]   (2 tokens)                                   │
│  "12345" → ["123", "45"]   (2 tokens)                                 │
│                                                                          │
│  Different numbers tokenize differently!                               │
│  This can hurt arithmetic performance.                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2. WHITESPACE SENSITIVITY:                                              │
│  ──────────────────────────                                              │
│                                                                          │
│  "Hello" → ["Hello"]                                                   │
│  " Hello" → ["Ġ", "Hello"] or ["ĠHello"]                              │
│  "  Hello" → ["Ġ", "Ġ", "Hello"]                                      │
│                                                                          │
│  Leading spaces change tokenization!                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  3. NON-ENGLISH INEFFICIENCY:                                            │
│  ────────────────────────────                                            │
│                                                                          │
│  English: "Hello" → 1-2 tokens                                        │
│  Chinese: "你好" → 2-4 tokens (each character separate)               │
│  Arabic: "مرحبا" → 4-8 tokens                                          │
│                                                                          │
│  Tokenizers trained on mostly English are inefficient for others.    │
│  Non-English text uses more tokens → higher cost, shorter context.   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  4. CODE TOKENIZATION:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  "def function():" → ["def", "Ġfunction", "():", ...]                │
│  "    return x" → ["Ġ", "Ġ", "Ġ", "Ġreturn", "Ġx"]                  │
│                                                                          │
│  Indentation creates many tokens!                                     │
│  Some tokenizers have special handling for spaces.                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  5. CONTEXT LENGTH IMPLICATIONS:                                         │
│  ───────────────────────────────                                         │
│                                                                          │
│  "4096 token context" means different things:                         │
│  • ~3000 English words                                                │
│  • ~1500 Chinese characters                                           │
│  • ~1000 lines of code                                                │
│                                                                          │
│  Token count ≠ word count ≠ character count                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Tokenization Best Practices

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    BEST PRACTICES                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. USE THE MODEL'S TOKENIZER                                           │
│     Always use the exact tokenizer the model was trained with.        │
│     Different tokenizers → different token IDs → garbage output.      │
│                                                                          │
│  2. COUNT TOKENS, NOT WORDS                                             │
│     When checking context limits, count tokens.                       │
│     Most APIs provide token counting methods.                         │
│                                                                          │
│     # OpenAI                                                           │
│     import tiktoken                                                    │
│     enc = tiktoken.encoding_for_model("gpt-4")                        │
│     tokens = enc.encode("Hello world")                                │
│     print(len(tokens))                                                 │
│                                                                          │
│  3. HANDLE SPECIAL TOKENS                                               │
│     Be aware of special tokens: [CLS], [SEP], <|endoftext|>, etc.    │
│     They affect sequence length and model behavior.                   │
│                                                                          │
│  4. TEST NON-ENGLISH AND CODE                                           │
│     If your application uses non-English text or code,               │
│     test tokenization efficiency and quality.                         │
│                                                                          │
│  5. CONSIDER TOKENIZATION IN PROMPTS                                    │
│     Token boundaries affect model behavior.                           │
│     " hello" vs "hello" may produce different outputs.               │
│     Be consistent with spacing.                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part VII: Recent Innovations (2024-2025)

Llama 3's Tokenizer Switch

Meta made a significant change with Llama 3, switching from SentencePiece to a tiktoken-style BPE tokenizer:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    LLAMA 3 TOKENIZER CHANGES                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  LLAMA 1/2 vs LLAMA 3:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  LLaMA 1/2:                                                              │
│  • SentencePiece with Unigram model                                     │
│  • 32,000 vocabulary size                                               │
│  • Works well but limited for code and non-English                      │
│                                                                          │
│  LLaMA 3:                                                                │
│  • tiktoken-style BPE                                                   │
│  • 128,256 vocabulary size (4× larger!)                                │
│  • Better efficiency across languages and code                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THE SWITCH?                                                         │
│  ───────────────                                                         │
│                                                                          │
│  1. EFFICIENCY:                                                          │
│     Same text → fewer tokens → faster inference                        │
│     "Hello world!" in Llama 2: ~4 tokens                               │
│     "Hello world!" in Llama 3: ~3 tokens                               │
│     15-25% reduction in token count on average                         │
│                                                                          │
│  2. CODE HANDLING:                                                       │
│     Larger vocab includes more programming patterns                    │
│     "def function():" tokenizes more efficiently                       │
│     Better whitespace/indentation handling                             │
│                                                                          │
│  3. MULTILINGUAL:                                                        │
│     More dedicated tokens for non-English scripts                      │
│     Chinese, Japanese, Korean much more efficient                      │
│     Still English-optimized but better balance                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  COMPATIBILITY NOTE:                                                     │
│  ───────────────────                                                     │
│                                                                          │
│  Llama 3's tokenizer is NOT compatible with Llama 2!                   │
│  Different tokenizers = different token IDs.                           │
│  Must use matching tokenizer for each model.                           │
│                                                                          │
│  # Llama 3                                                              │
│  from transformers import AutoTokenizer                                 │
│  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")   │
│                                                                          │
│  # Llama 2 (different!)                                                 │
│  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Modern Vocabulary Size Trends

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    VOCABULARY SIZE EVOLUTION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Model           Year    Vocab Size    Notes                            │
│  ─────────────────────────────────────────────────────────────────────  │
│  GPT-2           2019    50,257        Baseline modern BPE              │
│  GPT-3           2020    50,257        Same as GPT-2                    │
│  LLaMA 1         2023    32,000        SentencePiece                    │
│  LLaMA 2         2023    32,000        SentencePiece                    │
│  GPT-4           2023    100,256       cl100k_base                      │
│  Mistral         2023    32,000        SentencePiece                    │
│  LLaMA 3         2024    128,256       tiktoken-style                   │
│  Gemma           2024    256,000       Large multilingual vocab         │
│  Gemini 1.5      2024    ~256,000      Large for multimodal             │
│  Claude 3        2024    ~100,000      Estimated                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TREND: Vocabularies getting larger                                     │
│  ─────────────────────────────────                                       │
│                                                                          │
│  Benefits of larger vocabulary:                                         │
│  • More words/phrases as single tokens                                 │
│  • Shorter sequences → lower latency                                   │
│  • Better multilingual coverage                                        │
│  • Better code handling                                                │
│                                                                          │
│  Cost: Larger embedding table                                          │
│  • 128K vocab × 4096 dim × 4 bytes = 2GB                             │
│  • Tiny compared to 8B+ model weights                                  │
│                                                                          │
│  Verdict: Worth it for efficiency gains.                               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Special Tokens in Modern LLMs

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    SPECIAL TOKENS (2024-2025)                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CHAT/INSTRUCTION TOKENS:                                                │
│  ─────────────────────────                                               │
│                                                                          │
│  LLaMA 3 Chat:                                                          │
│  <|begin_of_text|>                                                      │
│  <|start_header_id|>system<|end_header_id|>                            │
│  You are a helpful assistant.                                           │
│  <|eot_id|>                                                             │
│  <|start_header_id|>user<|end_header_id|>                              │
│  Hello!                                                                  │
│  <|eot_id|>                                                             │
│  <|start_header_id|>assistant<|end_header_id|>                         │
│                                                                          │
│  Mistral/Mixtral:                                                       │
│  [INST] User message [/INST] Assistant response                        │
│                                                                          │
│  ChatML (many models):                                                  │
│  <|im_start|>system\n...<|im_end|>                                     │
│  <|im_start|>user\n...<|im_end|>                                       │
│  <|im_start|>assistant\n...<|im_end|>                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOOL/FUNCTION CALLING TOKENS:                                          │
│  ──────────────────────────────                                          │
│                                                                          │
│  Many models now have dedicated tokens for:                            │
│  • <tool_call> ... </tool_call>                                        │
│  • <function> ... </function>                                          │
│  • <|python_tag|> (for code execution)                                 │
│                                                                          │
│  These help models distinguish structured output from free text.       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MULTIMODAL TOKENS:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  Vision-language models add:                                            │
│  • <image> or <|image|> for image placeholders                        │
│  • <video> for video                                                   │
│  • Special tokens for image patch positions                           │
│                                                                          │
│  LLaVA: <image> expands to 576 visual tokens                          │
│  Qwen-VL: Uses <img>...</img> tags                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Tokenization for Code

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    CODE-OPTIMIZED TOKENIZATION                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CHALLENGES WITH CODE:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  1. Indentation creates many tokens                                    │
│  2. Variable names often split awkwardly                               │
│  3. Special characters less efficiently encoded                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  STARCODER / CODELLAMA IMPROVEMENTS:                                    │
│  ────────────────────────────────────                                    │
│                                                                          │
│  • Include common code patterns in vocabulary                          │
│  • "def ", "return ", "import " as single tokens                      │
│  • Better handling of indentation (tabs vs spaces)                    │
│  • Repository-aware tokenization (file paths, imports)                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FILL-IN-THE-MIDDLE (FIM) TOKENS:                                       │
│  ─────────────────────────────────                                       │
│                                                                          │
│  Special tokens for code completion:                                   │
│                                                                          │
│  <fim_prefix>def hello():                                              │
│  <fim_suffix>                                                          │
│      return result                                                      │
│  <fim_middle>                                                           │
│                                                                          │
│  Model fills in the middle given prefix and suffix.                   │
│  Used by: StarCoder, CodeLlama, DeepSeek-Coder                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  TOKEN EFFICIENCY (Python, approximate):                                │
│  ────────────────────────────────────────                                │
│                                                                          │
│  Model              Tokens per 1K chars    Relative                    │
│  ─────────────────────────────────────────────────────────────         │
│  GPT-3.5            400                    1.0×                        │
│  GPT-4              350                    0.88×                       │
│  CodeLlama          320                    0.80×                       │
│  LLaMA 3            300                    0.75×                       │
│  DeepSeek-Coder     290                    0.73×                       │
│                                                                          │
│  Code-specialized models are ~25% more efficient on code.             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Updated Tokenizer Recommendations

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOKENIZER SELECTION (2025)                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FOR TRAINING NEW MODELS:                                                │
│  ─────────────────────────                                               │
│                                                                          │
│  English-focused:                                                       │
│  • Use tiktoken-style BPE with 100K+ vocabulary                       │
│  • Follow Llama 3's approach                                           │
│                                                                          │
│  Multilingual:                                                          │
│  • Use SentencePiece with large vocabulary (256K+)                    │
│  • Ensure balanced language coverage in training data                  │
│                                                                          │
│  Code-specialized:                                                      │
│  • Include code patterns in vocabulary training data                  │
│  • Add FIM special tokens                                              │
│  • Consider separate code tokenizer                                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  FOR USING EXISTING MODELS:                                              │
│  ───────────────────────────                                             │
│                                                                          │
│  • ALWAYS use the model's exact tokenizer                             │
│  • Never mix tokenizers between models                                 │
│  • Use the tokenizer's encode/decode methods, not string manipulation │
│                                                                          │
│  # Correct                                                              │
│  from transformers import AutoTokenizer                                 │
│  tokenizer = AutoTokenizer.from_pretrained("model-name")              │
│  tokens = tokenizer.encode(text)                                       │
│                                                                          │
│  # Wrong (don't do this!)                                              │
│  tokens = text.split()  # Not tokenization!                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part VIII: tiktoken Deep Dive

OpenAI's Tokenization Library

tiktoken is OpenAI's fast BPE tokenizer implementation, written in Rust with Python bindings. It's used by all OpenAI models and has become the de facto standard for high-performance tokenization:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TIKTOKEN ARCHITECTURE                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  WHY TIKTOKEN?                                                           │
│  ─────────────                                                           │
│                                                                          │
│  Performance comparison (encoding 1M tokens):                           │
│                                                                          │
│  Library              Time        Relative                              │
│  ─────────────────────────────────────────                              │
│  tiktoken (Rust)      0.8s        1.0×                                  │
│  HF tokenizers        1.2s        1.5×                                  │
│  SentencePiece        3.5s        4.4×                                  │
│  Python BPE           45s         56×                                   │
│                                                                          │
│  tiktoken is 50× faster than pure Python implementations!              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CORE ARCHITECTURE:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      tiktoken                                    │   │
│  ├─────────────────────────────────────────────────────────────────┤   │
│  │  Python API (tiktoken package)                                  │   │
│  │       │                                                         │   │
│  │       ▼                                                         │   │
│  │  Rust Core (tiktoken-rs)                                        │   │
│  │  • Regex-based pre-tokenization                                │   │
│  │  • BPE merge application                                       │   │
│  │  • Byte-level encoding                                         │   │
│  │       │                                                         │   │
│  │       ▼                                                         │   │
│  │  Encoding Files (.tiktoken)                                    │   │
│  │  • Vocabulary mappings                                         │   │
│  │  • Merge rules                                                 │   │
│  │  • Special tokens                                              │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

tiktoken Encodings

tiktoken provides several pre-built encodings for different model families:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TIKTOKEN ENCODINGS                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ENCODING          VOCAB SIZE   MODELS                                  │
│  ─────────────────────────────────────────────────────────────────────  │
│  gpt2              50,257       GPT-2                                   │
│  r50k_base         50,257       text-davinci-002, code-davinci-002     │
│  p50k_base         50,281       text-davinci-003, code-davinci-002     │
│  p50k_edit         50,281       text-davinci-edit-001                  │
│  cl100k_base       100,256      GPT-3.5-turbo, GPT-4, text-embedding  │
│  o200k_base        200,019      GPT-4o, GPT-4o-mini                    │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EVOLUTION:                                                              │
│  ──────────                                                              │
│                                                                          │
│  gpt2 (2019)                                                            │
│  • Original GPT-2 tokenizer                                            │
│  • 50K vocabulary                                                       │
│  • Basic byte-level BPE                                                │
│                                                                          │
│  cl100k_base (2022)                                                     │
│  • Doubled vocabulary to 100K                                          │
│  • Better multilingual support                                         │
│  • Improved code handling                                              │
│  • Used by GPT-4                                                       │
│                                                                          │
│  o200k_base (2024)                                                      │
│  • Doubled again to 200K                                               │
│  • Optimized for GPT-4o efficiency                                    │
│  • ~10-15% fewer tokens for typical text                              │
│  • Much better non-English efficiency                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Using tiktoken

Python

import tiktoken

# Get encoding for a specific model
enc = tiktoken.encoding_for_model("gpt-4")
# Or by encoding name
enc = tiktoken.get_encoding("cl100k_base")

# Basic encoding/decoding
text = "Hello, world! How are you?"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
# Text: Hello, world! How are you?
# Tokens: [9906, 11, 1917, 0, 2650, 527, 499, 30]
# Token count: 8

# Decode back to text
decoded = enc.decode(tokens)
print(f"Decoded: {decoded}")
# Decoded: Hello, world! How are you?

# Decode individual tokens (useful for debugging)
for token_id in tokens:
    token_bytes = enc.decode_single_token_bytes(token_id)
    print(f"  {token_id} -> {token_bytes} -> {token_bytes.decode('utf-8', errors='replace')}")
# 9906 -> b'Hello' -> Hello
# 11 -> b',' -> ,
# 1917 -> b' world' -> world
# ...

tiktoken with Special Tokens

Python

import tiktoken

# cl100k_base special tokens
enc = tiktoken.get_encoding("cl100k_base")

# Default special tokens are NOT encoded
text_with_special = "Hello <|endoftext|> World"
tokens = enc.encode(text_with_special)
# Encodes "<|endoftext|>" as regular text tokens!

# To handle special tokens, use allowed_special or disallowed_special
tokens = enc.encode(
    text_with_special,
    allowed_special={"<|endoftext|>"}
)
# Now <|endoftext|> becomes token 100257

# Allow all special tokens
tokens = enc.encode(
    text_with_special,
    allowed_special="all"
)

# Disallow specific patterns (raises error if found)
try:
    tokens = enc.encode(
        text_with_special,
        disallowed_special={"<|endoftext|>"}
    )
except ValueError as e:
    print(f"Error: {e}")

Creating Custom tiktoken Encodings

Python

import tiktoken
from tiktoken import Encoding

# Create a custom encoding with additional special tokens
cl100k = tiktoken.get_encoding("cl100k_base")

# Add custom special tokens for your application
custom_special_tokens = {
    "<|system|>": 100264,
    "<|user|>": 100265,
    "<|assistant|>": 100266,
    "<|tool_call|>": 100267,
    "<|tool_result|>": 100268,
}

# Merge with existing special tokens
all_special_tokens = {
    **cl100k._special_tokens,
    **custom_special_tokens
}

# Create new encoding
custom_enc = Encoding(
    name="custom_cl100k",
    pat_str=cl100k._pat_str,  # Same regex pattern
    mergeable_ranks=cl100k._mergeable_ranks,  # Same BPE merges
    special_tokens=all_special_tokens
)

# Use custom encoding
text = "<|system|>You are helpful.<|user|>Hello!"
tokens = custom_enc.encode(text, allowed_special="all")
print(tokens)
# [100264, 2675, 527, 11190, 13, 100265, 9906, 0]

tiktoken Regex Patterns

The key innovation in tiktoken is its regex-based pre-tokenization:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TIKTOKEN PRE-TOKENIZATION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PRE-TOKENIZATION REGEX (cl100k_base):                                  │
│  ─────────────────────────────────────                                   │
│                                                                          │
│  The regex splits text into chunks BEFORE BPE:                         │
│                                                                          │
│  (?i:'s|'t|'re|'ve|'m|'ll|'d)                                          │
│  |[^\r\n\p{L}\p{N}]?\p{L}+                                              │
│  |\p{N}{1,3}                                                             │
│  | ?[^\s\p{L}\p{N}]+[\r\n]*                                             │
│  |\s*[\r\n]+                                                             │
│  |\s+(?!\S)                                                              │
│  |\s+                                                                    │
│                                                                          │
│  WHAT THIS DOES:                                                         │
│  ────────────────                                                        │
│                                                                          │
│  Pattern                          Matches                               │
│  ─────────────────────────────────────────────────────────────────────  │
│  (?i:'s|'t|'re|'ve|'m|'ll|'d)   Contractions: "don't" → "don", "'t"   │
│  [^\r\n\p{L}\p{N}]?\p{L}+        Words with optional prefix            │
│  \p{N}{1,3}                       Numbers in groups of 1-3 digits      │
│  ?[^\s\p{L}\p{N}]+[\r\n]*        Punctuation and symbols               │
│  \s*[\r\n]+                       Newlines (with leading space)        │
│  \s+(?!\S)                        Trailing whitespace                  │
│  \s+                              Other whitespace                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE:                                                                │
│  ────────                                                                │
│                                                                          │
│  Input: "Hello, world! I can't wait for 2024."                         │
│                                                                          │
│  Pre-tokenization splits:                                               │
│  ["Hello", ",", " world", "!", " I", " can", "'t", " wait",            │
│   " for", " 202", "4", "."]                                            │
│                                                                          │
│  Then BPE is applied to each chunk independently.                      │
│  This prevents merges across word boundaries.                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY PRE-TOKENIZE?                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  1. Prevents weird merges: " the" won't merge with "n" to form "then" │
│  2. Consistent handling of contractions                                │
│  3. Numbers split into manageable chunks (202, 4 not 2024)            │
│  4. Improves tokenization quality and consistency                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

tiktoken Performance Optimization

Python

import tiktoken
import time

enc = tiktoken.get_encoding("cl100k_base")

# Batch encoding for better performance
texts = ["Hello world"] * 10000

# Method 1: Loop (slower)
start = time.time()
tokens_list = [enc.encode(text) for text in texts]
print(f"Loop: {time.time() - start:.3f}s")

# Method 2: encode_batch (faster, uses parallelism)
start = time.time()
tokens_list = enc.encode_batch(texts)
print(f"Batch: {time.time() - start:.3f}s")

# Method 3: encode_batch with num_threads
start = time.time()
tokens_list = enc.encode_batch(texts, num_threads=8)
print(f"Batch (8 threads): {time.time() - start:.3f}s")

# Typical results:
# Loop: 0.45s
# Batch: 0.12s
# Batch (8 threads): 0.04s

Part IX: HuggingFace Tokenizers

The Fast Tokenizers Library

HuggingFace's tokenizers library provides a unified, high-performance interface for all tokenization algorithms:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    HUGGINGFACE TOKENIZERS                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ARCHITECTURE:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                 Tokenization Pipeline                            │   │
│  ├─────────────────────────────────────────────────────────────────┤   │
│  │                                                                  │   │
│  │  Input Text                                                     │   │
│  │       │                                                         │   │
│  │       ▼                                                         │   │
│  │  ┌─────────────┐                                                │   │
│  │  │ Normalizer  │  Unicode normalization, lowercase, etc.       │   │
│  │  └─────────────┘                                                │   │
│  │       │                                                         │   │
│  │       ▼                                                         │   │
│  │  ┌─────────────┐                                                │   │
│  │  │PreTokenizer │  Split into words/chunks                      │   │
│  │  └─────────────┘                                                │   │
│  │       │                                                         │   │
│  │       ▼                                                         │   │
│  │  ┌─────────────┐                                                │   │
│  │  │   Model     │  BPE / WordPiece / Unigram                    │   │
│  │  └─────────────┘                                                │   │
│  │       │                                                         │   │
│  │       ▼                                                         │   │
│  │  ┌─────────────┐                                                │   │
│  │  │PostProcessor│  Add special tokens ([CLS], [SEP])            │   │
│  │  └─────────────┘                                                │   │
│  │       │                                                         │   │
│  │       ▼                                                         │   │
│  │  ┌─────────────┐                                                │   │
│  │  │  Decoder    │  Convert tokens back to string                │   │
│  │  └─────────────┘                                                │   │
│  │       │                                                         │   │
│  │       ▼                                                         │   │
│  │  Output Encoding                                                │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  KEY FEATURES:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  • Written in Rust for speed                                           │
│  • Unified API for all algorithms (BPE, WordPiece, Unigram)           │
│  • Training from scratch                                               │
│  • Customizable pipeline components                                    │
│  • Offset tracking (character positions)                               │
│  • Batched encoding with parallelism                                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Building a Tokenizer from Scratch

Python

from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece, Unigram
from tokenizers.trainers import BpeTrainer, WordPieceTrainer, UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace, ByteLevel
from tokenizers.normalizers import NFD, Lowercase, StripAccents, Sequence
from tokenizers.processors import TemplateProcessing

# ═══════════════════════════════════════════════════════════════════════
# OPTION 1: BPE Tokenizer (like GPT-2)
# ═══════════════════════════════════════════════════════════════════════

# Initialize with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Set up the pipeline
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=True
)

# Train on files
files = ["wiki.txt", "books.txt"]
tokenizer.train(files, trainer)

# Or train on iterator
def batch_iterator(dataset):
    for i in range(0, len(dataset), 1000):
        yield dataset[i:i+1000]["text"]

tokenizer.train_from_iterator(batch_iterator(dataset), trainer)

# Save
tokenizer.save("my-bpe-tokenizer.json")

# ═══════════════════════════════════════════════════════════════════════
# OPTION 2: WordPiece Tokenizer (like BERT)
# ═══════════════════════════════════════════════════════════════════════

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)

tokenizer.train(files, trainer)

# ═══════════════════════════════════════════════════════════════════════
# OPTION 3: Unigram Tokenizer (like T5/LLaMA)
# ═══════════════════════════════════════════════════════════════════════

tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = Sequence([NFD()])
tokenizer.pre_tokenizer = Whitespace()

trainer = UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<pad>", "<eos>", "<unk>"],
    unk_token="<unk>"
)

tokenizer.train(files, trainer)

Pre-Tokenizers in Detail

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    PRE-TOKENIZERS                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PRE-TOKENIZER        DESCRIPTION              EXAMPLE                  │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  Whitespace           Split on whitespace      "Hello world"            │
│                                                → ["Hello", "world"]     │
│                                                                          │
│  WhitespaceSplit      Split, keep whitespace   "Hello  world"           │
│                                                → ["Hello", "  ", "world"]│
│                                                                          │
│  Punctuation          Split on punctuation     "Hello, world!"          │
│                                                → ["Hello", ",", "world", │
│                                                   "!"]                   │
│                                                                          │
│  ByteLevel            Convert to bytes         "Hello" → bytes          │
│                       (GPT-2 style)            Ġ prefix for spaces      │
│                                                                          │
│  Metaspace            SentencePiece-like       " Hello" → "▁Hello"      │
│                       space handling                                    │
│                                                                          │
│  CharDelimiterSplit   Split on specific char   "a|b|c" (delim="|")     │
│                                                → ["a", "b", "c"]        │
│                                                                          │
│  Digits               Split digits             "test123" → ["test",     │
│                                                            "1", "2", "3"]│
│                                                                          │
│  Split                Regex-based split        Custom patterns          │
│                                                                          │
│  Sequence             Chain multiple           Combine any of above     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE COMBINATIONS:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  # GPT-2 style (byte-level)                                             │
│  pre_tokenizer = ByteLevel(add_prefix_space=True)                       │
│                                                                          │
│  # BERT style (whitespace + punctuation)                                │
│  pre_tokenizer = Sequence([Whitespace(), Punctuation()])                │
│                                                                          │
│  # SentencePiece style                                                  │
│  pre_tokenizer = Metaspace(replacement="▁", add_prefix_space=True)     │
│                                                                          │
│  # Code-aware (split on digits and punctuation)                        │
│  pre_tokenizer = Sequence([                                             │
│      Whitespace(),                                                      │
│      Punctuation(),                                                     │
│      Digits(individual_digits=True)                                    │
│  ])                                                                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Post-Processors for Special Tokens

Python

from tokenizers.processors import TemplateProcessing, BertProcessing

# BERT-style: [CLS] ... [SEP] for single, [CLS] ... [SEP] ... [SEP] for pairs
tokenizer.post_processor = BertProcessing(
    sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
    cls=("[CLS]", tokenizer.token_to_id("[CLS]"))
)

# Custom template
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]"))
    ]
)

# LLaMA-style (no special tokens added by default)
# Just encode as-is, special tokens handled in chat template

Encoding with Offset Tracking

Python

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode with offset tracking
encoding = tokenizer.encode("Hello, world!")

# Get tokens
print(encoding.tokens)
# ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']

# Get token IDs
print(encoding.ids)
# [101, 7592, 1010, 2088, 999, 102]

# Get attention mask
print(encoding.attention_mask)
# [1, 1, 1, 1, 1, 1]

# Get offsets (character positions in original text)
print(encoding.offsets)
# [(0, 0), (0, 5), (5, 6), (7, 12), (12, 13), (0, 0)]
#  [CLS]    Hello   ,       world    !        [SEP]

# Use offsets to map back to original text
text = "Hello, world!"
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    if start != end:  # Skip special tokens
        print(f"{token}: '{text[start:end]}'")
# hello: 'Hello'
# ,: ','
# world: 'world'
# !: '!'

Batch Encoding with Padding and Truncation

Python

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Enable padding and truncation
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=128  # Pad to fixed length, or None for dynamic
)
tokenizer.enable_truncation(max_length=128)

# Batch encode
texts = [
    "Short text",
    "This is a much longer text that will be truncated if necessary",
    "Medium length"
]

encodings = tokenizer.encode_batch(texts)

for enc in encodings:
    print(f"Length: {len(enc.ids)}, Tokens: {enc.tokens[:5]}...")
# Length: 128, Tokens: ['[CLS]', 'short', 'text', '[SEP]', '[PAD]']...
# Length: 128, Tokens: ['[CLS]', 'this', 'is', 'a', 'much']...
# Length: 128, Tokens: ['[CLS]', 'medium', 'length', '[SEP]', '[PAD]']...

Part X: Unicode and Byte-Level BPE

The Unicode Challenge

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    UNICODE TOKENIZATION CHALLENGES                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM:                                                            │
│  ────────────                                                            │
│                                                                          │
│  Unicode has 150,000+ characters across scripts:                        │
│  • Latin: A-Z, a-z, àáâãäå...                                          │
│  • CJK: 你好世界 (Chinese), こんにちは (Japanese), 안녕 (Korean)           │
│  • Arabic: مرحبا                                                         │
│  • Cyrillic: Привет                                                     │
│  • Emojis: 😀🎉🚀💻                                                      │
│  • Mathematical: ∑∫∂∇                                                   │
│                                                                          │
│  Character-level vocabulary would be HUGE!                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION: BYTE-LEVEL BPE                                                │
│  ─────────────────────────                                               │
│                                                                          │
│  Convert everything to UTF-8 bytes first:                               │
│  • Only 256 possible byte values (0x00-0xFF)                           │
│  • Any character can be represented                                    │
│  • Base vocabulary is just 256 tokens!                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  UTF-8 ENCODING:                                                         │
│  ───────────────                                                         │
│                                                                          │
│  Character    UTF-8 Bytes            Num Bytes                          │
│  ─────────────────────────────────────────────────────────────────────  │
│  'A'          0x41                   1                                  │
│  'ñ'          0xC3 0xB1              2                                  │
│  '中'         0xE4 0xB8 0xAD         3                                  │
│  '😀'         0xF0 0x9F 0x98 0x80    4                                  │
│                                                                          │
│  ASCII (0-127): 1 byte                                                  │
│  Latin extended: 2 bytes                                                │
│  CJK: 3 bytes                                                           │
│  Emojis: 4 bytes                                                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

GPT-2's Byte-to-Character Mapping

GPT-2 introduced a clever trick to make byte sequences readable:

Python

# GPT-2's byte-to-character mapping
# Maps bytes 0-255 to printable Unicode characters

def bytes_to_unicode():
    """
    Returns a mapping from bytes to Unicode characters.
    Avoids control characters and whitespace issues.
    """
    # Printable ASCII characters
    bs = list(range(ord("!"), ord("~")+1))  # 33-126
    bs += list(range(ord("¡"), ord("¬")+1))  # 161-172
    bs += list(range(ord("®"), ord("ÿ")+1))  # 174-255

    cs = bs[:]
    n = 0
    # Map remaining bytes (0-32, 127-160, 173) to higher Unicode
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1

    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

# Example: Space (byte 32) maps to 'Ġ' (Unicode 288)
print(byte_encoder[32])  # 'Ġ'

# This is why GPT-2 tokens look like:
# "Ġthe" = " the" (space + the)
# "Ġhello" = " hello"

Handling Multilingual Text

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    MULTILINGUAL TOKENIZATION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TOKEN EFFICIENCY BY LANGUAGE (approximate):                            │
│  ───────────────────────────────────────────                            │
│                                                                          │
│  Same meaning, different token counts:                                  │
│                                                                          │
│  Language    Text              GPT-4      LLaMA 3    Gemini            │
│                                (cl100k)   (128K)     (256K)            │
│  ─────────────────────────────────────────────────────────────────────  │
│  English     "Hello world"    2          2          2                   │
│  Spanish     "Hola mundo"     3          2          2                   │
│  French      "Bonjour monde"  3          2          2                   │
│  German      "Hallo Welt"     3          2          2                   │
│  Russian     "Привет мир"     5          4          2                   │
│  Chinese     "你好世界"        4          4          2                   │
│  Japanese    "こんにちは"      5          5          3                   │
│  Arabic      "مرحبا بالعالم"   8          7          4                   │
│  Hindi       "नमस्ते दुनिया"    12         10         5                   │
│                                                                          │
│  Non-English typically uses 2-5× more tokens!                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHY THE DISPARITY?                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  1. Training data is English-heavy                                      │
│     • English patterns merged more aggressively                        │
│     • "the" is one token, but "的" might not be                        │
│                                                                          │
│  2. UTF-8 byte count varies                                             │
│     • ASCII: 1 byte per character                                      │
│     • CJK: 3 bytes per character                                       │
│     • More bytes = potentially more tokens                             │
│                                                                          │
│  3. Vocabulary allocation                                               │
│     • 50K vocab mostly English subwords                                │
│     • Non-English shares remaining capacity                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTIONS:                                                              │
│  ──────────                                                              │
│                                                                          │
│  1. Larger vocabularies (100K-256K)                                     │
│     • More room for non-English tokens                                 │
│     • GPT-4o's o200k_base is much better                              │
│                                                                          │
│  2. Balanced training data                                              │
│     • Include more non-English text in tokenizer training             │
│     • Gemini trained on multilingual data                              │
│                                                                          │
│  3. Dedicated multilingual tokenizers                                   │
│     • XLM-RoBERTa, mT5 optimized for many languages                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Unicode Normalization

Python

import unicodedata

# Different ways to represent the same character
s1 = "café"  # 'é' as single character (U+00E9)
s2 = "café"  # 'e' + combining accent (U+0065 U+0301)

print(len(s1), len(s2))  # 4, 5 - different lengths!
print(s1 == s2)  # False!

# Normalization forms
nfc = unicodedata.normalize('NFC', s2)   # Composed: é
nfd = unicodedata.normalize('NFD', s1)   # Decomposed: e + ́

print(s1 == nfc)  # True

# Why this matters for tokenization:
# Without normalization, "café" might tokenize differently
# depending on how the accent was input!

# SentencePiece does NFKC normalization by default
# tiktoken does NOT normalize (byte-level handles this)

# Recommendation: Normalize before tokenization if using
# character-level approaches. Byte-level BPE is more robust.

Emoji Tokenization

Python

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Single emoji
print(enc.encode("😀"))
# [76460] - one token

# Emoji with skin tone modifier
print(enc.encode("👋🏽"))
# [54959, 243, 234] - multiple tokens!
# 👋 (wave) + 🏽 (medium skin tone) = 2 Unicode code points

# Flag emoji (regional indicators)
print(enc.encode("🇺🇸"))
# [155, 232, 161, 248] - 4 tokens
# Flags are two regional indicator letters

# ZWJ sequences (complex emoji)
print(enc.encode("👨‍👩‍👧‍👦"))  # Family emoji
# Many tokens - combines man + woman + girl + boy with ZWJ

# Emoji are expensive! A family emoji can be 10+ tokens.

# Token counts for various emoji (cl100k_base):
emoji_tokens = {
    "😀": 1,
    "👋": 1,
    "👋🏽": 3,
    "🇺🇸": 4,
    "👨‍👩‍👧‍👦": 11,
    "🏳️‍🌈": 6,
}

Part XI: Tokenization Benchmarks

Token Efficiency Comparison

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOKEN EFFICIENCY BENCHMARKS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ENGLISH TEXT (Wikipedia sample):                                        │
│  ─────────────────────────────────                                       │
│                                                                          │
│  Tokenizer           Tokens/1K chars   Tokens/1K words   Compression    │
│  ─────────────────────────────────────────────────────────────────────  │
│  GPT-2 (gpt2)        280               1,340             3.57×          │
│  GPT-3.5 (cl100k)    250               1,200             4.00×          │
│  GPT-4o (o200k)      220               1,050             4.55×          │
│  LLaMA 2             290               1,380             3.45×          │
│  LLaMA 3             230               1,100             4.35×          │
│  Mistral             285               1,360             3.51×          │
│  Claude 3            245               1,170             4.08×          │
│                                                                          │
│  Compression = chars/tokens (higher = more efficient)                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PYTHON CODE:                                                            │
│  ────────────                                                            │
│                                                                          │
│  Tokenizer           Tokens/1K chars   vs English                       │
│  ─────────────────────────────────────────────────────────────────────  │
│  GPT-2               350               1.25× more                       │
│  GPT-3.5 (cl100k)    300               1.20× more                       │
│  GPT-4o (o200k)      260               1.18× more                       │
│  LLaMA 3             250               1.09× more                       │
│  CodeLlama           230               1.00× (optimized)                │
│  DeepSeek-Coder      220               0.96× (better!)                  │
│                                                                          │
│  Code-specialized models are more efficient on code.                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  CHINESE TEXT:                                                           │
│  ─────────────                                                           │
│                                                                          │
│  Tokenizer           Tokens/1K chars   vs English                       │
│  ─────────────────────────────────────────────────────────────────────  │
│  GPT-2               800               2.86× more                       │
│  GPT-3.5 (cl100k)    550               2.20× more                       │
│  GPT-4o (o200k)      380               1.73× more                       │
│  Qwen                250               1.00× (optimized)                │
│  Yi                  260               1.04×                            │
│  Baichuan            240               0.96× (better)                   │
│                                                                          │
│  Chinese-optimized models achieve parity with English.                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  JSON/STRUCTURED DATA:                                                   │
│  ─────────────────────                                                   │
│                                                                          │
│  Tokenizer           Tokens/1K chars   Notes                            │
│  ─────────────────────────────────────────────────────────────────────  │
│  GPT-2               320               Brackets, quotes costly          │
│  GPT-3.5 (cl100k)    280               Better structural tokens         │
│  GPT-4o (o200k)      240               "{" often single token          │
│                                                                          │
│  JSON typically 10-20% more tokens than plain text.                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Cost Implications

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOKENIZATION COST IMPACT                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  API PRICING (per 1M tokens):                                           │
│  ────────────────────────────                                           │
│                                                                          │
│  Model              Input       Output      Effective $/1K chars        │
│  ─────────────────────────────────────────────────────────────────────  │
│  GPT-4o             $2.50       $10.00      ~$0.0006 (English)          │
│  GPT-4o-mini        $0.15       $0.60       ~$0.00003 (English)         │
│  Claude 3.5 Sonnet  $3.00       $15.00      ~$0.0007 (English)          │
│  Claude 3 Haiku     $0.25       $1.25       ~$0.00006 (English)         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LANGUAGE COST MULTIPLIER:                                               │
│  ─────────────────────────                                               │
│                                                                          │
│  Processing 10,000 characters of text:                                  │
│                                                                          │
│  Language    Tokens (GPT-4o)   Cost (at $2.50/M input)                 │
│  ─────────────────────────────────────────────────────────────────────  │
│  English     2,200             $0.0055                                  │
│  Spanish     2,600             $0.0065       (1.2×)                     │
│  Chinese     3,800             $0.0095       (1.7×)                     │
│  Arabic      5,200             $0.0130       (2.4×)                     │
│  Hindi       6,500             $0.0163       (3.0×)                     │
│                                                                          │
│  Non-English users pay significantly more for the same content!        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  OPTIMIZATION STRATEGIES:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  1. Use GPT-4o (o200k) over GPT-4 (cl100k) for multilingual            │
│     • ~20% fewer tokens for non-English                                │
│                                                                          │
│  2. For Chinese/Japanese: Consider Qwen or Yi models                   │
│     • Purpose-built tokenizers                                         │
│     • Same cost as English                                             │
│                                                                          │
│  3. Compress prompts where possible                                    │
│     • Remove unnecessary whitespace                                    │
│     • Use abbreviations in system prompts                              │
│                                                                          │
│  4. Cache and reuse common prefixes                                    │
│     • OpenAI/Anthropic offer prefix caching discounts                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Benchmark Code

Python

import tiktoken
from transformers import AutoTokenizer
import time

def benchmark_tokenizer(name, tokenizer, texts, encode_fn):
    """Benchmark a tokenizer on given texts."""
    start = time.time()

    total_chars = sum(len(t) for t in texts)
    total_tokens = 0

    for text in texts:
        tokens = encode_fn(text)
        total_tokens += len(tokens)

    elapsed = time.time() - start

    return {
        "name": name,
        "total_chars": total_chars,
        "total_tokens": total_tokens,
        "tokens_per_1k_chars": (total_tokens / total_chars) * 1000,
        "compression_ratio": total_chars / total_tokens,
        "time_seconds": elapsed,
        "chars_per_second": total_chars / elapsed
    }

# Sample texts
english_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming how we build software.",
    # ... more samples
] * 100

chinese_texts = [
    "机器学习正在改变我们构建软件的方式。",
    "人工智能将在未来几年内带来巨大变革。",
    # ... more samples
] * 100

code_texts = [
    "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)",
    "class DataProcessor:\n    def __init__(self, config):\n        self.config = config",
    # ... more samples
] * 100

# Benchmark tiktoken
cl100k = tiktoken.get_encoding("cl100k_base")
o200k = tiktoken.get_encoding("o200k_base")

results = []
results.append(benchmark_tokenizer(
    "GPT-3.5/4 (cl100k)", cl100k, english_texts, cl100k.encode
))
results.append(benchmark_tokenizer(
    "GPT-4o (o200k)", o200k, english_texts, o200k.encode
))

# Benchmark HuggingFace tokenizers
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
results.append(benchmark_tokenizer(
    "LLaMA 2", llama_tok, english_texts, llama_tok.encode
))

# Print results
print(f"{'Tokenizer':<25} {'Tokens/1K chars':<18} {'Compression':<12} {'Speed':<15}")
print("-" * 70)
for r in results:
    print(f"{r['name']:<25} {r['tokens_per_1k_chars']:<18.1f} {r['compression_ratio']:<12.2f}x {r['chars_per_second']:<15,.0f}")

Part XII: Training Your Own Tokenizer

When to Train Custom Tokenizers

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    WHEN TO TRAIN CUSTOM TOKENIZERS                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TRAIN CUSTOM TOKENIZER WHEN:                                           │
│  ────────────────────────────                                           │
│                                                                          │
│  ✓ Training a new model from scratch                                   │
│  ✓ Domain has very specialized vocabulary                              │
│    • Medical: "electroencephalography", "bronchopneumonia"             │
│    • Legal: "indemnification", "notwithstanding"                       │
│    • Scientific: "CRISPR-Cas9", "immunohistochemistry"                │
│  ✓ Non-English language dominates your use case                        │
│  ✓ Highly specialized format (DNA sequences, chemical formulas)        │
│  ✓ Need maximum efficiency for specific domain                         │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  DON'T TRAIN CUSTOM TOKENIZER WHEN:                                     │
│  ───────────────────────────────────                                    │
│                                                                          │
│  ✗ Using pre-trained models (MUST use their tokenizer)                 │
│  ✗ General-purpose text (existing tokenizers are fine)                 │
│  ✗ Small fine-tuning dataset                                           │
│  ✗ Want to leverage transfer learning                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  HYBRID APPROACH:                                                        │
│  ────────────────                                                        │
│                                                                          │
│  Extend existing tokenizer with domain terms:                          │
│  • Add special tokens for domain-specific terms                        │
│  • Keep base vocabulary for transfer learning                          │
│  • Requires additional embedding training                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Complete Training Pipeline

Python

from tokenizers import (
    Tokenizer, models, pre_tokenizers, decoders,
    trainers, processors, normalizers
)
from pathlib import Path
import json

def train_bpe_tokenizer(
    corpus_files: list,
    vocab_size: int = 32000,
    min_frequency: int = 2,
    output_path: str = "tokenizer.json"
):
    """
    Train a production-quality BPE tokenizer.

    Args:
        corpus_files: List of paths to training text files
        vocab_size: Target vocabulary size
        min_frequency: Minimum token frequency to include
        output_path: Where to save the trained tokenizer
    """

    # ═══════════════════════════════════════════════════════════════════
    # Step 1: Initialize the model
    # ═══════════════════════════════════════════════════════════════════
    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))

    # ═══════════════════════════════════════════════════════════════════
    # Step 2: Set up normalizer (text preprocessing)
    # ═══════════════════════════════════════════════════════════════════
    tokenizer.normalizer = normalizers.Sequence([
        normalizers.NFC(),  # Unicode normalization
        normalizers.Replace(r"\s+", " "),  # Collapse whitespace
        normalizers.Strip(),  # Trim
    ])

    # ═══════════════════════════════════════════════════════════════════
    # Step 3: Set up pre-tokenizer (how to split before BPE)
    # ═══════════════════════════════════════════════════════════════════
    # GPT-2 style: byte-level with space prefix
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
        add_prefix_space=True,
        use_regex=True
    )

    # ═══════════════════════════════════════════════════════════════════
    # Step 4: Configure trainer
    # ═══════════════════════════════════════════════════════════════════
    special_tokens = [
        "<unk>",      # Unknown token
        "<s>",        # Beginning of sequence
        "</s>",       # End of sequence
        "<pad>",      # Padding
        "<mask>",     # For masked language modeling
        # Add custom special tokens
        "<|system|>",
        "<|user|>",
        "<|assistant|>",
    ]

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=special_tokens,
        show_progress=True,
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
    )

    # ═══════════════════════════════════════════════════════════════════
    # Step 5: Train
    # ═══════════════════════════════════════════════════════════════════
    print(f"Training on {len(corpus_files)} files...")
    tokenizer.train(corpus_files, trainer)

    # ═══════════════════════════════════════════════════════════════════
    # Step 6: Set up decoder (for converting back to text)
    # ═══════════════════════════════════════════════════════════════════
    tokenizer.decoder = decoders.ByteLevel()

    # ═══════════════════════════════════════════════════════════════════
    # Step 7: Set up post-processor (add special tokens)
    # ═══════════════════════════════════════════════════════════════════
    tokenizer.post_processor = processors.TemplateProcessing(
        single="<s> $A </s>",
        pair="<s> $A </s> $B </s>",
        special_tokens=[
            ("<s>", tokenizer.token_to_id("<s>")),
            ("</s>", tokenizer.token_to_id("</s>")),
        ]
    )

    # ═══════════════════════════════════════════════════════════════════
    # Step 8: Save
    # ═══════════════════════════════════════════════════════════════════
    tokenizer.save(output_path)
    print(f"Tokenizer saved to {output_path}")

    # Print statistics
    print(f"\nTokenizer Statistics:")
    print(f"  Vocabulary size: {tokenizer.get_vocab_size()}")
    print(f"  Special tokens: {special_tokens}")

    return tokenizer

def train_from_dataset(dataset, output_path: str, vocab_size: int = 32000):
    """
    Train tokenizer from a HuggingFace dataset.
    """
    from tokenizers import Tokenizer, models, trainers, pre_tokenizers

    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<unk>", "<s>", "</s>", "<pad>"]
    )

    # Train from iterator (memory efficient)
    def batch_iterator(batch_size=1000):
        for i in range(0, len(dataset), batch_size):
            yield dataset[i:i+batch_size]["text"]

    tokenizer.train_from_iterator(batch_iterator(), trainer)
    tokenizer.save(output_path)

    return tokenizer

# Example usage
if __name__ == "__main__":
    # Train on local files
    corpus_files = [
        "data/wikipedia.txt",
        "data/books.txt",
        "data/code.txt"
    ]

    tokenizer = train_bpe_tokenizer(
        corpus_files=corpus_files,
        vocab_size=32000,
        min_frequency=2,
        output_path="my_tokenizer.json"
    )

    # Test the tokenizer
    text = "Hello, world! This is a test."
    encoding = tokenizer.encode(text)
    print(f"\nTest encoding:")
    print(f"  Input: {text}")
    print(f"  Tokens: {encoding.tokens}")
    print(f"  IDs: {encoding.ids}")

Vocabulary Size Selection

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    VOCABULARY SIZE GUIDE                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  FACTORS TO CONSIDER:                                                    │
│  ────────────────────                                                    │
│                                                                          │
│  1. Model Size                                                           │
│     • Embedding table = vocab_size × embedding_dim × 4 bytes           │
│     • 32K × 4096 × 4 = 512 MB                                          │
│     • 128K × 4096 × 4 = 2 GB                                           │
│     • For small models (<1B), keep vocab smaller                       │
│                                                                          │
│  2. Training Data Size                                                   │
│     • Small data: smaller vocab (rare tokens won't learn well)         │
│     • Large data: larger vocab (can afford more tokens)                │
│                                                                          │
│  3. Language Coverage                                                    │
│     • English only: 32K-50K sufficient                                 │
│     • Multilingual: 100K-256K recommended                              │
│                                                                          │
│  4. Domain Specificity                                                   │
│     • General: standard vocab sizes                                    │
│     • Specialized (code, medical): may need larger                     │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  RECOMMENDED SIZES:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  Use Case                        Vocab Size   Rationale                 │
│  ─────────────────────────────────────────────────────────────────────  │
│  Small model (<500M)             16K-32K      Minimize embedding cost   │
│  Medium model (1B-7B)            32K-64K      Balance efficiency/size   │
│  Large model (7B+)               64K-128K     Efficiency matters more   │
│  English-only                    32K-50K      Sufficient coverage       │
│  Multilingual                    100K-256K    Cover multiple scripts    │
│  Code-focused                    32K-64K      + FIM special tokens      │
│  Code + multilingual             128K+        Llama 3 approach          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EMPIRICAL TUNING:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  1. Train tokenizers with different vocab sizes                        │
│  2. Measure tokens/character on held-out data                          │
│  3. Plot compression ratio vs vocab size                               │
│  4. Find knee of curve (diminishing returns)                           │
│                                                                          │
│  Typically: Beyond 64K, gains are marginal for English.               │
│             Multilingual benefits from 100K+.                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Data Preparation for Tokenizer Training

Python

import os
from pathlib import Path
from typing import Iterator
import random

def prepare_corpus(
    input_paths: list,
    output_path: str,
    sample_ratio: float = 1.0,
    min_length: int = 100,
    max_length: int = 100000,
    shuffle: bool = True
):
    """
    Prepare and clean corpus for tokenizer training.

    Key considerations:
    - Balance across domains/languages
    - Remove very short/long texts
    - Deduplicate
    - Sample if corpus too large
    """

    all_texts = []

    for input_path in input_paths:
        path = Path(input_path)

        if path.is_file():
            with open(path, 'r', encoding='utf-8', errors='ignore') as f:
                for line in f:
                    line = line.strip()
                    if min_length <= len(line) <= max_length:
                        all_texts.append(line)

        elif path.is_dir():
            for file_path in path.glob('**/*.txt'):
                with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                    text = f.read().strip()
                    if min_length <= len(text) <= max_length:
                        all_texts.append(text)

    print(f"Collected {len(all_texts)} texts")

    # Sample if needed
    if sample_ratio < 1.0:
        sample_size = int(len(all_texts) * sample_ratio)
        all_texts = random.sample(all_texts, sample_size)
        print(f"Sampled {len(all_texts)} texts")

    # Shuffle for better training
    if shuffle:
        random.shuffle(all_texts)

    # Write to output
    with open(output_path, 'w', encoding='utf-8') as f:
        for text in all_texts:
            f.write(text + '\n')

    print(f"Written to {output_path}")
    print(f"Total characters: {sum(len(t) for t in all_texts):,}")

    return output_path

def balanced_multilingual_corpus(
    language_files: dict,
    output_path: str,
    samples_per_language: int = 100000
):
    """
    Create a balanced multilingual corpus.

    Args:
        language_files: {"en": "english.txt", "zh": "chinese.txt", ...}
        samples_per_language: Max samples per language for balance
    """

    all_texts = []

    for lang, file_path in language_files.items():
        texts = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if len(line) > 50:  # Minimum length
                    texts.append(line)

        # Sample for balance
        if len(texts) > samples_per_language:
            texts = random.sample(texts, samples_per_language)

        print(f"  {lang}: {len(texts)} samples")
        all_texts.extend(texts)

    random.shuffle(all_texts)

    with open(output_path, 'w', encoding='utf-8') as f:
        for text in all_texts:
            f.write(text + '\n')

    return output_path

Part XIII: Domain-Specific Tokenization

Medical/Scientific Tokenization

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    MEDICAL/SCIENTIFIC TOKENIZATION                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE PROBLEM:                                                            │
│  ────────────                                                            │
│                                                                          │
│  Standard tokenizers fragment medical terms:                            │
│                                                                          │
│  "electroencephalography"                                               │
│  GPT-4: ["elect", "ro", "ence", "phal", "ography"] (5 tokens)          │
│                                                                          │
│  "immunohistochemistry"                                                 │
│  GPT-4: ["imm", "uno", "hist", "ochemistry"] (4 tokens)                │
│                                                                          │
│  Fragmentation hurts model understanding of domain terms!              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION: DOMAIN-AWARE TOKENIZER                                       │
│  ─────────────────────────────────                                       │
│                                                                          │
│  1. Include medical/scientific text in training corpus                 │
│  2. Use larger vocabulary to capture domain terms                      │
│  3. Consider adding high-frequency terms as special tokens             │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  EXAMPLE: PUBMED-TRAINED TOKENIZER                                      │
│  ─────────────────────────────────                                       │
│                                                                          │
│  "electroencephalography"                                               │
│  PubMed tokenizer: ["electroencephalography"] (1 token!)               │
│                                                                          │
│  "CRISPR-Cas9"                                                          │
│  Standard: ["CR", "ISP", "R", "-", "C", "as", "9"] (7 tokens)          │
│  Domain: ["CRISPR", "-", "Cas9"] (3 tokens)                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Code Tokenization

Python

# Challenges with code tokenization

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Problem 1: Indentation creates many tokens
code = """def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)"""

tokens = enc.encode(code)
print(f"Token count: {len(tokens)}")  # More than you'd expect

# Problem 2: Variable names split awkwardly
print(enc.encode("getUserById"))
# ['get', 'User', 'By', 'Id'] - 4 tokens, loses semantic connection

print(enc.encode("calculate_total_price"))
# ['calculate', '_', 'total', '_', 'price'] - underscores are separate

# Problem 3: Numbers in code
print(enc.encode("0x1234ABCD"))  # Hex literals
print(enc.encode("192.168.1.1"))  # IP addresses
# Often split unpredictably

# ═══════════════════════════════════════════════════════════════════════
# SOLUTIONS FOR CODE TOKENIZATION
# ═══════════════════════════════════════════════════════════════════════

# Solution 1: Use code-optimized models
# StarCoder, CodeLlama, DeepSeek-Coder have better code tokenizers

# Solution 2: Pre-process code
def normalize_code_for_tokenization(code: str) -> str:
    """Normalize code to improve tokenization."""
    # Convert tabs to spaces (consistent indentation)
    code = code.replace('\t', '    ')

    # Normalize line endings
    code = code.replace('\r\n', '\n')

    # Remove trailing whitespace
    lines = [line.rstrip() for line in code.split('\n')]
    code = '\n'.join(lines)

    return code

# Solution 3: Train domain-specific tokenizer
# Include large code corpus in tokenizer training

Fill-in-the-Middle (FIM) Tokens

Python

# FIM tokens for code completion

# Standard autoregressive: predict next token
# FIM: predict MIDDLE given prefix and suffix

# Example FIM format (StarCoder/CodeLlama):
#
# Original code:
# def hello():
#     print("Hello")  <- want model to generate this
#     return True
#
# FIM transformation:
# <fim_prefix>def hello():
# <fim_suffix>
#     return True
# <fim_middle>    print("Hello")

def convert_to_fim(code: str, cursor_pos: int) -> str:
    """Convert code to FIM format for training."""
    prefix = code[:cursor_pos]
    suffix = code[cursor_pos:]

    # FIM format
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

def fim_infill(model, tokenizer, prefix: str, suffix: str) -> str:
    """Use FIM to generate code between prefix and suffix."""
    prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    output = model.generate(
        input_ids,
        max_new_tokens=256,
        eos_token_id=tokenizer.encode("<fim_suffix>")[0],
        pad_token_id=tokenizer.pad_token_id
    )

    generated = tokenizer.decode(output[0], skip_special_tokens=False)

    # Extract the middle part
    middle_start = generated.find("<fim_middle>") + len("<fim_middle>")
    middle_end = generated.find("<fim_suffix>", middle_start)
    if middle_end == -1:
        middle_end = len(generated)

    return generated[middle_start:middle_end]

Chemical and Mathematical Notation

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    SPECIALIZED NOTATION TOKENIZATION                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CHEMICAL FORMULAS:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  Formula              Standard Tokenizer         Ideal                  │
│  ─────────────────────────────────────────────────────────────────────  │
│  H2O                  ["H", "2", "O"]           ["H2O"]                 │
│  C6H12O6              ["C", "6", "H", ...]      ["C6H12O6"]             │
│  NaCl                 ["Na", "Cl"]              ["NaCl"]                │
│  CH3COOH              ["CH", "3", "CO", "OH"]   ["CH3COOH"]             │
│                                                                          │
│  SMILES (molecular representation):                                     │
│  CC(=O)OC1=CC=CC=C1C(=O)O  (Aspirin)                                   │
│  Breaks into many small tokens, loses structure                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MATHEMATICAL NOTATION:                                                  │
│  ──────────────────────                                                  │
│                                                                          │
│  LaTeX: $\sum_{i=1}^{n} x_i$                                           │
│  Standard: ["$", "\\", "sum", "_", "{", "i", "=", "1", ...] (many!)    │
│                                                                          │
│  Greek letters: α, β, γ, θ, Σ, ∫                                       │
│  Often multi-token in standard tokenizers                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTIONS:                                                              │
│  ──────────                                                              │
│                                                                          │
│  1. Add domain terms as special tokens                                 │
│  2. Use character-level models for chemistry (SELFIES, SMILES)         │
│  3. Train on domain-specific corpora                                   │
│  4. Consider multi-modal approaches (image of formula)                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part XIV: Multimodal Tokenization

Vision Tokenization

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    VISION TOKENIZATION                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE CHALLENGE:                                                          │
│  ──────────────                                                          │
│                                                                          │
│  Text: discrete tokens from finite vocabulary                          │
│  Images: continuous pixel values, high dimensionality                  │
│                                                                          │
│  224×224 RGB image = 150,528 values                                    │
│  Can't feed raw pixels to transformer!                                 │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION 1: PATCH EMBEDDINGS (ViT style)                               │
│  ─────────────────────────────────────────                               │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                 Image Tokenization (ViT)                         │   │
│  ├─────────────────────────────────────────────────────────────────┤   │
│  │                                                                  │   │
│  │  Input Image (224×224)                                          │   │
│  │       │                                                         │   │
│  │       ▼ Split into patches                                      │   │
│  │                                                                  │   │
│  │  ┌────┬────┬────┬────┐                                          │   │
│  │  │ P1 │ P2 │ P3 │ P4 │  14×14 patches                          │   │
│  │  ├────┼────┼────┼────┤  (16×16 pixels each)                     │   │
│  │  │ P5 │ P6 │ P7 │ P8 │                                          │   │
│  │  ├────┼────┼────┼────┤  = 196 patches                          │   │
│  │  │... │... │... │... │                                          │   │
│  │  └────┴────┴────┴────┘                                          │   │
│  │       │                                                         │   │
│  │       ▼ Linear projection                                       │   │
│  │                                                                  │   │
│  │  Each patch → embedding vector (e.g., 768-dim)                 │   │
│  │  196 patches → 196 "visual tokens"                             │   │
│  │       │                                                         │   │
│  │       ▼ Add position embeddings                                 │   │
│  │                                                                  │   │
│  │  [CLS] + 196 patch tokens → Transformer                        │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  Token count: 196 (fixed) + special tokens                             │
│  Higher resolution → more tokens (quadratic!)                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION 2: VISION ENCODER + PROJECTOR (LLaVA style)                  │
│  ─────────────────────────────────────────────────────                  │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                                                                  │   │
│  │  Image → CLIP ViT → [v1, v2, ..., v576]  (576 visual tokens)  │   │
│  │                 │                                               │   │
│  │                 ▼ MLP Projector                                 │   │
│  │                                                                  │   │
│  │  [v1', v2', ..., v576']  (projected to LLM embedding space)   │   │
│  │                 │                                               │   │
│  │                 ▼ Concatenate with text                        │   │
│  │                                                                  │   │
│  │  [BOS] + visual_tokens + text_tokens → LLM                     │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  LLaVA-1.5: 576 visual tokens per image                                │
│  Higher resolution variants: 1000+ tokens                              │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  SOLUTION 3: DISCRETE VISUAL TOKENS (VQGAN/VQVAE)                      │
│  ─────────────────────────────────────────────────                      │
│                                                                          │
│  Image → Encoder → Quantize to codebook → Discrete tokens             │
│                                                                          │
│  Like text tokenization: finite vocabulary of visual "words"          │
│  Used by: DALL-E, Parti, some video models                            │
│                                                                          │
│  Codebook size: typically 8192-16384 visual tokens                    │
│  Can generate images autoregressively like text!                      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2025 INNOVATIONS:                                                       │
│  ─────────────────                                                       │
│                                                                          │
│  TokenFlow (CVPR 2025):                                                 │
│  • Unified image tokenizer for understanding AND generation            │
│  • Dual-codebook: semantic + pixel-level features                     │
│  • Decouples semantic and pixel learning while maintaining alignment  │
│                                                                          │
│  Harmonizer (May 2025):                                                 │
│  • FusionQuantizer for heterogeneous signals (text, audio, video)     │
│  • Unified tokenization across modalities                             │
│                                                                          │
│  Gemini 2.0 Token Counting:                                            │
│  • Images ≤384px: 258 tokens                                          │
│  • Larger images: 768×768 tiles, 258 tokens each                      │
│  • Video: 263 tokens/second                                           │
│  • Audio: 32 tokens/second                                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Audio Tokenization

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    AUDIO TOKENIZATION                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  APPROACH 1: MEL SPECTROGRAM + PATCHES                                  │
│  ─────────────────────────────────────                                   │
│                                                                          │
│  Audio waveform → Mel spectrogram → Patch embedding                    │
│  Similar to image tokenization                                         │
│  Used by: Whisper (OpenAI), AudioLM                                    │
│                                                                          │
│  Whisper:                                                               │
│  • 30s audio → 80 mel bins × 3000 frames                              │
│  • 2D convolutions → 1500 audio tokens                                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  APPROACH 2: NEURAL AUDIO CODEC (EnCodec/DAC)                          │
│  ─────────────────────────────────────────────                          │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                                                                  │   │
│  │  Audio → Encoder → RVQ → Discrete codes                        │   │
│  │                     │                                           │   │
│  │                     ▼                                           │   │
│  │  Residual Vector Quantization (RVQ):                           │   │
│  │  Layer 1: Coarse features (prosody, speaker)                   │   │
│  │  Layer 2: Mid-level features                                   │   │
│  │  ...                                                            │   │
│  │  Layer 8: Fine details                                         │   │
│  │                                                                  │   │
│  │  Each layer: 1024-codebook quantization                        │   │
│  │  Result: 8 × (audio_length / stride) tokens                   │   │
│  │                                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  EnCodec (Meta):                                                        │
│  • 24kHz audio → 75 tokens/second (at 6kbps)                         │
│  • 8 codebook layers                                                   │
│  • Can reconstruct high-quality audio                                 │
│                                                                          │
│  Used by: MusicGen, AudioCraft, Bark                                  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  APPROACH 3: SEMANTIC + ACOUSTIC TOKENS                                 │
│  ───────────────────────────────────────                                │
│                                                                          │
│  Two-stage tokenization (AudioLM, VALL-E):                            │
│                                                                          │
│  1. Semantic tokens: Content/meaning                                   │
│     • From w2v-BERT or HuBERT                                         │
│     • ~50 tokens/second                                               │
│     • Language-model-friendly                                         │
│                                                                          │
│  2. Acoustic tokens: Sound quality                                     │
│     • From EnCodec/SoundStream                                        │
│     • ~200-600 tokens/second                                          │
│     • For high-fidelity reconstruction                                │
│                                                                          │
│  Generate semantic first, then acoustic (coarse-to-fine)              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Video Tokenization

Python

# Video tokenization approaches

# Approach 1: Frame-by-frame (simple but expensive)
# Each frame → image tokenizer → many tokens
# 30fps × 10 seconds × 576 tokens/frame = 172,800 tokens!

# Approach 2: Temporal compression
# Sample keyframes, use temporal transformer
# Typical: 1-4 fps sampling

# Approach 3: 3D patches (VideoMAE, InternVideo)
def tokenize_video_3d_patches(
    video,           # (T, H, W, C) tensor
    patch_size=16,   # Spatial patch size
    temporal_patch=2 # Temporal patch size
):
    """
    Tokenize video using 3D patches.

    Example: 16 frames × 224×224 video
    Patches: (16/2) × (224/16) × (224/16) = 8 × 14 × 14 = 1568 tokens
    """
    T, H, W, C = video.shape

    # Number of patches in each dimension
    n_t = T // temporal_patch  # 8
    n_h = H // patch_size      # 14
    n_w = W // patch_size      # 14

    # Reshape into patches
    patches = video.reshape(
        n_t, temporal_patch,
        n_h, patch_size,
        n_w, patch_size,
        C
    )
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    patches = patches.reshape(n_t * n_h * n_w, -1)

    # Linear projection to embeddings
    # patches: (1568, temporal_patch × patch_size × patch_size × C)
    # Project to: (1568, embed_dim)

    return patches  # 1568 video tokens

# Approach 4: Video codec tokens (VideoGPT, MAGVIT)
# Use VQ-VAE trained on video
# Discrete codebook for video "words"

Part XV: Production Considerations

Tokenization Caching

Python

import hashlib
from functools import lru_cache
from typing import List, Tuple
import tiktoken
import redis

class CachedTokenizer:
    """
    Production tokenizer with multi-level caching.
    """

    def __init__(
        self,
        model: str = "gpt-4",
        redis_url: str = None,
        local_cache_size: int = 10000
    ):
        self.enc = tiktoken.encoding_for_model(model)
        self.redis = redis.from_url(redis_url) if redis_url else None
        self._local_cache = {}
        self._cache_size = local_cache_size

    def _hash_text(self, text: str) -> str:
        """Create cache key from text."""
        return hashlib.sha256(text.encode()).hexdigest()[:16]

    def encode(self, text: str) -> List[int]:
        """Encode with caching."""
        cache_key = self._hash_text(text)

        # Level 1: Local memory cache
        if cache_key in self._local_cache:
            return self._local_cache[cache_key]

        # Level 2: Redis cache
        if self.redis:
            cached = self.redis.get(f"tok:{cache_key}")
            if cached:
                tokens = list(map(int, cached.decode().split(',')))
                self._local_cache[cache_key] = tokens
                return tokens

        # Level 3: Compute
        tokens = self.enc.encode(text)

        # Store in caches
        self._local_cache[cache_key] = tokens
        if len(self._local_cache) > self._cache_size:
            # Simple eviction (could use LRU)
            self._local_cache.pop(next(iter(self._local_cache)))

        if self.redis:
            self.redis.setex(
                f"tok:{cache_key}",
                3600,  # 1 hour TTL
                ','.join(map(str, tokens))
            )

        return tokens

    def encode_batch(
        self,
        texts: List[str],
        num_threads: int = 4
    ) -> List[List[int]]:
        """Batch encode with caching."""
        results = [None] * len(texts)
        uncached = []

        # Check cache first
        for i, text in enumerate(texts):
            cache_key = self._hash_text(text)
            if cache_key in self._local_cache:
                results[i] = self._local_cache[cache_key]
            else:
                uncached.append((i, text))

        # Batch encode uncached
        if uncached:
            uncached_texts = [text for _, text in uncached]
            encoded = self.enc.encode_batch(uncached_texts, num_threads=num_threads)

            for (i, text), tokens in zip(uncached, encoded):
                results[i] = tokens
                cache_key = self._hash_text(text)
                self._local_cache[cache_key] = tokens

        return results

    def count_tokens(self, text: str) -> int:
        """Quick token count."""
        return len(self.encode(text))

Streaming Tokenization

Python

from typing import Iterator, Generator
import tiktoken

def stream_tokenize(
    text_stream: Iterator[str],
    model: str = "gpt-4",
    chunk_size: int = 1000
) -> Generator[list, None, None]:
    """
    Tokenize streaming text efficiently.

    Handles the challenge of tokens that span chunk boundaries.
    """
    enc = tiktoken.encoding_for_model(model)
    buffer = ""

    for chunk in text_stream:
        buffer += chunk

        if len(buffer) >= chunk_size:
            # Tokenize all but the last few characters
            # (in case a token spans the boundary)
            safe_boundary = len(buffer) - 50  # Keep 50 char buffer

            to_tokenize = buffer[:safe_boundary]
            buffer = buffer[safe_boundary:]

            tokens = enc.encode(to_tokenize)
            yield tokens

    # Tokenize remaining buffer
    if buffer:
        tokens = enc.encode(buffer)
        yield tokens

def count_tokens_streaming(text_stream: Iterator[str]) -> int:
    """Count tokens in a stream without loading full text."""
    total = 0
    for token_batch in stream_tokenize(text_stream):
        total += len(token_batch)
    return total

# Usage with file streaming
def tokenize_large_file(file_path: str):
    """Tokenize a large file without loading into memory."""
    def file_chunks(path, chunk_size=8192):
        with open(path, 'r', encoding='utf-8') as f:
            while chunk := f.read(chunk_size):
                yield chunk

    all_tokens = []
    for token_batch in stream_tokenize(file_chunks(file_path)):
        all_tokens.extend(token_batch)

    return all_tokens

Token Budget Management

Python

from dataclasses import dataclass
from typing import List, Optional
import tiktoken

@dataclass
class TokenBudget:
    """Manage token budgets for context windows."""
    max_tokens: int
    reserved_output: int = 1000

    @property
    def available_input(self) -> int:
        return self.max_tokens - self.reserved_output

class ContextManager:
    """
    Manage context window token budget.
    """

    def __init__(
        self,
        model: str,
        max_context: int,
        reserved_output: int = 2000
    ):
        self.enc = tiktoken.encoding_for_model(model)
        self.budget = TokenBudget(max_context, reserved_output)

    def count(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.enc.encode(text))

    def fit_messages(
        self,
        messages: List[dict],
        system_prompt: Optional[str] = None
    ) -> List[dict]:
        """
        Fit messages into context window, truncating oldest if needed.

        Returns messages that fit within budget.
        """
        budget = self.budget.available_input

        # Account for system prompt
        if system_prompt:
            system_tokens = self.count(system_prompt) + 10  # overhead
            budget -= system_tokens

        # Calculate tokens for each message
        message_tokens = []
        for msg in messages:
            # Approximate token count including formatting
            tokens = self.count(msg["content"]) + 4  # role + formatting
            message_tokens.append(tokens)

        # Always keep the most recent message
        total = message_tokens[-1]
        kept_indices = [len(messages) - 1]

        # Add older messages until budget exhausted
        for i in range(len(messages) - 2, -1, -1):
            if total + message_tokens[i] <= budget:
                total += message_tokens[i]
                kept_indices.append(i)
            else:
                break

        # Return messages in original order
        kept_indices.sort()
        return [messages[i] for i in kept_indices]

    def truncate_to_fit(
        self,
        text: str,
        max_tokens: int,
        from_end: bool = False
    ) -> str:
        """
        Truncate text to fit within token limit.

        Args:
            text: Text to truncate
            max_tokens: Maximum tokens allowed
            from_end: If True, keep end of text; if False, keep start
        """
        tokens = self.enc.encode(text)

        if len(tokens) <= max_tokens:
            return text

        if from_end:
            truncated_tokens = tokens[-max_tokens:]
        else:
            truncated_tokens = tokens[:max_tokens]

        return self.enc.decode(truncated_tokens)

    def split_into_chunks(
        self,
        text: str,
        chunk_tokens: int,
        overlap_tokens: int = 0
    ) -> List[str]:
        """
        Split text into chunks of specified token size.
        """
        tokens = self.enc.encode(text)
        chunks = []

        start = 0
        while start < len(tokens):
            end = min(start + chunk_tokens, len(tokens))
            chunk_tokens_slice = tokens[start:end]
            chunks.append(self.enc.decode(chunk_tokens_slice))
            start = end - overlap_tokens

        return chunks

# Usage
manager = ContextManager(
    model="gpt-4",
    max_context=128000,
    reserved_output=4000
)

# Fit conversation into context
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    # ... many more messages
]

fitted = manager.fit_messages(
    messages,
    system_prompt="You are a helpful assistant."
)
print(f"Kept {len(fitted)} of {len(messages)} messages")

Tokenization Monitoring

Python

from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List
import time

@dataclass
class TokenizationMetrics:
    """Track tokenization metrics for monitoring."""

    total_texts: int = 0
    total_tokens: int = 0
    total_chars: int = 0
    total_time_ms: float = 0
    errors: int = 0

    # Per-language stats
    by_language: Dict[str, dict] = field(default_factory=dict)

    def record(
        self,
        text: str,
        tokens: int,
        time_ms: float,
        language: str = "unknown"
    ):
        self.total_texts += 1
        self.total_tokens += tokens
        self.total_chars += len(text)
        self.total_time_ms += time_ms

        if language not in self.by_language:
            self.by_language[language] = {
                "texts": 0, "tokens": 0, "chars": 0
            }
        self.by_language[language]["texts"] += 1
        self.by_language[language]["tokens"] += tokens
        self.by_language[language]["chars"] += len(text)

    @property
    def avg_tokens_per_char(self) -> float:
        if self.total_chars == 0:
            return 0
        return self.total_tokens / self.total_chars

    @property
    def avg_time_per_text_ms(self) -> float:
        if self.total_texts == 0:
            return 0
        return self.total_time_ms / self.total_texts

    def report(self) -> dict:
        return {
            "total_texts": self.total_texts,
            "total_tokens": self.total_tokens,
            "total_chars": self.total_chars,
            "avg_tokens_per_char": self.avg_tokens_per_char,
            "avg_time_per_text_ms": self.avg_time_per_text_ms,
            "compression_ratio": self.total_chars / max(self.total_tokens, 1),
            "errors": self.errors,
            "by_language": self.by_language
        }

class MonitoredTokenizer:
    """Tokenizer with built-in monitoring."""

    def __init__(self, model: str = "gpt-4"):
        import tiktoken
        self.enc = tiktoken.encoding_for_model(model)
        self.metrics = TokenizationMetrics()

    def encode(self, text: str, language: str = "en") -> List[int]:
        start = time.time()

        try:
            tokens = self.enc.encode(text)
            time_ms = (time.time() - start) * 1000

            self.metrics.record(
                text=text,
                tokens=len(tokens),
                time_ms=time_ms,
                language=language
            )

            return tokens

        except Exception as e:
            self.metrics.errors += 1
            raise

    def get_metrics(self) -> dict:
        return self.metrics.report()

# Usage
tokenizer = MonitoredTokenizer("gpt-4")

# Process texts
for text, lang in texts_with_language:
    tokens = tokenizer.encode(text, language=lang)

# Get metrics
metrics = tokenizer.get_metrics()
print(f"Processed {metrics['total_texts']} texts")
print(f"Average compression: {metrics['compression_ratio']:.2f}x")
print(f"By language: {metrics['by_language']}")

Part XVI: Debugging Tokenization Issues

Common Issues and Solutions

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOKENIZATION DEBUGGING                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ISSUE 1: UNEXPECTED TOKEN COUNT                                        │
│  ───────────────────────────────                                        │
│                                                                          │
│  Symptom: "My 100-word text uses 500 tokens!"                          │
│                                                                          │
│  Common causes:                                                         │
│  • Non-English text (2-5× token inflation)                             │
│  • Lots of code/special characters                                     │
│  • Emojis (1-11 tokens each!)                                          │
│  • Unusual whitespace/formatting                                       │
│                                                                          │
│  Debug:                                                                 │
│  tokens = enc.encode(text)                                             │
│  for t in tokens:                                                      │
│      print(repr(enc.decode([t])))  # See each token                   │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ISSUE 2: TOKENIZATION INCONSISTENCY                                    │
│  ───────────────────────────────────                                    │
│                                                                          │
│  Symptom: Same text tokenizes differently                              │
│                                                                          │
│  Common causes:                                                         │
│  • Unicode normalization differences (NFC vs NFD)                      │
│  • Invisible characters (zero-width spaces, RTL marks)                 │
│  • Different line endings (\n vs \r\n)                                 │
│  • Leading/trailing whitespace                                         │
│                                                                          │
│  Debug:                                                                 │
│  print([hex(ord(c)) for c in text])  # See exact characters           │
│  import unicodedata                                                    │
│  print([unicodedata.name(c, 'UNKNOWN') for c in text])                │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ISSUE 3: DECODE/ENCODE MISMATCH                                        │
│  ───────────────────────────────                                        │
│                                                                          │
│  Symptom: decode(encode(text)) != text                                 │
│                                                                          │
│  Common causes:                                                         │
│  • Invalid UTF-8 sequences in input                                    │
│  • Characters outside tokenizer's training data                       │
│  • Special tokens being interpreted                                    │
│                                                                          │
│  Debug:                                                                 │
│  original = "test text"                                                │
│  tokens = enc.encode(original)                                         │
│  decoded = enc.decode(tokens)                                          │
│  if original != decoded:                                               │
│      for i, (o, d) in enumerate(zip(original, decoded)):              │
│          if o != d:                                                    │
│              print(f"Diff at {i}: {repr(o)} vs {repr(d)}")            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  ISSUE 4: WRONG TOKENIZER FOR MODEL                                     │
│  ──────────────────────────────────                                     │
│                                                                          │
│  Symptom: Model output is garbage/repetitive                           │
│                                                                          │
│  Cause: Using tokenizer from different model                           │
│                                                                          │
│  Solution: ALWAYS match tokenizer to model                             │
│                                                                          │
│  # WRONG                                                                │
│  tok = AutoTokenizer.from_pretrained("bert-base")                     │
│  model = AutoModel.from_pretrained("gpt2")  # Different!              │
│                                                                          │
│  # RIGHT                                                                │
│  tok = AutoTokenizer.from_pretrained("gpt2")                          │
│  model = AutoModel.from_pretrained("gpt2")  # Same!                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Debugging Tools

Python

import tiktoken
from typing import List, Tuple

def debug_tokenization(text: str, model: str = "gpt-4") -> dict:
    """
    Comprehensive tokenization debugging.
    """
    enc = tiktoken.encoding_for_model(model)

    # Basic encoding
    tokens = enc.encode(text)

    # Decode each token
    token_details = []
    for t in tokens:
        decoded = enc.decode([t])
        token_bytes = enc.decode_single_token_bytes(t)
        token_details.append({
            "id": t,
            "decoded": decoded,
            "repr": repr(decoded),
            "bytes": token_bytes.hex(),
            "length": len(decoded)
        })

    # Find unusual tokens
    unusual = []
    for detail in token_details:
        decoded = detail["decoded"]
        # Check for non-printable characters
        if any(ord(c) < 32 or ord(c) > 126 for c in decoded if c not in '\n\t'):
            unusual.append(detail)

    # Roundtrip check
    decoded_text = enc.decode(tokens)
    roundtrip_ok = decoded_text == text

    return {
        "original_text": text,
        "original_length": len(text),
        "token_count": len(tokens),
        "tokens_per_char": len(tokens) / len(text) if text else 0,
        "token_details": token_details,
        "unusual_tokens": unusual,
        "roundtrip_ok": roundtrip_ok,
        "roundtrip_diff": None if roundtrip_ok else {
            "original": repr(text),
            "decoded": repr(decoded_text)
        }
    }

def visualize_tokens(text: str, model: str = "gpt-4") -> str:
    """
    Visualize tokenization with colored output.
    """
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)

    # ANSI color codes
    colors = [
        '\033[91m',  # Red
        '\033[92m',  # Green
        '\033[93m',  # Yellow
        '\033[94m',  # Blue
        '\033[95m',  # Magenta
        '\033[96m',  # Cyan
    ]
    reset = '\033[0m'

    result = []
    for i, t in enumerate(tokens):
        color = colors[i % len(colors)]
        decoded = enc.decode([t])
        # Show spaces explicitly
        display = decoded.replace(' ', '·').replace('\n', '↵\n')
        result.append(f"{color}[{display}]{reset}")

    return ''.join(result)

def find_token_boundaries(text: str, model: str = "gpt-4") -> List[int]:
    """
    Find character positions where tokens start.
    """
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)

    boundaries = [0]
    current_pos = 0

    for t in tokens:
        decoded = enc.decode([t])
        current_pos += len(decoded)
        boundaries.append(current_pos)

    return boundaries

def compare_tokenizers(text: str, models: List[str]) -> dict:
    """
    Compare tokenization across different models.
    """
    results = {}

    for model in models:
        try:
            enc = tiktoken.encoding_for_model(model)
            tokens = enc.encode(text)
            results[model] = {
                "token_count": len(tokens),
                "tokens": tokens[:20],  # First 20
                "decoded": [enc.decode([t]) for t in tokens[:20]]
            }
        except Exception as e:
            results[model] = {"error": str(e)}

    return results

# Usage examples
if __name__ == "__main__":
    test_text = "Hello, world! 你好世界 🎉"

    # Debug tokenization
    debug = debug_tokenization(test_text)
    print(f"Token count: {debug['token_count']}")
    print(f"Tokens per char: {debug['tokens_per_char']:.2f}")
    print(f"Unusual tokens: {debug['unusual_tokens']}")

    # Visualize
    print("\nVisualization:")
    print(visualize_tokens(test_text))

    # Compare tokenizers
    print("\nComparison:")
    comparison = compare_tokenizers(test_text, ["gpt-3.5-turbo", "gpt-4", "gpt-4o"])
    for model, result in comparison.items():
        print(f"  {model}: {result.get('token_count', 'N/A')} tokens")

Summary

Tokenization converts text to numbers that models can process. The key algorithms:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    KEY TAKEAWAYS                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SUBWORD TOKENIZATION:                                                   │
│  • Balances vocabulary size and coverage                              │
│  • Handles unknown words via composition                              │
│  • Standard for all modern LLMs                                       │
│                                                                          │
│  BPE (Byte Pair Encoding):                                              │
│  • Iteratively merge most frequent pairs                              │
│  • Simple and effective                                                │
│  • Used by: GPT series, LLaMA 3                                       │
│                                                                          │
│  WordPiece:                                                              │
│  • Merge based on likelihood, not frequency                           │
│  • Used by: BERT family                                               │
│                                                                          │
│  Unigram:                                                                │
│  • Probabilistic model, prune vocabulary                              │
│  • Often more linguistically meaningful                               │
│  • Used by: T5, LLaMA 1/2, Mistral (via SentencePiece)               │
│                                                                          │
│  SentencePiece:                                                          │
│  • Language-agnostic (no pre-tokenization)                            │
│  • Supports BPE and Unigram                                           │
│  • Best for multilingual models                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  LIBRARIES:                                                              │
│  ──────────                                                              │
│                                                                          │
│  tiktoken:                                                               │
│  • OpenAI's fast BPE implementation (Rust)                            │
│  • Encodings: gpt2, cl100k_base, o200k_base                          │
│  • Best for OpenAI model compatibility                                │
│                                                                          │
│  HuggingFace Tokenizers:                                                │
│  • Unified API for all algorithms                                     │
│  • Train custom tokenizers from scratch                               │
│  • Offset tracking, batching, streaming                               │
│                                                                          │
│  SentencePiece:                                                          │
│  • Google's language-agnostic tokenizer                               │
│  • Best for multilingual and Asian languages                          │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  2024-2025 TRENDS:                                                       │
│  • Larger vocabularies (100K-256K) for efficiency                    │
│  • Llama 3 switched SentencePiece → tiktoken-style BPE              │
│  • Code-specialized tokenizers with FIM tokens                       │
│  • More special tokens for chat, tools, multimodal                   │
│  • Multimodal tokens for vision, audio, video                        │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  PRACTICAL ADVICE:                                                       │
│  • Always use the model's exact tokenizer                             │
│  • Count tokens, not words, for context limits                        │
│  • Test tokenization for your specific use case                       │
│  • Cache tokenization results in production                           │
│  • Monitor token efficiency by language/domain                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Part XVII: Advanced Topics

Token Healing

When generating text, token boundaries can cause issues. "Token healing" fixes incomplete tokens at prompt boundaries:

Python

# The problem: prompt ends mid-token
prompt = "The URL is https://example"
# Model might continue: ".com" but tokenizer split "example" oddly

# Token healing: re-tokenize the boundary
def heal_tokens(prompt: str, completion: str, tokenizer) -> str:
    """
    Fix token boundary issues between prompt and completion.

    Some models (like Guidance, LM Studio) do this automatically.
    """
    # Find overlap region
    overlap_chars = 10  # Check last N chars of prompt

    # Re-tokenize the boundary region together
    boundary = prompt[-overlap_chars:] + completion[:overlap_chars]
    healed_tokens = tokenizer.encode(boundary)

    # The "healed" version may differ from separate encoding
    prompt_tokens = tokenizer.encode(prompt[-overlap_chars:])
    completion_tokens = tokenizer.encode(completion[:overlap_chars])

    if healed_tokens != prompt_tokens + completion_tokens:
        # Boundary was split differently - use healed version
        return tokenizer.decode(healed_tokens)

    return completion

# Example: "hello" + "world" might tokenize differently than "helloworld"

Extending Existing Tokenizers

For fine-tuning, you may need to add domain-specific tokens:

Python

from transformers import AutoTokenizer

# Load existing tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
original_vocab_size = len(tokenizer)

# Add new tokens
new_tokens = [
    "<|tool_call|>",
    "<|tool_result|>",
    "<|thinking|>",
    "</|thinking|>",
    # Domain-specific terms
    "electroencephalography",
    "immunohistochemistry",
]

num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens. Vocab: {original_vocab_size} → {len(tokenizer)}")

# IMPORTANT: Resize model embeddings to match!
# model.resize_token_embeddings(len(tokenizer))

# New tokens get IDs starting from original vocab size
for token in new_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    print(f"  {token}: {token_id}")

# Save extended tokenizer
tokenizer.save_pretrained("my-extended-tokenizer")

# The new embeddings are randomly initialized
# They'll learn during fine-tuning

Subword Regularization

Training technique for better generalization:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    SUBWORD REGULARIZATION                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  THE IDEA:                                                               │
│  ──────────                                                              │
│                                                                          │
│  "unaffable" can be tokenized multiple ways:                           │
│  • ["un", "aff", "able"]                                                │
│  • ["una", "ff", "able"]                                                │
│  • ["un", "a", "ff", "able"]                                            │
│                                                                          │
│  Standard: Always use the "best" segmentation                          │
│  Regularization: Randomly sample different segmentations during train  │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  BENEFITS:                                                               │
│  ─────────                                                               │
│                                                                          │
│  1. More robust to tokenization variations                             │
│  2. Better handling of typos and rare words                            │
│  3. Improved generalization                                            │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  IMPLEMENTATION (SentencePiece):                                        │
│  ────────────────────────────────                                        │
│                                                                          │
│  # During training, sample from multiple segmentations                 │
│  import sentencepiece as spm                                           │
│                                                                          │
│  sp = spm.SentencePieceProcessor(model_file='model.model')            │
│                                                                          │
│  # NBest sampling: sample from top-N segmentations                     │
│  pieces = sp.encode('hello', out_type=str,                            │
│                      enable_sampling=True,                             │
│                      alpha=0.1,  # Smoothing parameter                │
│                      nbest_size=-1)  # Sample from all                │
│                                                                          │
│  This is used during TRAINING only, not inference.                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Tokenization-Free Approaches

Some models skip tokenization entirely:

Code

┌─────────────────────────────────────────────────────────────────────────┐
│                    TOKENIZATION-FREE MODELS                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  BYTE-LEVEL MODELS:                                                      │
│  ──────────────────                                                      │
│                                                                          │
│  ByT5 (Google, 2021):                                                   │
│  • Input: raw UTF-8 bytes (0-255)                                      │
│  • No tokenizer needed                                                 │
│  • More robust to noise/typos                                          │
│  • BUT: 4-6× longer sequences                                          │
│  • Slower training and inference                                       │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  MEGABYTE (Meta, 2023):                                                 │
│  • Hierarchical: bytes → patches → global model                        │
│  • Patches of ~8 bytes processed locally                              │
│  • Global model sees compressed patches                                │
│  • Best of both: byte-level + efficiency                              │
│                                                                          │
│  Architecture:                                                          │
│  Bytes → Local Encoder (small) → Patch Embeddings → Global Model      │
│                                                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                          │
│  WHEN TO USE BYTE-LEVEL:                                                │
│  ────────────────────────                                                │
│                                                                          │
│  ✓ Noisy text (social media, OCR)                                     │
│  ✓ Code with unusual syntax                                           │
│  ✓ Multilingual with many scripts                                     │
│  ✓ DNA/protein sequences                                              │
│  ✗ Typical NLP tasks (subword is more efficient)                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Structured Data Tokenization

Handling JSON, XML, and tables:

Python

import tiktoken
import json

enc = tiktoken.get_encoding("cl100k_base")

# JSON tokenization is expensive
data = {"name": "John", "age": 30, "city": "NYC"}
json_str = json.dumps(data)
tokens = enc.encode(json_str)
print(f"JSON: {len(json_str)} chars → {len(tokens)} tokens")
# JSON uses many structural tokens: {, }, ", :, etc.

# Optimization 1: Minify JSON (remove whitespace)
compact = json.dumps(data, separators=(',', ':'))
print(f"Compact: {len(compact)} chars → {len(enc.encode(compact))} tokens")

# Optimization 2: Use shorter keys
short_data = {"n": "John", "a": 30, "c": "NYC"}
short_json = json.dumps(short_data, separators=(',', ':'))
print(f"Short keys: {len(enc.encode(short_json))} tokens")

# Optimization 3: For repeated structures, consider custom format
# Instead of: [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
# Use: "1,2|3,4" with schema defined in system prompt

# Table tokenization
table = """
| Name  | Age | City |
|-------|-----|------|
| John  | 30  | NYC  |
| Jane  | 25  | LA   |
"""
print(f"Markdown table: {len(enc.encode(table))} tokens")

# CSV is more token-efficient
csv = "Name,Age,City\nJohn,30,NYC\nJane,25,LA"
print(f"CSV: {len(enc.encode(csv))} tokens")

Sources

Neural Machine Translation of Rare Words with Subword Units - Original BPE for NMT (Sennrich et al., 2016)
Google's Neural Machine Translation System - WordPiece tokenization
SentencePiece: A simple and language independent subword tokenizer - SentencePiece paper (Kudo & Richardson, 2018)
Subword Regularization - Training with multiple segmentations (Kudo, 2018)
Language Models are Unsupervised Multitask Learners - GPT-2 byte-level BPE
ByT5: Towards a Token-Free Future - Byte-level transformers (Xue et al., 2021)
MEGABYTE: Predicting Million-byte Sequences - Hierarchical byte-level (Yu et al., 2023)
HuggingFace Tokenizers Documentation - Fast tokenizers library
tiktoken GitHub - OpenAI's BPE implementation
Llama 3 Model Card - Llama 3 tokenizer details
The Tokenizer Playground - OpenAI's interactive tokenizer tool
StarCoder: A State-of-the-Art LLM for Code - Code tokenization (2023)
DeepSeek-Coder - Fill-in-the-middle tokenization (2024)

Frequently Asked Questions

Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.

EducationLLMs

Transformer Architecture: A Complete Deep Dive

Deep exploration of the transformer architecture—from embedding layers through attention and feed-forward networks to the output head. Understand why decoder-only models dominate, how residual connections enable deep networks, and the engineering decisions behind GPT, Llama, and modern LLMs.

30 min read

EducationLLMs

LLM Pre-training: Building Foundation Models from Scratch

Field guide to pre-training large language models—from data curation and architecture decisions to scaling laws and distributed training infrastructure. Understanding how GPT, Llama, and other foundation models are built.

15 min read

LLMsML Engineering

Context Extension: How LLMs Scale Beyond Training Length

Deep dive into context extension techniques—how models trained on 4K tokens extrapolate to 128K+. Understand RoPE scaling, Position Interpolation, NTK-aware scaling, YaRN, and the mathematics of long-context LLMs.

5 min read

LLMsML Engineering

Attention Mechanisms: From Self-Attention to FlashAttention

Detailed walkthrough of attention mechanisms—the core innovation powering modern LLMs. From the intuition behind self-attention to the engineering of FlashAttention, understand how transformers actually work.

7 min read

LLMsML Engineering

Text Generation & Decoding Strategies: A Complete Guide

Hands-on guide to how LLMs actually generate text—from greedy decoding to beam search, temperature scaling, nucleus sampling, speculative decoding, and structured generation. Master the techniques that control LLM output quality, creativity, and speed.

16 min read

LLMsML Engineering

LLM Inference Optimization: From Quantization to Speculative Decoding

Practical guide to optimizing LLM inference for production—covering quantization, attention optimization, batching strategies, and deployment frameworks.

12 min read

LLMsML Engineering

Open-Source LLMs: The Complete 2025 Guide

Hands-on guide to open-source LLMs—Llama 4, Qwen3, DeepSeek V3.2, Mistral Large 3, Kimi K2, GLM-4.7 and more. Detailed benchmarks, hardware requirements, deployment strategies, and practical recommendations for production use.

3 min read

LLMsMultimodal

Multimodal LLMs: Vision, Audio, and Beyond

Field guide to multimodal LLMs—vision-language models, audio understanding, video comprehension, and any-to-any models. Architecture deep dives, benchmarks, implementation patterns, and production deployment.

12 min read

EducationRAG

Building Production-Ready RAG Systems: Lessons from the Field

Production-focused guide to building Retrieval-Augmented Generation systems that actually work in production, based on real-world experience at Goji AI.

16 min read

Table of Contents

Why Tokenization Matters

Part I: Tokenization Fundamentals

What is Tokenization?

The Out-of-Vocabulary Problem

Part II: Byte Pair Encoding (BPE)

The Algorithm

BPE Tokenization (Inference)

BPE Implementation

Part III: WordPiece

The BERT Tokenizer

BPE vs WordPiece

Part IV: Unigram Language Model

A Different Approach

Part V: SentencePiece

Language-Agnostic Tokenization

Which Tokenizer Do Models Use?

Part VI: Tokenization Quirks and Issues

Common Problems

Tokenization Best Practices

Part VII: Recent Innovations (2024-2025)

Llama 3's Tokenizer Switch

Modern Vocabulary Size Trends

Special Tokens in Modern LLMs

Tokenization for Code

Updated Tokenizer Recommendations

Part VIII: tiktoken Deep Dive

OpenAI's Tokenization Library

tiktoken Encodings

Using tiktoken

tiktoken with Special Tokens

Creating Custom tiktoken Encodings

tiktoken Regex Patterns

tiktoken Performance Optimization

Part IX: HuggingFace Tokenizers

The Fast Tokenizers Library

Building a Tokenizer from Scratch

Pre-Tokenizers in Detail

Post-Processors for Special Tokens

Encoding with Offset Tracking

Batch Encoding with Padding and Truncation

Part X: Unicode and Byte-Level BPE

The Unicode Challenge

GPT-2's Byte-to-Character Mapping

Handling Multilingual Text

Unicode Normalization

Emoji Tokenization

Part XI: Tokenization Benchmarks

Token Efficiency Comparison

Cost Implications

Benchmark Code

Part XII: Training Your Own Tokenizer

When to Train Custom Tokenizers

Complete Training Pipeline

Vocabulary Size Selection

Data Preparation for Tokenizer Training

Part XIII: Domain-Specific Tokenization

Medical/Scientific Tokenization

Code Tokenization

Fill-in-the-Middle (FIM) Tokens

Chemical and Mathematical Notation

Part XIV: Multimodal Tokenization

Vision Tokenization

Audio Tokenization

Video Tokenization

Part XV: Production Considerations

Tokenization Caching

Streaming Tokenization

Token Budget Management

Tokenization Monitoring

Part XVI: Debugging Tokenization Issues

Common Issues and Solutions

Debugging Tools

Summary

Part XVII: Advanced Topics

Token Healing

Extending Existing Tokenizers

Subword Regularization

Tokenization-Free Approaches

Structured Data Tokenization