Tokenization Deep Dive: BPE, WordPiece, and SentencePiece
A comprehensive deep dive into tokenization—how LLMs convert text to numbers. Understand BPE, WordPiece, Unigram, and SentencePiece, and why tokenization matters for model performance.
Why Tokenization Matters
Before a language model can process text, it must convert characters into numbers. This conversion—tokenization—is more important than it might seem. Tokenization determines:
- Vocabulary size: How many unique tokens the model knows
- Sequence length: How many tokens a text becomes (affects compute cost)
- Out-of-vocabulary handling: What happens with unknown words
- Multilingual ability: How well non-English text is represented
- Efficiency: Tokens per character ratio affects cost
2025 vocabulary size trends: The field has seen dramatic vocabulary expansion. Qwen uses a 151,646-token vocabulary optimized for both English and Chinese (24,953 Chinese tokens). Llama-3 jumped to 128,256 tokens, up from Llama-2's 32K. Gemma-2 sits at the high end with 256K tokens, and DeepSeek-V3 uses a roughly 128K-token vocabulary alongside its 128K context window.
ByteLevel BPE dominates: According to recent analysis, models like GPT-2, GPT-3, Llama-3, Falcon, and Qwen use ByteLevel BPE, which operates at the byte level to ensure any input—regardless of language or special characters—can be tokenized and perfectly reversed. SentencePiece BPE remains popular in Mistral, Llama-2, and Yi.
Why vocabulary size is expanding: Research shows that larger vocabularies improve multilingual capability and reduce sequence length (fewer tokens per text). The tradeoff is embedding table size—151K vocab at 4096 dimensions = 620M parameters just for embeddings.
Poor tokenization can cripple an otherwise good model. Understanding tokenization helps you choose the right tokenizer, understand model limitations, and debug unexpected behaviors.
Part I: Tokenization Fundamentals
What is Tokenization?
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENIZATION OVERVIEW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE PROCESS: │
│ ──────────── │
│ │
│ Text: "Hello, world!" │
│ │
│ │ │
│ ▼ Tokenization │
│ │
│ Tokens: ["Hello", ",", " world", "!"] │
│ │
│ │ │
│ ▼ Token → ID mapping │
│ │
│ IDs: [15496, 11, 995, 0] │
│ │
│ These IDs are what the model actually processes. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DIFFERENT GRANULARITIES: │
│ ──────────────────────── │
│ │
│ CHARACTER-LEVEL: │
│ "Hello" → ["H", "e", "l", "l", "o"] (5 tokens) │
│ + Small vocabulary (~100-300 characters) │
│ + Handles any text │
│ - Very long sequences │
│ - Hard for model to learn word meanings │
│ │
│ WORD-LEVEL: │
│ "Hello" → ["Hello"] (1 token) │
│ + Semantic units preserved │
│ - Huge vocabulary (100K+ words) │
│ - Can't handle unknown words ("OOV problem") │
│ │
│ SUBWORD-LEVEL (Modern approach): │
│ "unhappiness" → ["un", "happiness"] (2 tokens) │
│ + Moderate vocabulary (30K-100K) │
│ + Handles unknown words via subword composition │
│ + Captures morphology (prefixes, suffixes) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Out-of-Vocabulary Problem
Word-level tokenization has a fatal flaw:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE OOV PROBLEM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ WORD-LEVEL TOKENIZATION: │
│ ──────────────────────── │
│ │
│ Training vocabulary: {"the", "cat", "sat", "on", "mat", ...} │
│ (100,000 most common words) │
│ │
│ Input: "The ChatGPT model is transformative" │
│ │
│ "ChatGPT" → NOT IN VOCABULARY → [UNK] token │
│ "transformative" → NOT IN VOCABULARY → [UNK] │
│ │
│ Result: "The [UNK] model is [UNK]" │
│ │
│ All meaning from new/rare words is LOST! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY THIS IS SEVERE: │
│ ─────────────────── │
│ │
│ • Technical terms: "PyTorch", "Kubernetes", "GraphQL" │
│ • Names: "Enrico", "Anthropic", "OpenAI" │
│ • New words: "COVID-19", "NFT", "ChatGPT" │
│ • Misspellings: "teh", "recieve" │
│ • Compound words: "unprecedented", "unbelievable" │
│ • Other languages: Any non-English word │
│ │
│ Real text is FULL of rare/new words! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SUBWORD SOLUTION: │
│ ───────────────── │
│ │
│ "ChatGPT" → ["Chat", "G", "PT"] │
│ "transformative" → ["transform", "ative"] │
│ "unprecedented" → ["un", "pre", "ced", "ented"] │
│ │
│ Unknown words decompose into known subwords. │
│ Meaning is partially preserved! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part II: Byte Pair Encoding (BPE)
The Algorithm
BPE is the most widely used subword tokenization algorithm (GPT-2, GPT-3, GPT-4, LLaMA, etc.):
┌─────────────────────────────────────────────────────────────────────────┐
│ BYTE PAIR ENCODING (BPE) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRAINING ALGORITHM: │
│ ─────────────────── │
│ │
│ 1. Start with character vocabulary │
│ 2. Count all adjacent pairs in corpus │
│ 3. Merge most frequent pair into new token │
│ 4. Repeat until desired vocabulary size │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: │
│ ──────── │
│ │
│ Training corpus: "low lower lowest" │
│ │
│ Initial vocabulary: {l, o, w, e, r, s, t, ' '} (characters) │
│ │
│ Step 0: Split words into characters │
│ "low" → [l, o, w] │
│ "lower" → [l, o, w, e, r] │
│ "lowest" → [l, o, w, e, s, t] │
│ │
│ Step 1: Count pairs │
│ (l, o): 3 ← MOST FREQUENT │
│ (o, w): 3 │
│ (w, e): 2 │
│ (e, r): 1 │
│ (e, s): 1 │
│ (s, t): 1 │
│ │
│ Merge (l, o) → "lo" │
│ "low" → [lo, w] │
│ "lower" → [lo, w, e, r] │
│ "lowest" → [lo, w, e, s, t] │
│ │
│ Step 2: Count pairs again │
│ (lo, w): 3 ← MOST FREQUENT │
│ (w, e): 2 │
│ (e, r): 1 │
│ (e, s): 1 │
│ (s, t): 1 │
│ │
│ Merge (lo, w) → "low" │
│ "low" → [low] │
│ "lower" → [low, e, r] │
│ "lowest" → [low, e, s, t] │
│ │
│ Step 3: Count pairs │
│ (low, e): 2 ← MOST FREQUENT │
│ (e, r): 1 │
│ (e, s): 1 │
│ (s, t): 1 │
│ │
│ Merge (low, e) → "lowe" │
│ "low" → [low] │
│ "lower" → [lowe, r] │
│ "lowest" → [lowe, s, t] │
│ │
│ Continue until vocabulary size reached... │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FINAL VOCABULARY (after more merges): │
│ {l, o, w, e, r, s, t, lo, low, lowe, lower, lowest, ...} │
│ │
│ Each merge is recorded: [(l,o)→lo, (lo,w)→low, ...] │
│ These merges are applied in order during tokenization. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
BPE Tokenization (Inference)
┌─────────────────────────────────────────────────────────────────────────┐
│ BPE TOKENIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TOKENIZATION ALGORITHM: │
│ ─────────────────────── │
│ │
│ 1. Split input into characters (or bytes) │
│ 2. Apply learned merges in order │
│ 3. Stop when no more merges apply │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: │
│ ──────── │
│ │
│ Input: "lowest" │
│ Merges: [(l,o)→lo, (lo,w)→low, (low,e)→lowe, ...] │
│ │
│ Step 0: [l, o, w, e, s, t] │
│ Apply (l,o)→lo: [lo, w, e, s, t] │
│ Apply (lo,w)→low: [low, e, s, t] │
│ Apply (low,e)→lowe: [lowe, s, t] │
│ No more merges apply to remaining pairs │
│ │
│ Result: ["lowe", "s", "t"] │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ HANDLING UNKNOWN SUBSTRINGS: │
│ ──────────────────────────── │
│ │
│ Input: "xyzzy" (rare word) │
│ │
│ If "x", "y", "z" are in vocabulary as individual characters: │
│ [x, y, z, z, y] │
│ │
│ If some characters unknown: Use byte fallback │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ BYTE-LEVEL BPE (GPT-2 and later): │
│ ───────────────────────────────── │
│ │
│ Start with 256 byte tokens instead of characters. │
│ Guarantees any input can be tokenized! │
│ │
│ Base vocabulary: {0x00, 0x01, ..., 0xFF} │
│ + Learned merges: {Ġthe, Ġand, ing, ...} │
│ │
│ "Ġ" represents a space preceding the token (common in GPT-2) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
BPE Implementation
from collections import Counter, defaultdict
import re

class BPETokenizer:
    """Simplified BPE tokenizer implementation."""

    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.merges = []  # List of (pair, merged) tuples
        self.vocab = {}   # Token to ID mapping

    def train(self, corpus: str):
        """Train BPE on a corpus."""
        # Tokenize corpus into words, then characters
        words = corpus.split()
        word_freqs = Counter(words)
        # Split words into characters with end-of-word marker
        splits = {
            word: list(word) + ['</w>']
            for word in word_freqs
        }
        # Start with character vocabulary
        vocab = set()
        for word in splits:
            vocab.update(splits[word])
        # Merge until desired vocab size
        while len(vocab) < self.vocab_size:
            # Count all pairs
            pair_freqs = defaultdict(int)
            for word, freq in word_freqs.items():
                symbols = splits[word]
                for i in range(len(symbols) - 1):
                    pair = (symbols[i], symbols[i + 1])
                    pair_freqs[pair] += freq
            if not pair_freqs:
                break
            # Find most frequent pair
            best_pair = max(pair_freqs, key=pair_freqs.get)
            merged = best_pair[0] + best_pair[1]
            # Record merge
            self.merges.append((best_pair, merged))
            vocab.add(merged)
            # Apply merge to all words
            for word in splits:
                symbols = splits[word]
                new_symbols = []
                i = 0
                while i < len(symbols):
                    if (i < len(symbols) - 1 and
                            symbols[i] == best_pair[0] and
                            symbols[i + 1] == best_pair[1]):
                        new_symbols.append(merged)
                        i += 2
                    else:
                        new_symbols.append(symbols[i])
                        i += 1
                splits[word] = new_symbols
        # Create vocab mapping
        self.vocab = {token: i for i, token in enumerate(sorted(vocab))}

    def tokenize(self, text: str) -> list:
        """Tokenize text using learned merges."""
        words = text.split()
        tokens = []
        for word in words:
            # Start with characters
            symbols = list(word) + ['</w>']
            # Apply merges in order
            for (pair, merged) in self.merges:
                new_symbols = []
                i = 0
                while i < len(symbols):
                    if (i < len(symbols) - 1 and
                            symbols[i] == pair[0] and
                            symbols[i + 1] == pair[1]):
                        new_symbols.append(merged)
                        i += 2
                    else:
                        new_symbols.append(symbols[i])
                        i += 1
                symbols = new_symbols
            tokens.extend(symbols)
        return tokens

    def encode(self, text: str) -> list:
        """Convert text to token IDs."""
        tokens = self.tokenize(text)
        return [self.vocab.get(t, self.vocab.get('<unk>', 0)) for t in tokens]

    def decode(self, ids: list) -> str:
        """Convert token IDs back to text."""
        id_to_token = {v: k for k, v in self.vocab.items()}
        tokens = [id_to_token.get(i, '<unk>') for i in ids]
        text = ''.join(tokens)
        text = text.replace('</w>', ' ')
        return text.strip()
Part III: WordPiece
The BERT Tokenizer
WordPiece is used by BERT and similar models. It's similar to BPE but uses a different merge criterion:
┌─────────────────────────────────────────────────────────────────────────┐
│ WORDPIECE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ KEY DIFFERENCE FROM BPE: │
│ ──────────────────────── │
│ │
│ BPE: Merge most FREQUENT pair │
│ WordPiece: Merge pair that maximizes LIKELIHOOD │
│ │
│ Likelihood score for merging (a, b) → ab: │
│ │
│ score(a, b) = freq(ab) / (freq(a) × freq(b)) │
│ │
│ This prefers merges where ab appears more often than │
│ expected by chance given individual frequencies. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: │
│ ──────── │
│ │
│ freq("th") = 100 │
│ freq("t") = 500 │
│ freq("h") = 300 │
│ score("t", "h") = 100 / (500 × 300) = 0.00067 │
│ │
│ freq("un") = 80 │
│ freq("u") = 100 │
│ freq("n") = 200 │
│ score("u", "n") = 80 / (100 × 200) = 0.004 │
│ │
│ "un" has higher score despite lower frequency! │
│ It's more "surprising" to see u and n together. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WORDPIECE TOKENIZATION: │
│ ─────────────────────── │
│ │
│ Input: "unhappiness" │
│ Output: ["un", "##happi", "##ness"] │
│ │
│ The "##" prefix indicates continuation of a word. │
│ First subword has no prefix. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TOKENIZATION ALGORITHM (greedy): │
│ ───────────────────────────────── │
│ │
│ 1. For each word, find longest prefix in vocabulary │
│ 2. Add token, continue with remainder │
│ 3. If single character not in vocab, use [UNK] │
│ │
│ "unhappiness" │
│ → "un" is in vocab, remainder "happiness" │
│ → "happiness" → longest match "##happi" (maybe), remainder "ness" │
│ → "##ness" is in vocab │
│ → Result: ["un", "##happi", "##ness"] │
│ │
└─────────────────────────────────────────────────────────────────────────┘
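To make the greedy longest-match procedure concrete, here is a minimal sketch of WordPiece inference; the toy vocabulary is an assumption for illustration, not BERT's actual vocabulary:
def wordpiece_tokenize(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match WordPiece tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_token = None
        # Find the longest substring starting at `start` that is in the vocab
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur_token = piece
                break
            end -= 1
        if cur_token is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        tokens.append(cur_token)
        start = end
    return tokens

# Toy vocabulary (assumed for illustration)
vocab = {"un", "##happi", "##ness", "happy", "##s"}
print(wordpiece_tokenize("unhappiness", vocab))
# ['un', '##happi', '##ness']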
BPE vs WordPiece
┌─────────────────────────────────────────────────────────────────────────┐
│ BPE VS WORDPIECE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MERGE CRITERION: │
│ ──────────────── │
│ │
│ BPE: Frequency │
│ • Merge pairs that appear most often │
│ • Simple and fast │
│ • May create arbitrary subwords │
│ │
│ WordPiece: Likelihood │
│ • Merge pairs that are "surprising" together │
│ • Tends to create more linguistically meaningful units │
│ • Slightly more complex │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TOKENIZATION: │
│ ───────────── │
│ │
│ BPE: Apply merges in learned order │
│ • Deterministic: same merges → same tokenization │
│ • Can be slow for long sequences │
│ │
│ WordPiece: Greedy longest-match │
│ • Faster at inference │
│ • May produce different segmentations │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ USAGE: │
│ ────── │
│ │
│ BPE: GPT-2, GPT-3, GPT-4, LLaMA, most modern LLMs │
│ WordPiece: BERT, DistilBERT, ELECTRA │
│ │
│ In practice, both work well. BPE is more common now. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part IV: Unigram Language Model
A Different Approach
Unigram (used by SentencePiece) takes a fundamentally different approach:
┌─────────────────────────────────────────────────────────────────────────┐
│ UNIGRAM LANGUAGE MODEL │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ KEY IDEA: │
│ ───────── │
│ │
│ BPE/WordPiece: Build vocabulary bottom-up (merge characters) │
│ Unigram: Start with large vocabulary, PRUNE it down │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TRAINING: │
│ ───────── │
│ │
│ 1. Start with large vocabulary (all substrings up to length N) │
│ 2. For each token, compute how much removing it hurts likelihood │
│ 3. Remove tokens that hurt least (keep most useful) │
│ 4. Repeat until desired vocabulary size │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PROBABILISTIC MODEL: │
│ ──────────────────── │
│ │
│ Each token has a probability P(token). │
│ Tokenization probability: P(x₁) × P(x₂) × ... × P(xₙ) │
│ │
│ Given text "hello": │
│ Segmentation 1: ["hello"] → P("hello") │
│ Segmentation 2: ["hel", "lo"] → P("hel") × P("lo") │
│ Segmentation 3: ["h", "e", "l", "l", "o"] → P("h")×P("e")×... │
│ │
│ Choose segmentation with highest probability! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TOKENIZATION (Viterbi algorithm): │
│ ────────────────────────────────── │
│ │
│ Find most probable segmentation using dynamic programming. │
│ │
│ For each position, store best path to reach it. │
│ Time complexity: O(n × max_token_length) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ADVANTAGES: │
│ ─────────── │
│ │
│ • Probabilistic framework (can sample different tokenizations) │
│ • Often produces more linguistically meaningful tokens │
│ • Used by: T5, mT5, ALBERT, XLNet (via SentencePiece) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
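A minimal sketch of Viterbi segmentation over a toy unigram vocabulary; the token log-probabilities here are made up for illustration (real trainers estimate them with EM):
import math

def viterbi_segment(text: str, token_logprob: dict) -> list:
    """Find the most probable segmentation of `text` under a unigram model."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last token)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - 10), end):  # cap token length at 10
            piece = text[start:end]
            if piece in token_logprob and best[start][0] > -math.inf:
                score = best[start][0] + token_logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end to recover the best segmentation
    tokens, i = [], n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return list(reversed(tokens))

# Toy vocabulary with made-up log-probabilities (assumption for illustration)
vocab = {"h": -6.0, "e": -6.0, "l": -6.0, "o": -6.0,
         "hel": -4.0, "lo": -3.5, "hello": -5.0}
print(viterbi_segment("hello", vocab))
# ['hello']  (-5.0 beats ['hel', 'lo'] at -7.5 and the character split at -30)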
Part V: SentencePiece
Language-Agnostic Tokenization
SentencePiece is a library that implements both BPE and Unigram, with a key innovation: it consumes raw text directly as a stream of Unicode characters, with no language-specific pre-tokenization, making it language-agnostic:
┌─────────────────────────────────────────────────────────────────────────┐
│ SENTENCEPIECE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ KEY INNOVATIONS: │
│ ──────────────── │
│ │
│ 1. LANGUAGE-AGNOSTIC │
│ No pre-tokenization (no word splitting) │
│ Treats text as raw character/byte sequence │
│ Works for any language, including Chinese, Japanese │
│ │
│ 2. WHITESPACE AS TOKEN │
│ Whitespace is just another character │
│ Can be part of tokens: "▁the" (▁ = space) │
│ Enables lossless reconstruction │
│ │
│ 3. SUPPORTS MULTIPLE ALGORITHMS │
│ BPE mode │
│ Unigram mode (default, often better) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TOKENIZATION EXAMPLE: │
│ ───────────────────── │
│ │
│ Input: "Hello World" │
│ │
│ Traditional (GPT-2 style): │
│ Pre-tokenize on spaces: ["Hello", "World"] │
│ Then BPE: ["Hello", "ĠWorld"] (Ġ indicates preceding space) │
│ │
│ SentencePiece: │
│ Raw input: "Hello World" │
│ Tokenize: ["▁Hello", "▁World"] (▁ is the space character) │
│ │
│ Key difference: SentencePiece puts space at START of word. │
│ This works better for languages without spaces (Chinese, Japanese). │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ USAGE: │
│ ────── │
│ │
│ import sentencepiece as spm │
│ │
│ # Train │
│ spm.SentencePieceTrainer.train( │
│ input='corpus.txt', │
│ model_prefix='mymodel', │
│ vocab_size=32000, │
│ model_type='unigram', # or 'bpe' │
│ ) │
│ │
│ # Load and use │
│ sp = spm.SentencePieceProcessor() │
│ sp.load('mymodel.model') │
│ │
│ tokens = sp.encode_as_pieces('Hello World') │
│ # ['▁Hello', '▁World'] │
│ │
│ ids = sp.encode_as_ids('Hello World') │
│ # [1234, 5678] │
│ │
│ text = sp.decode_ids([1234, 5678]) │
│ # 'Hello World' │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Which Tokenizer Do Models Use?
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENIZERS BY MODEL │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ MODEL TOKENIZER VOCAB SIZE │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 BPE (byte-level) 50,257 │
│ GPT-3 BPE (byte-level) 50,257 │
│ GPT-4 BPE (cl100k) 100,256 │
│ │
│ BERT WordPiece 30,522 │
│ RoBERTa BPE 50,265 │
│ │
│ T5 SentencePiece 32,000 │
│ LLaMA SentencePiece 32,000 │
│ LLaMA 2 SentencePiece 32,000 │
│ LLaMA 3 BPE (tiktoken) 128,256 │
│ │
│ Mistral SentencePiece 32,000 │
│ Mixtral SentencePiece 32,000 │
│ │
│ Qwen3 (2025) BBPE 151,669 (119 languages) │
│ Kimi K2 (2025) BPE ~128,000 │
│ Claude BPE variant ~100,000 │
│ Gemini 2 SentencePiece ~256,000 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ VOCABULARY SIZE TRENDS (2025): │
│ ───────────────────────────── │
│ │
│ 2020: 32K-50K typical (LLaMA, Mistral) │
│ 2024: 100K-128K becoming standard (Llama 3, GPT-4) │
│ 2025: 150K-256K for multilingual (Qwen3, Gemini 2) │
│ │
│ Larger vocabulary: │
│ + More tokens as single units (15% fewer tokens in Llama 3 vs 2) │
│ + Shorter sequences (faster, cheaper inference) │
│ + Better multilingual (Qwen3: 119 languages) │
│ - Larger embedding matrix (but tiny vs model weights) │
│ │
│ Qwen3 note: even with its large vocab, generation can be slower in │
│ some languages (e.g., Hindi, Italian) where the tokenizer is less │
│ efficient (more tokens per word). │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part VI: Tokenization Quirks and Issues
Common Problems
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENIZATION ISSUES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. INCONSISTENT NUMBERS: │
│ ──────────────────────── │
│ │
│ "123" → ["123"] (1 token) │
│ "1234" → ["12", "34"] (2 tokens) │
│ "12345" → ["123", "45"] (2 tokens) │
│ │
│ Different numbers tokenize differently! │
│ This can hurt arithmetic performance. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2. WHITESPACE SENSITIVITY: │
│ ────────────────────────── │
│ │
│ "Hello" → ["Hello"] │
│ " Hello" → ["Ġ", "Hello"] or ["ĠHello"] │
│ " Hello" → ["Ġ", "Ġ", "Hello"] │
│ │
│ Leading spaces change tokenization! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 3. NON-ENGLISH INEFFICIENCY: │
│ ──────────────────────────── │
│ │
│ English: "Hello" → 1-2 tokens │
│ Chinese: "你好" → 2-4 tokens (each character separate) │
│ Arabic: "مرحبا" → 4-8 tokens │
│ │
│ Tokenizers trained on mostly English are inefficient for others. │
│ Non-English text uses more tokens → higher cost, shorter context. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 4. CODE TOKENIZATION: │
│ ───────────────────── │
│ │
│ "def function():" → ["def", "Ġfunction", "():", ...] │
│ " return x" → ["Ġ", "Ġ", "Ġ", "Ġreturn", "Ġx"] │
│ │
│ Indentation creates many tokens! │
│ Some tokenizers have special handling for spaces. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 5. CONTEXT LENGTH IMPLICATIONS: │
│ ─────────────────────────────── │
│ │
│ "4096 token context" means different things: │
│ • ~3000 English words │
│ • ~1500 Chinese characters │
│ • ~400 lines of code │
│ │
│ Token count ≠ word count ≠ character count │
│ │
└─────────────────────────────────────────────────────────────────────────┘
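You can observe these quirks directly with any tokenizer. A quick sketch using tiktoken; the exact token IDs and counts depend on the encoding, so treat the printed numbers as illustrative:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Numbers of different lengths split differently
for num in ["123", "1234", "12345"]:
    print(num, "->", len(enc.encode(num)), "tokens")

# Leading whitespace changes tokenization
for text in ["Hello", " Hello", "  Hello"]:
    print(repr(text), "->", enc.encode(text))

# Same meaning, different languages, different token counts
for text in ["Hello, how are you?", "你好，你好吗？", "مرحبا، كيف حالك؟"]:
    print(f"{len(enc.encode(text)):>3} tokens for {text!r}")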
Tokenization Best Practices
┌─────────────────────────────────────────────────────────────────────────┐
│ BEST PRACTICES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. USE THE MODEL'S TOKENIZER │
│ Always use the exact tokenizer the model was trained with. │
│ Different tokenizers → different token IDs → garbage output. │
│ │
│ 2. COUNT TOKENS, NOT WORDS │
│ When checking context limits, count tokens. │
│ Most APIs provide token counting methods. │
│ │
│ # OpenAI │
│ import tiktoken │
│ enc = tiktoken.encoding_for_model("gpt-4") │
│ tokens = enc.encode("Hello world") │
│ print(len(tokens)) │
│ │
│ 3. HANDLE SPECIAL TOKENS │
│ Be aware of special tokens: [CLS], [SEP], <|endoftext|>, etc. │
│ They affect sequence length and model behavior. │
│ │
│ 4. TEST NON-ENGLISH AND CODE │
│ If your application uses non-English text or code, │
│ test tokenization efficiency and quality. │
│ │
│ 5. CONSIDER TOKENIZATION IN PROMPTS │
│ Token boundaries affect model behavior. │
│ " hello" vs "hello" may produce different outputs. │
│ Be consistent with spacing. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
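The same checks work with HuggingFace tokenizers; a short sketch below uses bert-base-uncased purely as an example, so substitute the tokenizer of the model you actually call:
from transformers import AutoTokenizer

# Example model; always load the tokenizer that matches your model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello world"

# With vs. without special tokens ([CLS]/[SEP] for BERT)
with_special = tokenizer.encode(text)
without_special = tokenizer.encode(text, add_special_tokens=False)
print(len(with_special), len(without_special))  # e.g. 4 vs 2

# Inspect what the model actually sees
print(tokenizer.convert_ids_to_tokens(with_special))
# ['[CLS]', 'hello', 'world', '[SEP]']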
Part VII: Recent Innovations (2024-2025)
Llama 3's Tokenizer Switch
Meta made a significant change with Llama 3, switching from SentencePiece to a tiktoken-style BPE tokenizer:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLAMA 3 TOKENIZER CHANGES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LLAMA 1/2 vs LLAMA 3: │
│ ───────────────────── │
│ │
│ LLaMA 1/2: │
│ • SentencePiece BPE model (with byte fallback) │
│ • 32,000 vocabulary size │
│ • Works well but limited for code and non-English │
│ │
│ LLaMA 3: │
│ • tiktoken-style BPE │
│ • 128,256 vocabulary size (4× larger!) │
│ • Better efficiency across languages and code │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY THE SWITCH? │
│ ─────────────── │
│ │
│ 1. EFFICIENCY: │
│ Same text → fewer tokens → faster inference │
│ "Hello world!" in Llama 2: ~4 tokens │
│ "Hello world!" in Llama 3: ~3 tokens │
│ 15-25% reduction in token count on average │
│ │
│ 2. CODE HANDLING: │
│ Larger vocab includes more programming patterns │
│ "def function():" tokenizes more efficiently │
│ Better whitespace/indentation handling │
│ │
│ 3. MULTILINGUAL: │
│ More dedicated tokens for non-English scripts │
│ Chinese, Japanese, Korean much more efficient │
│ Still English-optimized but better balance │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ COMPATIBILITY NOTE: │
│ ─────────────────── │
│ │
│ Llama 3's tokenizer is NOT compatible with Llama 2! │
│ Different tokenizers = different token IDs. │
│ Must use matching tokenizer for each model. │
│ │
│ # Llama 3 │
│ from transformers import AutoTokenizer │
│ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B") │
│ │
│ # Llama 2 (different!) │
│ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b") │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Modern Vocabulary Size Trends
┌─────────────────────────────────────────────────────────────────────────┐
│ VOCABULARY SIZE EVOLUTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Model Year Vocab Size Notes │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 2019 50,257 Baseline modern BPE │
│ GPT-3 2020 50,257 Same as GPT-2 │
│ LLaMA 1 2023 32,000 SentencePiece │
│ LLaMA 2 2023 32,000 SentencePiece │
│ GPT-4 2023 100,256 cl100k_base │
│ Mistral 2023 32,000 SentencePiece │
│ LLaMA 3 2024 128,256 tiktoken-style │
│ Gemma 2024 256,000 Large multilingual vocab │
│ Gemini 1.5 2024 ~256,000 Large for multimodal │
│ Claude 3 2024 ~100,000 Estimated │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TREND: Vocabularies getting larger │
│ ───────────────────────────────── │
│ │
│ Benefits of larger vocabulary: │
│ • More words/phrases as single tokens │
│ • Shorter sequences → lower latency │
│ • Better multilingual coverage │
│ • Better code handling │
│ │
│ Cost: Larger embedding table │
│ • 128K vocab × 4096 dim × 4 bytes = 2GB │
│ • Tiny compared to 8B+ model weights │
│ │
│ Verdict: Worth it for efficiency gains. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
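The embedding-table cost in that table is simple arithmetic; a quick sanity check (the hidden dimension and dtype below are illustrative, not tied to any specific model):
def embedding_params(vocab_size: int, hidden_dim: int, bytes_per_param: int = 4) -> str:
    """Rough size of an (untied) input embedding table."""
    params = vocab_size * hidden_dim
    return f"{params / 1e6:.0f}M params, {params * bytes_per_param / 1e9:.2f} GB"

print(embedding_params(32_000, 4096))    # LLaMA 2 7B-style vocab: ~131M params
print(embedding_params(128_256, 4096))   # Llama 3-style vocab: ~525M params, ~2.1 GB in fp32
print(embedding_params(128_256, 4096, bytes_per_param=2))  # fp16/bf16: ~1.05 GB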
Special Tokens in Modern LLMs
┌─────────────────────────────────────────────────────────────────────────┐
│ SPECIAL TOKENS (2024-2025) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CHAT/INSTRUCTION TOKENS: │
│ ───────────────────────── │
│ │
│ LLaMA 3 Chat: │
│ <|begin_of_text|> │
│ <|start_header_id|>system<|end_header_id|> │
│ You are a helpful assistant. │
│ <|eot_id|> │
│ <|start_header_id|>user<|end_header_id|> │
│ Hello! │
│ <|eot_id|> │
│ <|start_header_id|>assistant<|end_header_id|> │
│ │
│ Mistral/Mixtral: │
│ [INST] User message [/INST] Assistant response │
│ │
│ ChatML (many models): │
│ <|im_start|>system\n...<|im_end|> │
│ <|im_start|>user\n...<|im_end|> │
│ <|im_start|>assistant\n...<|im_end|> │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TOOL/FUNCTION CALLING TOKENS: │
│ ────────────────────────────── │
│ │
│ Many models now have dedicated tokens for: │
│ • <tool_call> ... </tool_call> │
│ • <function> ... </function> │
│ • <|python_tag|> (for code execution) │
│ │
│ These help models distinguish structured output from free text. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MULTIMODAL TOKENS: │
│ ────────────────── │
│ │
│ Vision-language models add: │
│ • <image> or <|image|> for image placeholders │
│ • <video> for video │
│ • Special tokens for image patch positions │
│ │
│ LLaVA: <image> expands to 576 visual tokens │
│ Qwen-VL: Uses <img>...</img> tags │
│ │
└─────────────────────────────────────────────────────────────────────────┘
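In practice you rarely type these chat tokens by hand; the model's chat template inserts them. A sketch with the transformers API, where the model name is just an example and the exact rendered string depends on that model's template:
from transformers import AutoTokenizer

# Example instruction-tuned model; any chat model with a template works similarly
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Render the conversation with the model's own special tokens
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # contains <|begin_of_text|>, <|start_header_id|>..., <|eot_id|>

# Or get token IDs directly (what you would feed to the model)
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(len(input_ids), "tokens including special tokens")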
Tokenization for Code
┌─────────────────────────────────────────────────────────────────────────┐
│ CODE-OPTIMIZED TOKENIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CHALLENGES WITH CODE: │
│ ───────────────────── │
│ │
│ 1. Indentation creates many tokens │
│ 2. Variable names often split awkwardly │
│ 3. Special characters less efficiently encoded │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ STARCODER / CODELLAMA IMPROVEMENTS: │
│ ──────────────────────────────────── │
│ │
│ • Include common code patterns in vocabulary │
│ • "def ", "return ", "import " as single tokens │
│ • Better handling of indentation (tabs vs spaces) │
│ • Repository-aware tokenization (file paths, imports) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FILL-IN-THE-MIDDLE (FIM) TOKENS: │
│ ───────────────────────────────── │
│ │
│ Special tokens for code completion: │
│ │
│ <fim_prefix>def hello(): │
│ <fim_suffix> │
│ return result │
│ <fim_middle> │
│ │
│ Model fills in the middle given prefix and suffix. │
│ Used by: StarCoder, CodeLlama, DeepSeek-Coder │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ TOKEN EFFICIENCY (Python, approximate): │
│ ──────────────────────────────────────── │
│ │
│ Model Tokens per 1K chars Relative │
│ ───────────────────────────────────────────────────────────── │
│ GPT-3.5 400 1.0× │
│ GPT-4 350 0.88× │
│ CodeLlama 320 0.80× │
│ LLaMA 3 300 0.75× │
│ DeepSeek-Coder 290 0.73× │
│ │
│ Code-specialized models are ~25% more efficient on code. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
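Assembling a fill-in-the-middle prompt is just string concatenation with those special tokens. A hedged sketch using StarCoder-style token names; other models spell their FIM tokens differently, so check the model card:
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Build a prefix-suffix-middle (PSM) FIM prompt with StarCoder-style tokens."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prefix = "def hello():\n    "
suffix = "\n    return result"
prompt = build_fim_prompt(prefix, suffix)
# The model generates the "middle" after <fim_middle>, conditioned on both sides
print(prompt)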
Updated Tokenizer Recommendations
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENIZER SELECTION (2025) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FOR TRAINING NEW MODELS: │
│ ───────────────────────── │
│ │
│ English-focused: │
│ • Use tiktoken-style BPE with 100K+ vocabulary │
│ • Follow Llama 3's approach │
│ │
│ Multilingual: │
│ • Use SentencePiece with large vocabulary (256K+) │
│ • Ensure balanced language coverage in training data │
│ │
│ Code-specialized: │
│ • Include code patterns in vocabulary training data │
│ • Add FIM special tokens │
│ • Consider separate code tokenizer │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ FOR USING EXISTING MODELS: │
│ ─────────────────────────── │
│ │
│ • ALWAYS use the model's exact tokenizer │
│ • Never mix tokenizers between models │
│ • Use the tokenizer's encode/decode methods, not string manipulation │
│ │
│ # Correct │
│ from transformers import AutoTokenizer │
│ tokenizer = AutoTokenizer.from_pretrained("model-name") │
│ tokens = tokenizer.encode(text) │
│ │
│ # Wrong (don't do this!) │
│ tokens = text.split() # Not tokenization! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Part VIII: tiktoken Deep Dive
OpenAI's Tokenization Library
tiktoken is OpenAI's fast BPE tokenizer implementation, written in Rust with Python bindings. It's used by all OpenAI models and has become the de facto standard for high-performance tokenization:
┌─────────────────────────────────────────────────────────────────────────┐
│ TIKTOKEN ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ WHY TIKTOKEN? │
│ ───────────── │
│ │
│ Performance comparison (encoding 1M tokens): │
│ │
│ Library Time Relative │
│ ───────────────────────────────────────── │
│ tiktoken (Rust) 0.8s 1.0× │
│ HF tokenizers 1.2s 1.5× │
│ SentencePiece 3.5s 4.4× │
│ Python BPE 45s 56× │
│ │
│ tiktoken is 50× faster than pure Python implementations! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CORE ARCHITECTURE: │
│ ────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ tiktoken │ │
│ ├─────────────────────────────────────────────────────────────────┤ │
│ │ Python API (tiktoken package) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Rust Core (tiktoken-rs) │ │
│ │ • Regex-based pre-tokenization │ │
│ │ • BPE merge application │ │
│ │ • Byte-level encoding │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Encoding Files (.tiktoken) │ │
│ │ • Vocabulary mappings │ │
│ │ • Merge rules │ │
│ │ • Special tokens │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
tiktoken Encodings
tiktoken provides several pre-built encodings for different model families:
┌─────────────────────────────────────────────────────────────────────────┐
│ TIKTOKEN ENCODINGS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ENCODING VOCAB SIZE MODELS │
│ ───────────────────────────────────────────────────────────────────── │
│ gpt2 50,257 GPT-2 │
│ r50k_base 50,257 GPT-3 base models (davinci, curie) │
│ p50k_base 50,281 text-davinci-002/003, code-davinci-002 │
│ p50k_edit 50,281 text-davinci-edit-001 │
│ cl100k_base 100,256 GPT-3.5-turbo, GPT-4, text-embedding │
│ o200k_base 200,019 GPT-4o, GPT-4o-mini │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EVOLUTION: │
│ ────────── │
│ │
│ gpt2 (2019) │
│ • Original GPT-2 tokenizer │
│ • 50K vocabulary │
│ • Basic byte-level BPE │
│ │
│ cl100k_base (2022) │
│ • Doubled vocabulary to 100K │
│ • Better multilingual support │
│ • Improved code handling │
│ • Used by GPT-4 │
│ │
│ o200k_base (2024) │
│ • Doubled again to 200K │
│ • Optimized for GPT-4o efficiency │
│ • ~10-15% fewer tokens for typical text │
│ • Much better non-English efficiency │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Using tiktoken
import tiktoken
# Get encoding for a specific model
enc = tiktoken.encoding_for_model("gpt-4")
# Or by encoding name
enc = tiktoken.get_encoding("cl100k_base")
# Basic encoding/decoding
text = "Hello, world! How are you?"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
# Text: Hello, world! How are you?
# Tokens: [9906, 11, 1917, 0, 2650, 527, 499, 30]
# Token count: 8
# Decode back to text
decoded = enc.decode(tokens)
print(f"Decoded: {decoded}")
# Decoded: Hello, world! How are you?
# Decode individual tokens (useful for debugging)
for token_id in tokens:
    token_bytes = enc.decode_single_token_bytes(token_id)
    print(f"  {token_id} -> {token_bytes} -> {token_bytes.decode('utf-8', errors='replace')}")
# 9906 -> b'Hello' -> Hello
# 11 -> b',' -> ,
# 1917 -> b' world' -> world
# ...
tiktoken with Special Tokens
import tiktoken
# cl100k_base special tokens
enc = tiktoken.get_encoding("cl100k_base")
# Default special tokens are NOT encoded
text_with_special = "Hello <|endoftext|> World"
tokens = enc.encode(text_with_special)
# Encodes "<|endoftext|>" as regular text tokens!
# To handle special tokens, use allowed_special or disallowed_special
tokens = enc.encode(
    text_with_special,
    allowed_special={"<|endoftext|>"}
)
# Now <|endoftext|> becomes token 100257

# Allow all special tokens
tokens = enc.encode(
    text_with_special,
    allowed_special="all"
)

# Disallow specific patterns (raises error if found)
try:
    tokens = enc.encode(
        text_with_special,
        disallowed_special={"<|endoftext|>"}
    )
except ValueError as e:
    print(f"Error: {e}")
Creating Custom tiktoken Encodings
import tiktoken
from tiktoken import Encoding
# Create a custom encoding with additional special tokens
cl100k = tiktoken.get_encoding("cl100k_base")
# Add custom special tokens for your application
custom_special_tokens = {
    "<|system|>": 100264,
    "<|user|>": 100265,
    "<|assistant|>": 100266,
    "<|tool_call|>": 100267,
    "<|tool_result|>": 100268,
}

# Merge with existing special tokens
all_special_tokens = {
    **cl100k._special_tokens,
    **custom_special_tokens
}

# Create new encoding
custom_enc = Encoding(
    name="custom_cl100k",
    pat_str=cl100k._pat_str,                  # Same regex pattern
    mergeable_ranks=cl100k._mergeable_ranks,  # Same BPE merges
    special_tokens=all_special_tokens
)
# Use custom encoding
text = "<|system|>You are helpful.<|user|>Hello!"
tokens = custom_enc.encode(text, allowed_special="all")
print(tokens)
# [100264, 2675, 527, 11190, 13, 100265, 9906, 0]
tiktoken Regex Patterns
A central piece of tiktoken's pipeline is its regex-based pre-tokenization:
┌─────────────────────────────────────────────────────────────────────────┐
│ TIKTOKEN PRE-TOKENIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PRE-TOKENIZATION REGEX (cl100k_base): │
│ ───────────────────────────────────── │
│ │
│ The regex splits text into chunks BEFORE BPE: │
│ │
│ (?i:'s|'t|'re|'ve|'m|'ll|'d) │
│ |[^\r\n\p{L}\p{N}]?\p{L}+ │
│ |\p{N}{1,3} │
│ | ?[^\s\p{L}\p{N}]+[\r\n]* │
│ |\s*[\r\n]+ │
│ |\s+(?!\S) │
│ |\s+ │
│ │
│ WHAT THIS DOES: │
│ ──────────────── │
│ │
│ Pattern Matches │
│ ───────────────────────────────────────────────────────────────────── │
│ (?i:'s|'t|'re|'ve|'m|'ll|'d) Contractions: "don't" → "don", "'t" │
│ [^\r\n\p{L}\p{N}]?\p{L}+ Words with optional prefix │
│ \p{N}{1,3} Numbers in groups of 1-3 digits │
│ ?[^\s\p{L}\p{N}]+[\r\n]* Punctuation and symbols │
│ \s*[\r\n]+ Newlines (with leading space) │
│ \s+(?!\S) Trailing whitespace │
│ \s+ Other whitespace │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: │
│ ──────── │
│ │
│ Input: "Hello, world! I can't wait for 2024." │
│ │
│ Pre-tokenization splits: │
│ ["Hello", ",", " world", "!", " I", " can", "'t", " wait", │
│ " for", " ", "202", "4", "."] │
│ │
│ Then BPE is applied to each chunk independently. │
│ This prevents merges across word boundaries. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY PRE-TOKENIZE? │
│ ───────────────── │
│ │
│ 1. Prevents weird merges: " the" won't merge with "n" to form "then" │
│ 2. Consistent handling of contractions │
│ 3. Numbers split into manageable chunks (202, 4 not 2024) │
│ 4. Improves tokenization quality and consistency │
│ │
└─────────────────────────────────────────────────────────────────────────┘
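You can reproduce this pre-tokenization step on its own with the third-party regex module (the standard-library re lacks \p{...} classes); the pattern below is the cl100k_base pattern quoted above:
import regex  # pip install regex

PAT = (r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"
       r"|[^\r\n\p{L}\p{N}]?\p{L}+"
       r"|\p{N}{1,3}"
       r"| ?[^\s\p{L}\p{N}]+[\r\n]*"
       r"|\s*[\r\n]+"
       r"|\s+(?!\S)"
       r"|\s+")

text = "Hello, world! I can't wait for 2024."
print(regex.findall(PAT, text))
# ['Hello', ',', ' world', '!', ' I', ' can', "'t", ' wait', ' for', ' ', '202', '4', '.']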
tiktoken Performance Optimization
import tiktoken
import time
enc = tiktoken.get_encoding("cl100k_base")
# Batch encoding for better performance
texts = ["Hello world"] * 10000
# Method 1: Loop (slower)
start = time.time()
tokens_list = [enc.encode(text) for text in texts]
print(f"Loop: {time.time() - start:.3f}s")
# Method 2: encode_batch (faster, uses parallelism)
start = time.time()
tokens_list = enc.encode_batch(texts)
print(f"Batch: {time.time() - start:.3f}s")
# Method 3: encode_batch with num_threads
start = time.time()
tokens_list = enc.encode_batch(texts, num_threads=8)
print(f"Batch (8 threads): {time.time() - start:.3f}s")
# Typical results:
# Loop: 0.45s
# Batch: 0.12s
# Batch (8 threads): 0.04s
Part IX: HuggingFace Tokenizers
The Fast Tokenizers Library
HuggingFace's tokenizers library provides a unified, high-performance interface for all tokenization algorithms:
┌─────────────────────────────────────────────────────────────────────────┐
│ HUGGINGFACE TOKENIZERS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ARCHITECTURE: │
│ ───────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Tokenization Pipeline │ │
│ ├─────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ Input Text │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────┐ │ │
│ │ │ Normalizer │ Unicode normalization, lowercase, etc. │ │
│ │ └─────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────┐ │ │
│ │ │PreTokenizer │ Split into words/chunks │ │
│ │ └─────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────┐ │ │
│ │ │ Model │ BPE / WordPiece / Unigram │ │
│ │ └─────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────┐ │ │
│ │ │PostProcessor│ Add special tokens ([CLS], [SEP]) │ │
│ │ └─────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────┐ │ │
│ │ │ Decoder │ Convert tokens back to string │ │
│ │ └─────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Output Encoding │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ KEY FEATURES: │
│ ───────────── │
│ │
│ • Written in Rust for speed │
│ • Unified API for all algorithms (BPE, WordPiece, Unigram) │
│ • Training from scratch │
│ • Customizable pipeline components │
│ • Offset tracking (character positions) │
│ • Batched encoding with parallelism │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Building a Tokenizer from Scratch
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece, Unigram
from tokenizers.trainers import BpeTrainer, WordPieceTrainer, UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace, ByteLevel
from tokenizers.normalizers import NFD, Lowercase, StripAccents, Sequence
from tokenizers.processors import TemplateProcessing
# ═══════════════════════════════════════════════════════════════════════
# OPTION 1: BPE Tokenizer (like GPT-2)
# ═══════════════════════════════════════════════════════════════════════
# Initialize with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Set up the pipeline
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=True
)
# Train on files
files = ["wiki.txt", "books.txt"]
tokenizer.train(files, trainer)
# Or train on iterator
def batch_iterator(dataset):
    for i in range(0, len(dataset), 1000):
        yield dataset[i:i + 1000]["text"]
tokenizer.train_from_iterator(batch_iterator(dataset), trainer)
# Save
tokenizer.save("my-bpe-tokenizer.json")
# ═══════════════════════════════════════════════════════════════════════
# OPTION 2: WordPiece Tokenizer (like BERT)
# ═══════════════════════════════════════════════════════════════════════
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)
tokenizer.train(files, trainer)
# ═══════════════════════════════════════════════════════════════════════
# OPTION 3: Unigram Tokenizer (like T5/ALBERT)
# ═══════════════════════════════════════════════════════════════════════
tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = Sequence([NFD()])
tokenizer.pre_tokenizer = Whitespace()
trainer = UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<pad>", "<eos>", "<unk>"],
    unk_token="<unk>"
)
tokenizer.train(files, trainer)
Pre-Tokenizers in Detail
┌─────────────────────────────────────────────────────────────────────────┐
│ PRE-TOKENIZERS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PRE-TOKENIZER DESCRIPTION EXAMPLE │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ Whitespace Split on whitespace "Hello world" │
│ → ["Hello", "world"] │
│ │
│ WhitespaceSplit Split, keep whitespace "Hello world" │
│ → ["Hello", " ", "world"]│
│ │
│ Punctuation Split on punctuation "Hello, world!" │
│ → ["Hello", ",", "world", │
│ "!"] │
│ │
│ ByteLevel Convert to bytes "Hello" → bytes │
│ (GPT-2 style) Ġ prefix for spaces │
│ │
│ Metaspace SentencePiece-like " Hello" → "▁Hello" │
│ space handling │
│ │
│ CharDelimiterSplit Split on specific char "a|b|c" (delim="|") │
│ → ["a", "b", "c"] │
│ │
│ Digits Split digits "test123" → ["test", │
│ "1", "2", "3"]│
│ │
│ Split Regex-based split Custom patterns │
│ │
│ Sequence Chain multiple Combine any of above │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE COMBINATIONS: │
│ ───────────────────── │
│ │
│ # GPT-2 style (byte-level) │
│ pre_tokenizer = ByteLevel(add_prefix_space=True) │
│ │
│ # BERT style (whitespace + punctuation) │
│ pre_tokenizer = Sequence([Whitespace(), Punctuation()]) │
│ │
│ # SentencePiece style │
│ pre_tokenizer = Metaspace(replacement="▁", add_prefix_space=True) │
│ │
│ # Code-aware (split on digits and punctuation) │
│ pre_tokenizer = Sequence([ │
│ Whitespace(), │
│ Punctuation(), │
│ Digits(individual_digits=True) │
│ ]) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
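Each pre-tokenizer can be inspected in isolation via its pre_tokenize_str helper, which returns the chunks plus their character offsets; a quick sketch:
from tokenizers.pre_tokenizers import Whitespace, Digits, Sequence

text = "Hello, world! test123"

print(Whitespace().pre_tokenize_str(text))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13)), ('test123', (14, 21))]

print(Digits(individual_digits=True).pre_tokenize_str("test123"))
# [('test', (0, 4)), ('1', (4, 5)), ('2', (5, 6)), ('3', (6, 7))]

seq = Sequence([Whitespace(), Digits(individual_digits=True)])
print(seq.pre_tokenize_str(text))
# words, punctuation, and individual digits are now all separate chunks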
Post-Processors for Special Tokens
from tokenizers.processors import TemplateProcessing, BertProcessing
# BERT-style: [CLS] ... [SEP] for single, [CLS] ... [SEP] ... [SEP] for pairs
tokenizer.post_processor = BertProcessing(
    sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
    cls=("[CLS]", tokenizer.token_to_id("[CLS]"))
)

# Custom template
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]"))
    ]
)
# LLaMA-style (no special tokens added by default)
# Just encode as-is, special tokens handled in chat template
Encoding with Offset Tracking
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Encode with offset tracking
encoding = tokenizer.encode("Hello, world!")
# Get tokens
print(encoding.tokens)
# ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
# Get token IDs
print(encoding.ids)
# [101, 7592, 1010, 2088, 999, 102]
# Get attention mask
print(encoding.attention_mask)
# [1, 1, 1, 1, 1, 1]
# Get offsets (character positions in original text)
print(encoding.offsets)
# [(0, 0), (0, 5), (5, 6), (7, 12), (12, 13), (0, 0)]
# [CLS] Hello , world ! [SEP]
# Use offsets to map back to original text
text = "Hello, world!"
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    if start != end:  # Skip special tokens
        print(f"{token}: '{text[start:end]}'")
# hello: 'Hello'
# ,: ','
# world: 'world'
# !: '!'
Batch Encoding with Padding and Truncation
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Enable padding and truncation
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=128  # Pad to fixed length, or None for dynamic
)
tokenizer.enable_truncation(max_length=128)

# Batch encode
texts = [
    "Short text",
    "This is a much longer text that will be truncated if necessary",
    "Medium length"
]
encodings = tokenizer.encode_batch(texts)

for enc in encodings:
    print(f"Length: {len(enc.ids)}, Tokens: {enc.tokens[:5]}...")
# Length: 128, Tokens: ['[CLS]', 'short', 'text', '[SEP]', '[PAD]']...
# Length: 128, Tokens: ['[CLS]', 'this', 'is', 'a', 'much']...
# Length: 128, Tokens: ['[CLS]', 'medium', 'length', '[SEP]', '[PAD]']...
Part X: Unicode and Byte-Level BPE
The Unicode Challenge
┌─────────────────────────────────────────────────────────────────────────┐
│ UNICODE TOKENIZATION CHALLENGES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE PROBLEM: │
│ ──────────── │
│ │
│ Unicode has 150,000+ characters across scripts: │
│ • Latin: A-Z, a-z, àáâãäå... │
│ • CJK: 你好世界 (Chinese), こんにちは (Japanese), 안녕 (Korean) │
│ • Arabic: مرحبا │
│ • Cyrillic: Привет │
│ • Emojis: 😀🎉🚀💻 │
│ • Mathematical: ∑∫∂∇ │
│ │
│ Character-level vocabulary would be HUGE! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION: BYTE-LEVEL BPE │
│ ───────────────────────── │
│ │
│ Convert everything to UTF-8 bytes first: │
│ • Only 256 possible byte values (0x00-0xFF) │
│ • Any character can be represented │
│ • Base vocabulary is just 256 tokens! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ UTF-8 ENCODING: │
│ ─────────────── │
│ │
│ Character UTF-8 Bytes Num Bytes │
│ ───────────────────────────────────────────────────────────────────── │
│ 'A' 0x41 1 │
│ 'ñ' 0xC3 0xB1 2 │
│ '中' 0xE4 0xB8 0xAD 3 │
│ '😀' 0xF0 0x9F 0x98 0x80 4 │
│ │
│ ASCII (0-127): 1 byte │
│ Latin extended: 2 bytes │
│ CJK: 3 bytes │
│ Emojis: 4 bytes │
│ │
└─────────────────────────────────────────────────────────────────────────┘
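The byte counts in that table are easy to verify directly in Python (output shown as comments):
for ch in ["A", "ñ", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} bytes -> {encoded.hex(' ')}")
# 'A': 1 bytes -> 41
# 'ñ': 2 bytes -> c3 b1
# '中': 3 bytes -> e4 b8 ad
# '😀': 4 bytes -> f0 9f 98 80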
GPT-2's Byte-to-Character Mapping
GPT-2 introduced a clever trick to make byte sequences readable:
# GPT-2's byte-to-character mapping
# Maps bytes 0-255 to printable Unicode characters
def bytes_to_unicode():
    """
    Returns a mapping from bytes to Unicode characters.
    Avoids control characters and whitespace issues.
    """
    # Printable ASCII characters
    bs = list(range(ord("!"), ord("~") + 1))    # 33-126
    bs += list(range(ord("¡"), ord("¬") + 1))   # 161-172
    bs += list(range(ord("®"), ord("ÿ") + 1))   # 174-255
    cs = bs[:]
    n = 0
    # Map remaining bytes (0-32, 127-160, 173) to higher Unicode
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))
byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}
# Example: Space (byte 32) maps to 'Ġ' (Unicode 288)
print(byte_encoder[32]) # 'Ġ'
# This is why GPT-2 tokens look like:
# "Ġthe" = " the" (space + the)
# "Ġhello" = " hello"
Handling Multilingual Text
┌─────────────────────────────────────────────────────────────────────────┐
│ MULTILINGUAL TOKENIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TOKEN EFFICIENCY BY LANGUAGE (approximate): │
│ ─────────────────────────────────────────── │
│ │
│ Same meaning, different token counts: │
│ │
│ Language Text GPT-4 LLaMA 3 Gemini │
│ (cl100k) (128K) (256K) │
│ ───────────────────────────────────────────────────────────────────── │
│ English "Hello world" 2 2 2 │
│ Spanish "Hola mundo" 3 2 2 │
│ French "Bonjour monde" 3 2 2 │
│ German "Hallo Welt" 3 2 2 │
│ Russian "Привет мир" 5 4 2 │
│ Chinese "你好世界" 4 4 2 │
│ Japanese "こんにちは" 5 5 3 │
│ Arabic "مرحبا بالعالم" 8 7 4 │
│ Hindi "नमस्ते दुनिया" 12 10 5 │
│ │
│ Non-English typically uses 2-5× more tokens! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ WHY THE DISPARITY? │
│ ────────────────── │
│ │
│ 1. Training data is English-heavy │
│ • English patterns merged more aggressively │
│ • "the" is one token, but "的" might not be │
│ │
│ 2. UTF-8 byte count varies │
│ • ASCII: 1 byte per character │
│ • CJK: 3 bytes per character │
│ • More bytes = potentially more tokens │
│ │
│ 3. Vocabulary allocation │
│ • 50K vocab mostly English subwords │
│ • Non-English shares remaining capacity │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTIONS: │
│ ────────── │
│ │
│ 1. Larger vocabularies (100K-256K) │
│ • More room for non-English tokens │
│ • GPT-4o's o200k_base is much better │
│ │
│ 2. Balanced training data │
│ • Include more non-English text in tokenizer training │
│ • Gemini trained on multilingual data │
│ │
│ 3. Dedicated multilingual tokenizers │
│ • XLM-RoBERTa, mT5 optimized for many languages │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Unicode Normalization
import unicodedata
# Different ways to represent the same character
s1 = "café" # 'é' as single character (U+00E9)
s2 = "café" # 'e' + combining accent (U+0065 U+0301)
print(len(s1), len(s2)) # 4, 5 - different lengths!
print(s1 == s2) # False!
# Normalization forms
nfc = unicodedata.normalize('NFC', s2) # Composed: é
nfd = unicodedata.normalize('NFD', s1) # Decomposed: e + ́
print(s1 == nfc) # True
# Why this matters for tokenization:
# Without normalization, "café" might tokenize differently
# depending on how the accent was input!
# SentencePiece does NFKC normalization by default
# tiktoken does NOT normalize: byte-level BPE can encode either form,
# but composed and decomposed input produce different token sequences
# Recommendation: apply NFC/NFKC normalization before tokenization if your
# inputs may mix forms; byte-level BPE never fails, but it won't unify them
Emoji Tokenization
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
# Single emoji
print(enc.encode("😀"))
# [76460] - one token
# Emoji with skin tone modifier
print(enc.encode("👋🏽"))
# [54959, 243, 234] - multiple tokens!
# 👋 (wave) + 🏽 (medium skin tone) = 2 Unicode code points
# Flag emoji (regional indicators)
print(enc.encode("🇺🇸"))
# [155, 232, 161, 248] - 4 tokens
# Flags are two regional indicator letters
# ZWJ sequences (complex emoji)
print(enc.encode("👨👩👧👦")) # Family emoji
# Many tokens - combines man + woman + girl + boy with ZWJ
# Emoji are expensive! A family emoji can be 10+ tokens.
# Token counts for various emoji (cl100k_base):
emoji_tokens = {
    "😀": 1,
    "👋": 1,
    "👋🏽": 3,
    "🇺🇸": 4,
    "👨👩👧👦": 11,
    "🏳️🌈": 6,
}
Part XI: Tokenization Benchmarks
Token Efficiency Comparison
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKEN EFFICIENCY BENCHMARKS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ENGLISH TEXT (Wikipedia sample): │
│ ───────────────────────────────── │
│ │
│ Tokenizer Tokens/1K chars Tokens/1K words Compression │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 (gpt2) 280 1,340 3.57× │
│ GPT-3.5 (cl100k) 250 1,200 4.00× │
│ GPT-4o (o200k) 220 1,050 4.55× │
│ LLaMA 2 290 1,380 3.45× │
│ LLaMA 3 230 1,100 4.35× │
│ Mistral 285 1,360 3.51× │
│ Claude 3 245 1,170 4.08× │
│ │
│ Compression = chars/tokens (higher = more efficient) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PYTHON CODE: │
│ ──────────── │
│ │
│ Tokenizer Tokens/1K chars vs English │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 350 1.25× more │
│ GPT-3.5 (cl100k) 300 1.20× more │
│ GPT-4o (o200k) 260 1.18× more │
│ LLaMA 3 250 1.09× more │
│ CodeLlama 230 1.00× (optimized) │
│ DeepSeek-Coder 220 0.96× (better!) │
│ │
│ Code-specialized models are more efficient on code. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ CHINESE TEXT: │
│ ───────────── │
│ │
│ Tokenizer Tokens/1K chars vs English │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 800 2.86× more │
│ GPT-3.5 (cl100k) 550 2.20× more │
│ GPT-4o (o200k) 380 1.73× more │
│ Qwen 250 1.00× (optimized) │
│ Yi 260 1.04× │
│ Baichuan 240 0.96× (better) │
│ │
│ Chinese-optimized models achieve parity with English. │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ JSON/STRUCTURED DATA: │
│ ───────────────────── │
│ │
│ Tokenizer Tokens/1K chars Notes │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-2 320 Brackets, quotes costly │
│ GPT-3.5 (cl100k) 280 Better structural tokens │
│ GPT-4o (o200k) 240 "{" often single token │
│ │
│ JSON typically 10-20% more tokens than plain text. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Cost Implications
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENIZATION COST IMPACT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ API PRICING (per 1M tokens): │
│ ──────────────────────────── │
│ │
│ Model Input Output Effective $/1K chars │
│ ───────────────────────────────────────────────────────────────────── │
│ GPT-4o $2.50 $10.00 ~$0.0006 (English) │
│ GPT-4o-mini $0.15 $0.60 ~$0.00003 (English) │
│ Claude 3.5 Sonnet $3.00 $15.00 ~$0.0007 (English) │
│ Claude 3 Haiku $0.25 $1.25 ~$0.00006 (English) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LANGUAGE COST MULTIPLIER: │
│ ───────────────────────── │
│ │
│ Processing 10,000 characters of text: │
│ │
│ Language Tokens (GPT-4o) Cost (at $2.50/M input) │
│ ───────────────────────────────────────────────────────────────────── │
│ English 2,200 $0.0055 │
│ Spanish 2,600 $0.0065 (1.2×) │
│ Chinese 3,800 $0.0095 (1.7×) │
│ Arabic 5,200 $0.0130 (2.4×) │
│ Hindi 6,500 $0.0163 (3.0×) │
│ │
│ Non-English users pay significantly more for the same content! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ OPTIMIZATION STRATEGIES: │
│ ──────────────────────── │
│ │
│ 1. Use GPT-4o (o200k) over GPT-4 (cl100k) for multilingual │
│ • ~20% fewer tokens for non-English │
│ │
│ 2. For Chinese/Japanese: Consider Qwen or Yi models │
│ • Purpose-built tokenizers │
│ • Same cost as English │
│ │
│ 3. Compress prompts where possible │
│ • Remove unnecessary whitespace │
│ • Use abbreviations in system prompts │
│ │
│ 4. Cache and reuse common prefixes │
│ • OpenAI/Anthropic offer prefix caching discounts │
│ │
└─────────────────────────────────────────────────────────────────────────┘
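A small sketch that turns token counts into dollar estimates; the per-token prices are the ones from the table above and change over time, and it assumes a tiktoken version that maps the gpt-4o model names:
import tiktoken

# Prices per 1M input tokens, from the table above (placeholders; check current pricing)
PRICE_PER_M_INPUT = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def estimate_input_cost(text: str, model: str = "gpt-4o") -> float:
    """Rough input-side cost estimate for a prompt, ignoring output tokens."""
    enc = tiktoken.encoding_for_model(model)
    n_tokens = len(enc.encode(text))
    return n_tokens * PRICE_PER_M_INPUT[model] / 1_000_000

prompt = "Summarize the following document:\n" + "lorem ipsum " * 500
print(f"{estimate_input_cost(prompt):.6f} USD for {len(prompt)} characters")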
Benchmark Code
import tiktoken
from transformers import AutoTokenizer
import time
def benchmark_tokenizer(name, tokenizer, texts, encode_fn):
"""Benchmark a tokenizer on given texts."""
start = time.time()
total_chars = sum(len(t) for t in texts)
total_tokens = 0
for text in texts:
tokens = encode_fn(text)
total_tokens += len(tokens)
elapsed = time.time() - start
return {
"name": name,
"total_chars": total_chars,
"total_tokens": total_tokens,
"tokens_per_1k_chars": (total_tokens / total_chars) * 1000,
"compression_ratio": total_chars / total_tokens,
"time_seconds": elapsed,
"chars_per_second": total_chars / elapsed
}
# Sample texts
english_texts = [
"The quick brown fox jumps over the lazy dog.",
"Machine learning is transforming how we build software.",
# ... more samples
] * 100
chinese_texts = [
"机器学习正在改变我们构建软件的方式。",
"人工智能将在未来几年内带来巨大变革。",
# ... more samples
] * 100
code_texts = [
"def fibonacci(n):\n if n <= 1:\n return n\n return fibonacci(n-1) + fibonacci(n-2)",
"class DataProcessor:\n def __init__(self, config):\n self.config = config",
# ... more samples
] * 100
# Benchmark tiktoken
cl100k = tiktoken.get_encoding("cl100k_base")
o200k = tiktoken.get_encoding("o200k_base")
results = []
results.append(benchmark_tokenizer(
    "GPT-3.5/4 (cl100k)", english_texts, cl100k.encode
))
results.append(benchmark_tokenizer(
    "GPT-4o (o200k)", english_texts, o200k.encode
))
# Benchmark HuggingFace tokenizers (Llama-2 is a gated repo: accept the license
# on the Hub and authenticate with an access token first)
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
results.append(benchmark_tokenizer(
    "LLaMA 2", english_texts,
    lambda t: llama_tok.encode(t, add_special_tokens=False)  # exclude BOS from counts
))
# Print results
print(f"{'Tokenizer':<25} {'Tokens/1K chars':<18} {'Compression':<12} {'Speed':<15}")
print("-" * 70)
for r in results:
print(f"{r['name']:<25} {r['tokens_per_1k_chars']:<18.1f} {r['compression_ratio']:<12.2f}x {r['chars_per_second']:<15,.0f}")
Part XII: Training Your Own Tokenizer
When to Train Custom Tokenizers
┌─────────────────────────────────────────────────────────────────────────┐
│ WHEN TO TRAIN CUSTOM TOKENIZERS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ TRAIN CUSTOM TOKENIZER WHEN: │
│ ──────────────────────────── │
│ │
│ ✓ Training a new model from scratch │
│ ✓ Domain has very specialized vocabulary │
│ • Medical: "electroencephalography", "bronchopneumonia" │
│ • Legal: "indemnification", "notwithstanding" │
│ • Scientific: "CRISPR-Cas9", "immunohistochemistry" │
│ ✓ Non-English language dominates your use case │
│ ✓ Highly specialized format (DNA sequences, chemical formulas) │
│ ✓ Need maximum efficiency for specific domain │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ DON'T TRAIN CUSTOM TOKENIZER WHEN: │
│ ─────────────────────────────────── │
│ │
│ ✗ Using pre-trained models (MUST use their tokenizer) │
│ ✗ General-purpose text (existing tokenizers are fine) │
│ ✗ Small fine-tuning dataset │
│ ✗ Want to leverage transfer learning │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ HYBRID APPROACH: │
│ ──────────────── │
│ │
│ Extend existing tokenizer with domain terms: │
│ • Add special tokens for domain-specific terms │
│ • Keep base vocabulary for transfer learning │
│ • Requires additional embedding training │
│ │
└─────────────────────────────────────────────────────────────────────────┘
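The hybrid approach can be sketched with the Hugging Face transformers API: add domain terms as new tokens, then resize the embedding matrix so the model has rows for them (those rows start untrained and need further fine-tuning). The base model and term list below are placeholders:

from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
domain_terms = ["electroencephalography", "immunohistochemistry", "CRISPR-Cas9"]
num_added = tokenizer.add_tokens(domain_terms)   # added as whole, unsplittable tokens
model.resize_token_embeddings(len(tokenizer))    # grow the embedding table to match
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
print(tokenizer.tokenize("CRISPR-Cas9 screening"))  # the new term surfaces as one token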
Complete Training Pipeline
from tokenizers import (
    Tokenizer, Regex, models, pre_tokenizers, decoders,
    trainers, processors, normalizers
)
def train_bpe_tokenizer(
corpus_files: list,
vocab_size: int = 32000,
min_frequency: int = 2,
output_path: str = "tokenizer.json"
):
"""
Train a production-quality BPE tokenizer.
Args:
corpus_files: List of paths to training text files
vocab_size: Target vocabulary size
min_frequency: Minimum token frequency to include
output_path: Where to save the trained tokenizer
"""
# ═══════════════════════════════════════════════════════════════════
# Step 1: Initialize the model
# ═══════════════════════════════════════════════════════════════════
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
# ═══════════════════════════════════════════════════════════════════
# Step 2: Set up normalizer (text preprocessing)
# ═══════════════════════════════════════════════════════════════════
    tokenizer.normalizer = normalizers.Sequence([
        normalizers.NFC(),                         # Unicode normalization
        normalizers.Replace(Regex(r"\s+"), " "),   # Collapse whitespace (Regex, not a literal string)
        normalizers.Strip(),                       # Trim leading/trailing whitespace
    ])
# ═══════════════════════════════════════════════════════════════════
# Step 3: Set up pre-tokenizer (how to split before BPE)
# ═══════════════════════════════════════════════════════════════════
# GPT-2 style: byte-level with space prefix
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
add_prefix_space=True,
use_regex=True
)
# ═══════════════════════════════════════════════════════════════════
# Step 4: Configure trainer
# ═══════════════════════════════════════════════════════════════════
special_tokens = [
"<unk>", # Unknown token
"<s>", # Beginning of sequence
"</s>", # End of sequence
"<pad>", # Padding
"<mask>", # For masked language modeling
# Add custom special tokens
"<|system|>",
"<|user|>",
"<|assistant|>",
]
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
min_frequency=min_frequency,
special_tokens=special_tokens,
show_progress=True,
initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
# ═══════════════════════════════════════════════════════════════════
# Step 5: Train
# ═══════════════════════════════════════════════════════════════════
print(f"Training on {len(corpus_files)} files...")
tokenizer.train(corpus_files, trainer)
# ═══════════════════════════════════════════════════════════════════
# Step 6: Set up decoder (for converting back to text)
# ═══════════════════════════════════════════════════════════════════
tokenizer.decoder = decoders.ByteLevel()
# ═══════════════════════════════════════════════════════════════════
# Step 7: Set up post-processor (add special tokens)
# ═══════════════════════════════════════════════════════════════════
tokenizer.post_processor = processors.TemplateProcessing(
single="<s> $A </s>",
pair="<s> $A </s> $B </s>",
special_tokens=[
("<s>", tokenizer.token_to_id("<s>")),
("</s>", tokenizer.token_to_id("</s>")),
]
)
# ═══════════════════════════════════════════════════════════════════
# Step 8: Save
# ═══════════════════════════════════════════════════════════════════
tokenizer.save(output_path)
print(f"Tokenizer saved to {output_path}")
# Print statistics
print(f"\nTokenizer Statistics:")
print(f" Vocabulary size: {tokenizer.get_vocab_size()}")
print(f" Special tokens: {special_tokens}")
return tokenizer
def train_from_dataset(dataset, output_path: str, vocab_size: int = 32000):
"""
Train tokenizer from a HuggingFace dataset.
"""
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
special_tokens=["<unk>", "<s>", "</s>", "<pad>"]
)
# Train from iterator (memory efficient)
def batch_iterator(batch_size=1000):
for i in range(0, len(dataset), batch_size):
yield dataset[i:i+batch_size]["text"]
tokenizer.train_from_iterator(batch_iterator(), trainer)
tokenizer.save(output_path)
return tokenizer
# Example usage
if __name__ == "__main__":
# Train on local files
corpus_files = [
"data/wikipedia.txt",
"data/books.txt",
"data/code.txt"
]
tokenizer = train_bpe_tokenizer(
corpus_files=corpus_files,
vocab_size=32000,
min_frequency=2,
output_path="my_tokenizer.json"
)
# Test the tokenizer
text = "Hello, world! This is a test."
encoding = tokenizer.encode(text)
print(f"\nTest encoding:")
print(f" Input: {text}")
print(f" Tokens: {encoding.tokens}")
print(f" IDs: {encoding.ids}")
Vocabulary Size Selection
┌─────────────────────────────────────────────────────────────────────────┐
│ VOCABULARY SIZE GUIDE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ FACTORS TO CONSIDER: │
│ ──────────────────── │
│ │
│ 1. Model Size │
│ • Embedding table = vocab_size × embedding_dim × 4 bytes │
│ • 32K × 4096 × 4 = 512 MB │
│ • 128K × 4096 × 4 = 2 GB │
│ • For small models (<1B), keep vocab smaller │
│ │
│ 2. Training Data Size │
│ • Small data: smaller vocab (rare tokens won't learn well) │
│ • Large data: larger vocab (can afford more tokens) │
│ │
│ 3. Language Coverage │
│ • English only: 32K-50K sufficient │
│ • Multilingual: 100K-256K recommended │
│ │
│ 4. Domain Specificity │
│ • General: standard vocab sizes │
│ • Specialized (code, medical): may need larger │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ RECOMMENDED SIZES: │
│ ────────────────── │
│ │
│ Use Case Vocab Size Rationale │
│ ───────────────────────────────────────────────────────────────────── │
│ Small model (<500M) 16K-32K Minimize embedding cost │
│ Medium model (1B-7B) 32K-64K Balance efficiency/size │
│ Large model (7B+) 64K-128K Efficiency matters more │
│ English-only 32K-50K Sufficient coverage │
│ Multilingual 100K-256K Cover multiple scripts │
│ Code-focused 32K-64K + FIM special tokens │
│ Code + multilingual 128K+ Llama 3 approach │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EMPIRICAL TUNING: │
│ ───────────────── │
│ │
│ 1. Train tokenizers with different vocab sizes │
│ 2. Measure tokens/character on held-out data │
│ 3. Plot compression ratio vs vocab size │
│ 4. Find knee of curve (diminishing returns) │
│ │
│ Typically: Beyond 64K, gains are marginal for English. │
│ Multilingual benefits from 100K+. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
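The empirical tuning loop above is easy to script with the tokenizers library: train at several vocabulary sizes and compare tokens per character on held-out text. A minimal sketch (file paths are placeholders):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
train_files = ["data/train.txt"]                              # placeholder path
held_out = open("data/heldout.txt", encoding="utf-8").read()  # placeholder path
for vocab_size in (16_000, 32_000, 64_000, 128_000):
    tok = Tokenizer(models.BPE(unk_token="<unk>"))
    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["<unk>"])
    tok.train(train_files, trainer)
    n_tokens = len(tok.encode(held_out).ids)
    print(f"vocab={vocab_size:>7,}  tokens/char={n_tokens / len(held_out):.3f}")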
Data Preparation for Tokenizer Training
from pathlib import Path
import random
def prepare_corpus(
input_paths: list,
output_path: str,
sample_ratio: float = 1.0,
min_length: int = 100,
max_length: int = 100000,
shuffle: bool = True
):
"""
Prepare and clean corpus for tokenizer training.
Key considerations:
- Balance across domains/languages
- Remove very short/long texts
- Deduplicate
- Sample if corpus too large
"""
all_texts = []
for input_path in input_paths:
path = Path(input_path)
if path.is_file():
with open(path, 'r', encoding='utf-8', errors='ignore') as f:
for line in f:
line = line.strip()
if min_length <= len(line) <= max_length:
all_texts.append(line)
elif path.is_dir():
for file_path in path.glob('**/*.txt'):
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
text = f.read().strip()
if min_length <= len(text) <= max_length:
all_texts.append(text)
print(f"Collected {len(all_texts)} texts")
# Sample if needed
if sample_ratio < 1.0:
sample_size = int(len(all_texts) * sample_ratio)
all_texts = random.sample(all_texts, sample_size)
print(f"Sampled {len(all_texts)} texts")
# Shuffle for better training
if shuffle:
random.shuffle(all_texts)
# Write to output
with open(output_path, 'w', encoding='utf-8') as f:
for text in all_texts:
f.write(text + '\n')
print(f"Written to {output_path}")
print(f"Total characters: {sum(len(t) for t in all_texts):,}")
return output_path
def balanced_multilingual_corpus(
language_files: dict,
output_path: str,
samples_per_language: int = 100000
):
"""
Create a balanced multilingual corpus.
Args:
language_files: {"en": "english.txt", "zh": "chinese.txt", ...}
samples_per_language: Max samples per language for balance
"""
all_texts = []
for lang, file_path in language_files.items():
texts = []
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if len(line) > 50: # Minimum length
texts.append(line)
# Sample for balance
if len(texts) > samples_per_language:
texts = random.sample(texts, samples_per_language)
print(f" {lang}: {len(texts)} samples")
all_texts.extend(texts)
random.shuffle(all_texts)
with open(output_path, 'w', encoding='utf-8') as f:
for text in all_texts:
f.write(text + '\n')
return output_path
Part XIII: Domain-Specific Tokenization
Medical/Scientific Tokenization
┌─────────────────────────────────────────────────────────────────────────┐
│ MEDICAL/SCIENTIFIC TOKENIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE PROBLEM: │
│ ──────────── │
│ │
│ Standard tokenizers fragment medical terms: │
│ │
│ "electroencephalography" │
│ GPT-4: ["elect", "ro", "ence", "phal", "ography"] (5 tokens) │
│ │
│ "immunohistochemistry" │
│ GPT-4: ["imm", "uno", "hist", "ochemistry"] (4 tokens) │
│ │
│ Fragmentation hurts model understanding of domain terms! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION: DOMAIN-AWARE TOKENIZER │
│ ───────────────────────────────── │
│ │
│ 1. Include medical/scientific text in training corpus │
│ 2. Use larger vocabulary to capture domain terms │
│ 3. Consider adding high-frequency terms as special tokens │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: PUBMED-TRAINED TOKENIZER │
│ ───────────────────────────────── │
│ │
│ "electroencephalography" │
│ PubMed tokenizer: ["electroencephalography"] (1 token!) │
│ │
│ "CRISPR-Cas9" │
│ Standard: ["CR", "ISP", "R", "-", "C", "as", "9"] (7 tokens) │
│ Domain: ["CRISPR", "-", "Cas9"] (3 tokens) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
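A quick way to quantify this fragmentation is to count tokens per domain term with a general-purpose tokenizer. A minimal sketch with cl100k_base (the term list is a small illustrative sample):

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
terms = ["electroencephalography", "immunohistochemistry",
         "bronchopneumonia", "CRISPR-Cas9"]
for term in terms:
    pieces = [enc.decode([t]) for t in enc.encode(term)]
    print(f"{term:25s} -> {len(pieces)} tokens: {pieces}")
avg = sum(len(enc.encode(t)) for t in terms) / len(terms)
print(f"Average tokens per term: {avg:.1f}")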
Code Tokenization
# Challenges with code tokenization
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
# Problem 1: Indentation creates many tokens
code = """def factorial(n):
if n <= 1:
return 1
return n * factorial(n - 1)"""
tokens = enc.encode(code)
print(f"Token count: {len(tokens)}") # More than you'd expect
# Problem 2: Variable names split awkwardly
print([enc.decode([t]) for t in enc.encode("getUserById")])
# e.g. ['get', 'User', 'By', 'Id'] - 4 tokens, loses semantic connection
print([enc.decode([t]) for t in enc.encode("calculate_total_price")])
# e.g. ['calculate', '_', 'total', '_', 'price'] - underscores are separate
# Problem 3: Numbers in code
print(enc.encode("0x1234ABCD")) # Hex literals
print(enc.encode("192.168.1.1")) # IP addresses
# Often split unpredictably
# ═══════════════════════════════════════════════════════════════════════
# SOLUTIONS FOR CODE TOKENIZATION
# ═══════════════════════════════════════════════════════════════════════
# Solution 1: Use code-optimized models
# StarCoder, CodeLlama, DeepSeek-Coder have better code tokenizers
# Solution 2: Pre-process code
def normalize_code_for_tokenization(code: str) -> str:
"""Normalize code to improve tokenization."""
# Convert tabs to spaces (consistent indentation)
code = code.replace('\t', ' ')
# Normalize line endings
code = code.replace('\r\n', '\n')
# Remove trailing whitespace
lines = [line.rstrip() for line in code.split('\n')]
code = '\n'.join(lines)
return code
# Solution 3: Train domain-specific tokenizer
# Include large code corpus in tokenizer training
Fill-in-the-Middle (FIM) Tokens
# FIM tokens for code completion
# Standard autoregressive: predict next token
# FIM: predict MIDDLE given prefix and suffix
# Example FIM format (StarCoder-style tokens shown; CodeLlama uses <PRE>/<SUF>/<MID>):
#
# Original code:
# def hello():
# print("Hello") <- want model to generate this
# return True
#
# FIM transformation:
# <fim_prefix>def hello():
# <fim_suffix>
# return True
# <fim_middle> print("Hello")
def convert_to_fim(code: str, cursor_pos: int) -> str:
"""Convert code to FIM format for training."""
prefix = code[:cursor_pos]
suffix = code[cursor_pos:]
# FIM format
return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
def fim_infill(model, tokenizer, prefix: str, suffix: str) -> str:
"""Use FIM to generate code between prefix and suffix."""
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(
        input_ids,
        max_new_tokens=256,
        # use convert_tokens_to_ids so a BOS token prepended by encode()
        # isn't mistaken for the stop token
        eos_token_id=tokenizer.convert_tokens_to_ids("<fim_suffix>"),
        pad_token_id=tokenizer.pad_token_id
    )
generated = tokenizer.decode(output[0], skip_special_tokens=False)
# Extract the middle part
middle_start = generated.find("<fim_middle>") + len("<fim_middle>")
middle_end = generated.find("<fim_suffix>", middle_start)
if middle_end == -1:
middle_end = len(generated)
return generated[middle_start:middle_end]
Chemical and Mathematical Notation
┌─────────────────────────────────────────────────────────────────────────┐
│ SPECIALIZED NOTATION TOKENIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CHEMICAL FORMULAS: │
│ ────────────────── │
│ │
│ Formula Standard Tokenizer Ideal │
│ ───────────────────────────────────────────────────────────────────── │
│ H2O ["H", "2", "O"] ["H2O"] │
│ C6H12O6 ["C", "6", "H", ...] ["C6H12O6"] │
│ NaCl ["Na", "Cl"] ["NaCl"] │
│ CH3COOH ["CH", "3", "CO", "OH"] ["CH3COOH"] │
│ │
│ SMILES (molecular representation): │
│ CC(=O)OC1=CC=CC=C1C(=O)O (Aspirin) │
│ Breaks into many small tokens, loses structure │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ MATHEMATICAL NOTATION: │
│ ────────────────────── │
│ │
│ LaTeX: $\sum_{i=1}^{n} x_i$ │
│ Standard: ["$", "\\", "sum", "_", "{", "i", "=", "1", ...] (many!) │
│ │
│ Greek letters: α, β, γ, θ, Σ, ∫ │
│ Often multi-token in standard tokenizers │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTIONS: │
│ ────────── │
│ │
│ 1. Add domain terms as special tokens │
│ 2. Use character-level models for chemistry (SELFIES, SMILES) │
│ 3. Train on domain-specific corpora │
│ 4. Consider multi-modal approaches (image of formula) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
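For SMILES strings specifically, a common workaround is to pre-tokenize with a chemistry-aware regex so atoms, bonds, and ring closures stay intact before any subword model is applied. A minimal sketch using an atom-level pattern of the kind widely used in the cheminformatics literature:

import re
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)
def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into atom/bond/ring-closure tokens."""
    return SMILES_PATTERN.findall(smiles)
print(tokenize_smiles("CC(=O)OC1=CC=CC=C1C(=O)O"))  # Aspirin, one symbol per token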
Part XIV: Multimodal Tokenization
Vision Tokenization
┌─────────────────────────────────────────────────────────────────────────┐
│ VISION TOKENIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ THE CHALLENGE: │
│ ────────────── │
│ │
│ Text: discrete tokens from finite vocabulary │
│ Images: continuous pixel values, high dimensionality │
│ │
│ 224×224 RGB image = 150,528 values │
│ Can't feed raw pixels to transformer! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION 1: PATCH EMBEDDINGS (ViT style) │
│ ───────────────────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Image Tokenization (ViT) │ │
│ ├─────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ Input Image (224×224) │ │
│ │ │ │ │
│ │ ▼ Split into patches │ │
│ │ │ │
│ │ ┌────┬────┬────┬────┐ │ │
│ │ │ P1 │ P2 │ P3 │ P4 │ 14×14 patches │ │
│ │ ├────┼────┼────┼────┤ (16×16 pixels each) │ │
│ │ │ P5 │ P6 │ P7 │ P8 │ │ │
│ │ ├────┼────┼────┼────┤ = 196 patches │ │
│ │ │... │... │... │... │ │ │
│ │ └────┴────┴────┴────┘ │ │
│ │ │ │ │
│ │ ▼ Linear projection │ │
│ │ │ │
│ │ Each patch → embedding vector (e.g., 768-dim) │ │
│ │ 196 patches → 196 "visual tokens" │ │
│ │ │ │ │
│ │ ▼ Add position embeddings │ │
│ │ │ │
│ │ [CLS] + 196 patch tokens → Transformer │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Token count: 196 (fixed) + special tokens │
│ Higher resolution → more tokens (quadratic!) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION 2: VISION ENCODER + PROJECTOR (LLaVA style) │
│ ───────────────────────────────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Image → CLIP ViT → [v1, v2, ..., v576] (576 visual tokens) │ │
│ │ │ │ │
│ │ ▼ MLP Projector │ │
│ │ │ │
│ │ [v1', v2', ..., v576'] (projected to LLM embedding space) │ │
│ │ │ │ │
│ │ ▼ Concatenate with text │ │
│ │ │ │
│ │ [BOS] + visual_tokens + text_tokens → LLM │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ LLaVA-1.5: 576 visual tokens per image │
│ Higher resolution variants: 1000+ tokens │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION 3: DISCRETE VISUAL TOKENS (VQGAN/VQVAE) │
│ ───────────────────────────────────────────────── │
│ │
│ Image → Encoder → Quantize to codebook → Discrete tokens │
│ │
│ Like text tokenization: finite vocabulary of visual "words" │
│ Used by: DALL-E, Parti, some video models │
│ │
│ Codebook size: typically 8192-16384 visual tokens │
│ Can generate images autoregressively like text! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2025 INNOVATIONS: │
│ ───────────────── │
│ │
│ TokenFlow (CVPR 2025): │
│ • Unified image tokenizer for understanding AND generation │
│ • Dual-codebook: semantic + pixel-level features │
│ • Decouples semantic and pixel learning while maintaining alignment │
│ │
│ Harmonizer (May 2025): │
│ • FusionQuantizer for heterogeneous signals (text, audio, video) │
│ • Unified tokenization across modalities │
│ │
│ Gemini 2.0 Token Counting: │
│ • Images ≤384px: 258 tokens │
│ • Larger images: 768×768 tiles, 258 tokens each │
│ • Video: 263 tokens/second │
│ • Audio: 32 tokens/second │
│ │
└─────────────────────────────────────────────────────────────────────────┘
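These per-image token counts translate directly into context-window budgeting. A minimal sketch using the figures from this section (ViT-style 16×16 patches plus a [CLS] token, and LLaVA-1.5's 576 tokens per image):

def vit_tokens(image_size: int = 224, patch_size: int = 16) -> int:
    """ViT-style patch tokens, plus 1 for the [CLS] token."""
    per_side = image_size // patch_size
    return per_side * per_side + 1
def llava_prompt_tokens(n_images: int, text_tokens: int,
                        tokens_per_image: int = 576) -> int:
    """LLaVA-1.5-style budget: visual tokens are prepended to the text tokens."""
    return n_images * tokens_per_image + text_tokens
print(vit_tokens())                  # 197 = 196 patches + [CLS]
print(llava_prompt_tokens(2, 1000))  # 2152 tokens for 2 images + 1K text tokens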
Audio Tokenization
┌─────────────────────────────────────────────────────────────────────────┐
│ AUDIO TOKENIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ APPROACH 1: MEL SPECTROGRAM + PATCHES │
│ ───────────────────────────────────── │
│ │
│ Audio waveform → Mel spectrogram → Patch embedding │
│ Similar to image tokenization │
│ Used by: Whisper (OpenAI), AudioLM │
│ │
│ Whisper: │
│ • 30s audio → 80 mel bins × 3000 frames │
│ • 2D convolutions → 1500 audio tokens │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ APPROACH 2: NEURAL AUDIO CODEC (EnCodec/DAC) │
│ ───────────────────────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Audio → Encoder → RVQ → Discrete codes │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Residual Vector Quantization (RVQ): │ │
│ │ Layer 1: Coarse features (prosody, speaker) │ │
│ │ Layer 2: Mid-level features │ │
│ │ ... │ │
│ │ Layer 8: Fine details │ │
│ │ │ │
│ │ Each layer: 1024-codebook quantization │ │
│ │ Result: 8 × (audio_length / stride) tokens │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ EnCodec (Meta): │
│     • 24kHz audio → 75 codec frames/second                              │
│     • 8 codebook layers → ~600 discrete codes/second at 6 kbps          │
│ • Can reconstruct high-quality audio │
│ │
│ Used by: MusicGen, AudioCraft, Bark │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ APPROACH 3: SEMANTIC + ACOUSTIC TOKENS │
│ ─────────────────────────────────────── │
│ │
│ Two-stage tokenization (AudioLM, VALL-E): │
│ │
│ 1. Semantic tokens: Content/meaning │
│ • From w2v-BERT or HuBERT │
│ • ~50 tokens/second │
│ • Language-model-friendly │
│ │
│ 2. Acoustic tokens: Sound quality │
│ • From EnCodec/SoundStream │
│ • ~200-600 tokens/second │
│ • For high-fidelity reconstruction │
│ │
│ Generate semantic first, then acoustic (coarse-to-fine) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
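As with images, these rates translate directly into context budgets. A minimal sketch using the EnCodec-style figures above (75 codec frames per second, 8 residual codebook layers):

def codec_tokens(seconds: float, frame_rate: int = 75, n_codebooks: int = 8) -> int:
    """Total discrete codes if all residual codebook layers are flattened."""
    return int(seconds * frame_rate) * n_codebooks
print(codec_tokens(10))                 # 6,000 codes for 10 s of audio
print(codec_tokens(10, n_codebooks=2))  # 1,500 codes keeping only the coarse layers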
Video Tokenization
# Video tokenization approaches
# Approach 1: Frame-by-frame (simple but expensive)
# Each frame → image tokenizer → many tokens
# 30fps × 10 seconds × 576 tokens/frame = 172,800 tokens!
# Approach 2: Temporal compression
# Sample keyframes, use temporal transformer
# Typical: 1-4 fps sampling
# Approach 3: 3D patches (VideoMAE, InternVideo)
def tokenize_video_3d_patches(
    video,              # NumPy array of shape (T, H, W, C)
    patch_size=16,      # Spatial patch size
    temporal_patch=2    # Temporal patch size
):
    """
    Tokenize video using non-overlapping 3D patches (NumPy reshape/transpose).
    Example: 16 frames × 224×224 video
    Patches: (16/2) × (224/16) × (224/16) = 8 × 14 × 14 = 1568 tokens
    """
    T, H, W, C = video.shape
# Number of patches in each dimension
n_t = T // temporal_patch # 8
n_h = H // patch_size # 14
n_w = W // patch_size # 14
# Reshape into patches
patches = video.reshape(
n_t, temporal_patch,
n_h, patch_size,
n_w, patch_size,
C
)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
patches = patches.reshape(n_t * n_h * n_w, -1)
# Linear projection to embeddings
# patches: (1568, temporal_patch × patch_size × patch_size × C)
# Project to: (1568, embed_dim)
return patches # 1568 video tokens
# Approach 4: Video codec tokens (VideoGPT, MAGVIT)
# Use VQ-VAE trained on video
# Discrete codebook for video "words"
Part XV: Production Considerations
Tokenization Caching
import hashlib
from typing import List, Optional
import tiktoken
import redis
class CachedTokenizer:
"""
Production tokenizer with multi-level caching.
"""
    def __init__(
        self,
        model: str = "gpt-4",
        redis_url: Optional[str] = None,
        local_cache_size: int = 10000
    ):
self.enc = tiktoken.encoding_for_model(model)
self.redis = redis.from_url(redis_url) if redis_url else None
self._local_cache = {}
self._cache_size = local_cache_size
def _hash_text(self, text: str) -> str:
"""Create cache key from text."""
return hashlib.sha256(text.encode()).hexdigest()[:16]
def encode(self, text: str) -> List[int]:
"""Encode with caching."""
cache_key = self._hash_text(text)
# Level 1: Local memory cache
if cache_key in self._local_cache:
return self._local_cache[cache_key]
# Level 2: Redis cache
if self.redis:
cached = self.redis.get(f"tok:{cache_key}")
if cached:
tokens = list(map(int, cached.decode().split(',')))
self._local_cache[cache_key] = tokens
return tokens
# Level 3: Compute
tokens = self.enc.encode(text)
# Store in caches
self._local_cache[cache_key] = tokens
if len(self._local_cache) > self._cache_size:
# Simple eviction (could use LRU)
self._local_cache.pop(next(iter(self._local_cache)))
if self.redis:
self.redis.setex(
f"tok:{cache_key}",
3600, # 1 hour TTL
','.join(map(str, tokens))
)
return tokens
def encode_batch(
self,
texts: List[str],
num_threads: int = 4
) -> List[List[int]]:
"""Batch encode with caching."""
results = [None] * len(texts)
uncached = []
# Check cache first
for i, text in enumerate(texts):
cache_key = self._hash_text(text)
if cache_key in self._local_cache:
results[i] = self._local_cache[cache_key]
else:
uncached.append((i, text))
# Batch encode uncached
if uncached:
uncached_texts = [text for _, text in uncached]
encoded = self.enc.encode_batch(uncached_texts, num_threads=num_threads)
for (i, text), tokens in zip(uncached, encoded):
results[i] = tokens
cache_key = self._hash_text(text)
self._local_cache[cache_key] = tokens
return results
def count_tokens(self, text: str) -> int:
"""Quick token count."""
return len(self.encode(text))
Streaming Tokenization
from typing import Iterator, Generator
import tiktoken
def stream_tokenize(
text_stream: Iterator[str],
model: str = "gpt-4",
chunk_size: int = 1000
) -> Generator[list, None, None]:
"""
Tokenize streaming text efficiently.
Handles the challenge of tokens that span chunk boundaries.
"""
enc = tiktoken.encoding_for_model(model)
buffer = ""
for chunk in text_stream:
buffer += chunk
if len(buffer) >= chunk_size:
# Tokenize all but the last few characters
# (in case a token spans the boundary)
safe_boundary = len(buffer) - 50 # Keep 50 char buffer
to_tokenize = buffer[:safe_boundary]
buffer = buffer[safe_boundary:]
tokens = enc.encode(to_tokenize)
yield tokens
# Tokenize remaining buffer
if buffer:
tokens = enc.encode(buffer)
yield tokens
def count_tokens_streaming(text_stream: Iterator[str]) -> int:
"""Count tokens in a stream without loading full text."""
total = 0
for token_batch in stream_tokenize(text_stream):
total += len(token_batch)
return total
# Usage with file streaming
def tokenize_large_file(file_path: str):
"""Tokenize a large file without loading into memory."""
def file_chunks(path, chunk_size=8192):
with open(path, 'r', encoding='utf-8') as f:
while chunk := f.read(chunk_size):
yield chunk
all_tokens = []
for token_batch in stream_tokenize(file_chunks(file_path)):
all_tokens.extend(token_batch)
return all_tokens
Token Budget Management
from dataclasses import dataclass
from typing import List, Optional
import tiktoken
@dataclass
class TokenBudget:
"""Manage token budgets for context windows."""
max_tokens: int
reserved_output: int = 1000
@property
def available_input(self) -> int:
return self.max_tokens - self.reserved_output
class ContextManager:
"""
Manage context window token budget.
"""
def __init__(
self,
model: str,
max_context: int,
reserved_output: int = 2000
):
self.enc = tiktoken.encoding_for_model(model)
self.budget = TokenBudget(max_context, reserved_output)
def count(self, text: str) -> int:
"""Count tokens in text."""
return len(self.enc.encode(text))
def fit_messages(
self,
messages: List[dict],
system_prompt: Optional[str] = None
) -> List[dict]:
"""
Fit messages into context window, truncating oldest if needed.
Returns messages that fit within budget.
"""
budget = self.budget.available_input
# Account for system prompt
if system_prompt:
system_tokens = self.count(system_prompt) + 10 # overhead
budget -= system_tokens
# Calculate tokens for each message
message_tokens = []
for msg in messages:
# Approximate token count including formatting
tokens = self.count(msg["content"]) + 4 # role + formatting
message_tokens.append(tokens)
# Always keep the most recent message
total = message_tokens[-1]
kept_indices = [len(messages) - 1]
# Add older messages until budget exhausted
for i in range(len(messages) - 2, -1, -1):
if total + message_tokens[i] <= budget:
total += message_tokens[i]
kept_indices.append(i)
else:
break
# Return messages in original order
kept_indices.sort()
return [messages[i] for i in kept_indices]
def truncate_to_fit(
self,
text: str,
max_tokens: int,
from_end: bool = False
) -> str:
"""
Truncate text to fit within token limit.
Args:
text: Text to truncate
max_tokens: Maximum tokens allowed
from_end: If True, keep end of text; if False, keep start
"""
tokens = self.enc.encode(text)
if len(tokens) <= max_tokens:
return text
if from_end:
truncated_tokens = tokens[-max_tokens:]
else:
truncated_tokens = tokens[:max_tokens]
return self.enc.decode(truncated_tokens)
def split_into_chunks(
self,
text: str,
chunk_tokens: int,
overlap_tokens: int = 0
) -> List[str]:
"""
Split text into chunks of specified token size.
"""
tokens = self.enc.encode(text)
chunks = []
start = 0
        while start < len(tokens):
            end = min(start + chunk_tokens, len(tokens))
            chunks.append(self.enc.decode(tokens[start:end]))
            if end == len(tokens):
                break  # done; avoids re-emitting the tail when overlap > 0
            start = end - overlap_tokens
        return chunks
# Usage
manager = ContextManager(
model="gpt-4",
max_context=128000,
reserved_output=4000
)
# Fit conversation into context
messages = [
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hi there!"},
# ... many more messages
]
fitted = manager.fit_messages(
messages,
system_prompt="You are a helpful assistant."
)
print(f"Kept {len(fitted)} of {len(messages)} messages")
Tokenization Monitoring
from dataclasses import dataclass, field
from typing import Dict, List
import time
@dataclass
class TokenizationMetrics:
"""Track tokenization metrics for monitoring."""
total_texts: int = 0
total_tokens: int = 0
total_chars: int = 0
total_time_ms: float = 0
errors: int = 0
# Per-language stats
by_language: Dict[str, dict] = field(default_factory=dict)
def record(
self,
text: str,
tokens: int,
time_ms: float,
language: str = "unknown"
):
self.total_texts += 1
self.total_tokens += tokens
self.total_chars += len(text)
self.total_time_ms += time_ms
if language not in self.by_language:
self.by_language[language] = {
"texts": 0, "tokens": 0, "chars": 0
}
self.by_language[language]["texts"] += 1
self.by_language[language]["tokens"] += tokens
self.by_language[language]["chars"] += len(text)
@property
def avg_tokens_per_char(self) -> float:
if self.total_chars == 0:
return 0
return self.total_tokens / self.total_chars
@property
def avg_time_per_text_ms(self) -> float:
if self.total_texts == 0:
return 0
return self.total_time_ms / self.total_texts
def report(self) -> dict:
return {
"total_texts": self.total_texts,
"total_tokens": self.total_tokens,
"total_chars": self.total_chars,
"avg_tokens_per_char": self.avg_tokens_per_char,
"avg_time_per_text_ms": self.avg_time_per_text_ms,
"compression_ratio": self.total_chars / max(self.total_tokens, 1),
"errors": self.errors,
"by_language": self.by_language
}
class MonitoredTokenizer:
"""Tokenizer with built-in monitoring."""
def __init__(self, model: str = "gpt-4"):
import tiktoken
self.enc = tiktoken.encoding_for_model(model)
self.metrics = TokenizationMetrics()
def encode(self, text: str, language: str = "en") -> List[int]:
start = time.time()
try:
tokens = self.enc.encode(text)
time_ms = (time.time() - start) * 1000
self.metrics.record(
text=text,
tokens=len(tokens),
time_ms=time_ms,
language=language
)
return tokens
except Exception as e:
self.metrics.errors += 1
raise
def get_metrics(self) -> dict:
return self.metrics.report()
# Usage
tokenizer = MonitoredTokenizer("gpt-4")
# Process texts (example (text, language) pairs; replace with real data)
texts_with_language = [("Hello, world!", "en"), ("你好，世界", "zh")]
for text, lang in texts_with_language:
    tokens = tokenizer.encode(text, language=lang)
# Get metrics
metrics = tokenizer.get_metrics()
print(f"Processed {metrics['total_texts']} texts")
print(f"Average compression: {metrics['compression_ratio']:.2f}x")
print(f"By language: {metrics['by_language']}")
Part XVI: Debugging Tokenization Issues
Common Issues and Solutions
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENIZATION DEBUGGING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ISSUE 1: UNEXPECTED TOKEN COUNT │
│ ─────────────────────────────── │
│ │
│ Symptom: "My 100-word text uses 500 tokens!" │
│ │
│ Common causes: │
│ • Non-English text (2-5× token inflation) │
│ • Lots of code/special characters │
│ • Emojis (1-11 tokens each!) │
│ • Unusual whitespace/formatting │
│ │
│ Debug: │
│ tokens = enc.encode(text) │
│ for t in tokens: │
│ print(repr(enc.decode([t]))) # See each token │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ISSUE 2: TOKENIZATION INCONSISTENCY │
│ ─────────────────────────────────── │
│ │
│ Symptom: Same text tokenizes differently │
│ │
│ Common causes: │
│ • Unicode normalization differences (NFC vs NFD) │
│ • Invisible characters (zero-width spaces, RTL marks) │
│ • Different line endings (\n vs \r\n) │
│ • Leading/trailing whitespace │
│ │
│ Debug: │
│ print([hex(ord(c)) for c in text]) # See exact characters │
│ import unicodedata │
│ print([unicodedata.name(c, 'UNKNOWN') for c in text]) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ISSUE 3: DECODE/ENCODE MISMATCH │
│ ─────────────────────────────── │
│ │
│ Symptom: decode(encode(text)) != text │
│ │
│ Common causes: │
│ • Invalid UTF-8 sequences in input │
│ • Characters outside tokenizer's training data │
│ • Special tokens being interpreted │
│ │
│ Debug: │
│ original = "test text" │
│ tokens = enc.encode(original) │
│ decoded = enc.decode(tokens) │
│ if original != decoded: │
│ for i, (o, d) in enumerate(zip(original, decoded)): │
│ if o != d: │
│ print(f"Diff at {i}: {repr(o)} vs {repr(d)}") │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ ISSUE 4: WRONG TOKENIZER FOR MODEL │
│ ────────────────────────────────── │
│ │
│ Symptom: Model output is garbage/repetitive │
│ │
│ Cause: Using tokenizer from different model │
│ │
│ Solution: ALWAYS match tokenizer to model │
│ │
│ # WRONG │
│ tok = AutoTokenizer.from_pretrained("bert-base") │
│ model = AutoModel.from_pretrained("gpt2") # Different! │
│ │
│ # RIGHT │
│ tok = AutoTokenizer.from_pretrained("gpt2") │
│ model = AutoModel.from_pretrained("gpt2") # Same! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Debugging Tools
import tiktoken
from typing import List, Tuple
def debug_tokenization(text: str, model: str = "gpt-4") -> dict:
"""
Comprehensive tokenization debugging.
"""
enc = tiktoken.encoding_for_model(model)
# Basic encoding
tokens = enc.encode(text)
# Decode each token
token_details = []
for t in tokens:
decoded = enc.decode([t])
token_bytes = enc.decode_single_token_bytes(t)
token_details.append({
"id": t,
"decoded": decoded,
"repr": repr(decoded),
"bytes": token_bytes.hex(),
"length": len(decoded)
})
# Find unusual tokens
unusual = []
for detail in token_details:
decoded = detail["decoded"]
# Check for non-printable characters
if any(ord(c) < 32 or ord(c) > 126 for c in decoded if c not in '\n\t'):
unusual.append(detail)
# Roundtrip check
decoded_text = enc.decode(tokens)
roundtrip_ok = decoded_text == text
return {
"original_text": text,
"original_length": len(text),
"token_count": len(tokens),
"tokens_per_char": len(tokens) / len(text) if text else 0,
"token_details": token_details,
"unusual_tokens": unusual,
"roundtrip_ok": roundtrip_ok,
"roundtrip_diff": None if roundtrip_ok else {
"original": repr(text),
"decoded": repr(decoded_text)
}
}
def visualize_tokens(text: str, model: str = "gpt-4") -> str:
"""
Visualize tokenization with colored output.
"""
enc = tiktoken.encoding_for_model(model)
tokens = enc.encode(text)
# ANSI color codes
colors = [
'\033[91m', # Red
'\033[92m', # Green
'\033[93m', # Yellow
'\033[94m', # Blue
'\033[95m', # Magenta
'\033[96m', # Cyan
]
reset = '\033[0m'
result = []
for i, t in enumerate(tokens):
color = colors[i % len(colors)]
decoded = enc.decode([t])
# Show spaces explicitly
display = decoded.replace(' ', '·').replace('\n', '↵\n')
result.append(f"{color}[{display}]{reset}")
return ''.join(result)
def find_token_boundaries(text: str, model: str = "gpt-4") -> List[int]:
"""
Find character positions where tokens start.
"""
enc = tiktoken.encoding_for_model(model)
tokens = enc.encode(text)
boundaries = [0]
current_pos = 0
for t in tokens:
decoded = enc.decode([t])
current_pos += len(decoded)
boundaries.append(current_pos)
return boundaries
def compare_tokenizers(text: str, models: List[str]) -> dict:
"""
Compare tokenization across different models.
"""
results = {}
for model in models:
try:
enc = tiktoken.encoding_for_model(model)
tokens = enc.encode(text)
results[model] = {
"token_count": len(tokens),
"tokens": tokens[:20], # First 20
"decoded": [enc.decode([t]) for t in tokens[:20]]
}
except Exception as e:
results[model] = {"error": str(e)}
return results
# Usage examples
if __name__ == "__main__":
test_text = "Hello, world! 你好世界 🎉"
# Debug tokenization
debug = debug_tokenization(test_text)
print(f"Token count: {debug['token_count']}")
print(f"Tokens per char: {debug['tokens_per_char']:.2f}")
print(f"Unusual tokens: {debug['unusual_tokens']}")
# Visualize
print("\nVisualization:")
print(visualize_tokens(test_text))
# Compare tokenizers
print("\nComparison:")
comparison = compare_tokenizers(test_text, ["gpt-3.5-turbo", "gpt-4", "gpt-4o"])
for model, result in comparison.items():
print(f" {model}: {result.get('token_count', 'N/A')} tokens")
Summary
Tokenization converts text to numbers that models can process. The key algorithms:
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY TAKEAWAYS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SUBWORD TOKENIZATION: │
│ • Balances vocabulary size and coverage │
│ • Handles unknown words via composition │
│ • Standard for all modern LLMs │
│ │
│ BPE (Byte Pair Encoding): │
│ • Iteratively merge most frequent pairs │
│ • Simple and effective │
│ • Used by: GPT series, LLaMA 3 │
│ │
│ WordPiece: │
│ • Merge based on likelihood, not frequency │
│ • Used by: BERT family │
│ │
│ Unigram: │
│ • Probabilistic model, prune vocabulary │
│ • Often more linguistically meaningful │
│     • Used by: T5, ALBERT, XLNet, mT5 (via SentencePiece)               │
│ │
│ SentencePiece: │
│ • Language-agnostic (no pre-tokenization) │
│ • Supports BPE and Unigram │
│ • Best for multilingual models │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ LIBRARIES: │
│ ────────── │
│ │
│ tiktoken: │
│ • OpenAI's fast BPE implementation (Rust) │
│ • Encodings: gpt2, cl100k_base, o200k_base │
│ • Best for OpenAI model compatibility │
│ │
│ HuggingFace Tokenizers: │
│ • Unified API for all algorithms │
│ • Train custom tokenizers from scratch │
│ • Offset tracking, batching, streaming │
│ │
│ SentencePiece: │
│ • Google's language-agnostic tokenizer │
│ • Best for multilingual and Asian languages │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ 2024-2025 TRENDS: │
│ • Larger vocabularies (100K-256K) for efficiency │
│ • Llama 3 switched SentencePiece → tiktoken-style BPE │
│ • Code-specialized tokenizers with FIM tokens │
│ • More special tokens for chat, tools, multimodal │
│ • Multimodal tokens for vision, audio, video │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PRACTICAL ADVICE: │
│ • Always use the model's exact tokenizer │
│ • Count tokens, not words, for context limits │
│ • Test tokenization for your specific use case │
│ • Cache tokenization results in production │
│ • Monitor token efficiency by language/domain │
│ │
└─────────────────────────────────────────────────────────────────────────┘