Mechanistic Interpretability: Understanding What's Really Happening Inside LLMs
A comprehensive introduction to mechanistic interpretability—the science of reverse-engineering neural networks to understand how they actually compute. From attention patterns to circuits to features, discover what's really happening inside language models.
Why Interpretability Matters
Large language models are simultaneously the most capable and least understood software systems humanity has built. We deploy them to write code, give medical advice, and make consequential decisions—yet we don't truly know how they work. We know the architecture, the training process, the loss function. But we don't know what a model "knows," how it reasons, or why it occasionally fails spectacularly.
This matters for several reasons:
Safety: If we can't understand what a model is computing, we can't verify it's safe. A model might learn to be deceptive, to pursue unintended goals, or to fail catastrophically in subtle edge cases—and we'd have no way to detect this from behavior alone.
Debugging: When models fail, interpretability can tell us why. Instead of blindly trying different prompts or training approaches, we could diagnose the actual computational failure.
Improvement: Understanding how models succeed and fail can guide architecture improvements, training refinements, and prompting strategies.
Trust: For high-stakes applications, we need to know when a model is genuinely reasoning versus pattern-matching in ways that might break.
Mechanistic interpretability is the science of opening the black box—reverse-engineering neural networks to understand the actual algorithms they implement.
What is Mechanistic Interpretability?
Mechanistic interpretability aims to understand neural networks at a mechanistic level: identifying the specific computations performed by individual components (neurons, attention heads, layers) and how these compose into algorithms.
The analogy is to reverse-engineering software. Given a compiled binary, a reverse engineer seeks to understand the algorithms, data structures, and logic the program implements. Similarly, given a trained neural network, a mechanistic interpretability researcher seeks to understand the features detected, the computations performed, and the algorithms executed.
This is distinct from other approaches to understanding ML systems:
Behavioral testing probes what a model does (inputs → outputs) without examining internals. Useful but limited—it can't explain why a model behaves as it does.
Attribution methods identify which inputs influenced outputs (saliency maps, attention visualization). These show what the model attended to but not how it processed that information.
Probing trains classifiers on internal representations to detect what information is encoded. This reveals what a model represents but not how it uses those representations.
Mechanistic interpretability goes deeper: it aims to understand the actual computations—the algorithms implemented by the weights and activations.
┌─────────────────────────────────────────────────────────────────────────────┐
│ INTERPRETABILITY APPROACHES COMPARED │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BEHAVIORAL TESTING │
│ ──────────────────── │
│ Question: "What does the model output for this input?" │
│ Method: Run many inputs, observe outputs │
│ Learns: Input-output mapping, failure modes │
│ Limitation: No insight into internal processing │
│ │
│ Example: Testing GPT on math problems → 85% accuracy on 2-digit addition │
│ (Tells us accuracy, not how it adds) │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ ATTRIBUTION / SALIENCY │
│ ──────────────────────── │
│ Question: "Which inputs influenced this output?" │
│ Method: Gradient-based attribution, attention weights │
│ Learns: Which tokens/features were important │
│ Limitation: Importance ≠ how information was processed │
│ │
│ Example: Attention visualizations showing model focuses on "not" in │
│ "This movie is not good" → "negative" │
│ (Tells us it saw "not", not how it processed negation) │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ PROBING │
│ ──────── │
│ Question: "What information is encoded in these representations?" │
│ Method: Train classifiers on intermediate activations │
│ Learns: What concepts are represented │
│ Limitation: Encoded ≠ used; classifier might extract unused info │
│ │
│ Example: Training POS-tag classifier on layer 6 → 95% accuracy │
│ (Tells us POS is encoded, not if/how model uses it) │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ MECHANISTIC INTERPRETABILITY │
│ ───────────────────────────── │
│ Question: "What algorithm does this model implement?" │
│ Method: Analyze individual components, trace information flow │
│ Learns: Actual computations and algorithms │
│ Limitation: Labor-intensive, may not scale to full models │
│ │
│ Example: Finding that attention head 5.7 copies tokens from positions │
│ matching a learned pattern, implementing "induction heads" │
│ (Tells us the actual mechanism) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The Superposition Hypothesis
Before diving into techniques, we need to understand a key challenge: superposition. Neural networks don't represent one concept per neuron. Instead, they represent many more concepts than they have neurons by encoding concepts as directions in activation space.
Features vs Neurons
Early interpretability work tried to understand what individual neurons detect. While some neurons are interpretable (the famous "cat neuron" in image models), most neurons don't correspond to clean concepts. A single neuron might activate for an apparently random collection of inputs.
The resolution is that models don't store one feature per neuron. They store features as directions in the activation space. A feature might be represented by a vector like [0.3, -0.1, 0.7, ...] across many neurons. Individual neurons participate in many features.
This is called superposition—more features are represented than there are dimensions, by exploiting the geometry of high-dimensional spaces.
Why Superposition Exists
Superposition likely arises because:
- There are more concepts to represent than neurons available. The world has countless concepts; models have limited neurons.
- Most concepts are sparse. Any given input only activates a small subset of possible concepts. "Legal contract language" features aren't needed when processing "cat pictures."
- High-dimensional geometry allows it. In high dimensions, you can fit exponentially many nearly-orthogonal directions. Features that rarely co-occur can share neurons without much interference.
Superposition is efficient but makes interpretation harder. We can't just ask "what does neuron 847 do?"—it participates in many features. We need methods to extract the actual features from the superposition.
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE SUPERPOSITION PROBLEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ NAIVE VIEW: One neuron = One concept │
│ ────────────────────────────────────── │
│ │
│ Neuron 1: "Cat detector" │
│ Neuron 2: "Legal language detector" │
│ Neuron 3: "Python code detector" │
│ ... │
│ │
│ Problem: Models have ~768-12288 dimensions but millions of concepts │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ REALITY: Features as directions in activation space │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Feature "cat" = direction [0.3, 0.1, -0.2, 0.5, ...] │
│ Feature "legal" = direction [-0.1, 0.4, 0.3, 0.2, ...] │
│ Feature "python" = direction [0.2, -0.3, 0.4, -0.1, ...] │
│ │
│ Each neuron participates in MANY features │
│ Neuron 1 = 0.3×"cat" - 0.1×"legal" + 0.2×"python" + ... │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ WHY THIS WORKS: │
│ ─────────────── │
│ │
│ High-dimensional spaces have MANY nearly-orthogonal directions │
│ │
│ In 2D: At most 2 orthogonal directions │
│   In 768D: Can fit ~10^6 directions that are all within ~10° of orthogonal  │
│ │
│ If features rarely co-occur (cat + legal language is rare), │
│ they can share dimensions without interfering │
│ │
│ Sparse activation = efficient use of limited dimensions │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
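The geometric claim in the diagram is easy to check numerically. Below is a minimal NumPy sketch, with illustrative sizes, showing that random unit vectors in a high-dimensional space are nearly orthogonal, which is what lets many feature directions coexist with little interference.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_features: int, dim: int) -> float:
    """Largest |cosine similarity| between any pair of random unit vectors."""
    vecs = rng.normal(size=(n_features, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T          # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)   # ignore each vector's similarity with itself
    return float(np.abs(sims).max())

print("2 dims,   10 directions:", round(max_abs_cosine(10, 2), 3))        # severe overlap
print("768 dims, 1000 directions:", round(max_abs_cosine(1000, 768), 3))  # nearly orthogonal
```

In low dimensions the best pair of directions still overlaps heavily; in 768 dimensions even a thousand random directions interfere only slightly, and sparse co-occurrence reduces the interference further.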
Sparse Autoencoders: Extracting Features from Superposition
If features are directions in activation space, how do we find them? This is where Sparse Autoencoders (SAEs) come in—currently the most promising technique for extracting interpretable features from neural networks.
The SAE Architecture
A sparse autoencoder is trained to:
- Take model activations as input
- Expand them to a much larger hidden layer (e.g., 768 → 32,000)
- Enforce sparsity (only a few hidden units active)
- Reconstruct the original activations
The key insight: the sparsity constraint forces the autoencoder to find a basis where features are separated. Each hidden unit ideally corresponds to one feature.
Why Sparsity Helps
Without sparsity, an autoencoder could use any basis for its hidden layer—including keeping features entangled. The sparsity constraint changes this:
If only a handful of hidden units can be active while the autoencoder must still reconstruct inputs accurately, the cheapest solution is for each hidden unit to align with one underlying feature, since the features themselves occur sparsely. A hidden unit that mixed several features would force extra units to activate just to cancel out the unwanted components, wasting the sparsity budget.
The result is a dictionary of features: each hidden unit in the sparse autoencoder corresponds to one (hopefully interpretable) feature, and the activation of that hidden unit tells you how strongly that feature is present.
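To make the architecture and objective concrete, here is a minimal PyTorch sketch of a sparse autoencoder. The dimensions (768 → 32,000) and the L1 coefficient are illustrative placeholders rather than tuned values, and real implementations (e.g., SAE Lens) add details such as decoder weight normalization that are omitted here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 32_000):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # Encoder: project into the wide feature basis; ReLU zeroes out inactive features
        features = torch.relu(self.encoder(activations))
        # Decoder: reconstruct the original activations from the active features
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # Loss = reconstruction_error + lambda * sparsity_penalty (L1 on feature activations)
    reconstruction_error = (activations - reconstruction).pow(2).sum(dim=-1).mean()
    sparsity_penalty = features.abs().sum(dim=-1).mean()
    return reconstruction_error + l1_coeff * sparsity_penalty

# Usage: train on activations collected from one site in the model (e.g., a residual stream)
sae = SparseAutoencoder()
acts = torch.randn(64, 768)            # stand-in for a batch of real model activations
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```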
What SAEs Find
Recent work applying SAEs to language models has found remarkably interpretable features:
Entity features: Specific people, places, organizations (a "Golden Gate Bridge" feature, an "Eiffel Tower" feature)
Concept features: Abstract concepts (a "deception" feature, an "uncertainty" feature)
Syntax features: Grammatical structures (a "list item" feature, an "if-then" feature)
Style features: Tone and register (a "formal language" feature, a "sarcasm" feature)
Task features: Operations the model performs (an "addition" feature, a "translation" feature)
The Anthropic team famously found a "Golden Gate Bridge" feature in Claude. When artificially amplified, the model became obsessed with the Golden Gate Bridge, inserting references to it in almost every response. This demonstrates that features are not just descriptive—they causally influence model behavior.
┌─────────────────────────────────────────────────────────────────────────────┐
│ SPARSE AUTOENCODER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Model activations │
│ (768 dimensions) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ ENCODER (768 → 32,000) │ │
│ │ │ │
│ │ Linear projection + ReLU │ │
│ │ hidden = ReLU(W_enc @ activations + b_enc) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ SPARSE HIDDEN LAYER (32,000 units) │ │
│ │ │ │
│ │ Only ~50-200 units are non-zero (sparse!) │ │
│ │ │ │
│ │ [0, 0, 0.8, 0, 0, 0, 1.2, 0, ..., 0, 0.3, 0, 0] │ │
│ │ ↑ ↑ ↑ │ │
│ │ "Python" "Function" "Security" │ │
│ │ feature feature feature │ │
│ │ │ │
│ │ Each active unit = a feature present in the input │ │
│ │ Activation magnitude = feature strength │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ DECODER (32,000 → 768) │ │
│ │ │ │
│ │ Linear projection (reconstructs original activations) │ │
│ │ reconstructed = W_dec @ hidden + b_dec │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Reconstructed activations │
│ (768 dimensions) │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ TRAINING OBJECTIVE: │
│ │
│ Loss = reconstruction_error + λ × sparsity_penalty │
│ │
│ reconstruction_error = ||activations - reconstructed||² │
│ sparsity_penalty = Σ|hidden| (L1 norm encourages zeros) │
│ │
│ λ controls trade-off: higher = sparser but worse reconstruction │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ INTERPRETING FEATURES: │
│ │
│ For each hidden unit (feature): │
│ 1. Find inputs that maximally activate it │
│ 2. Look for patterns in those inputs │
│ 3. Name the feature based on the pattern │
│ │
│ Example: Hidden unit 7,432 activates on: │
│ - "The for loop iterates over..." │
│ - "Looping through the array..." │
│ - "for i in range(10):" │
│ → Feature 7,432 = "iteration/looping concept" │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
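The interpretation loop in the diagram can be automated to a first approximation. The sketch below assumes a hypothetical `get_activations(text)` helper that returns a model's activations for a text, plus the `sae` object from the earlier sketch; it simply collects the texts that most strongly activate one feature so a human can look for a shared pattern and name it.

```python
import heapq

def top_examples_for_feature(texts, feature_idx: int, k: int = 10):
    """Return the k texts whose maximum activation of one SAE feature is highest."""
    scored = []
    for text in texts:
        acts = get_activations(text)      # hypothetical helper: [n_tokens, d_model]
        _, feats = sae(acts)              # SAE features per token: [n_tokens, d_hidden]
        score = feats[:, feature_idx].max().item()
        scored.append((score, text))
    # Inspect these by hand: whatever the top examples share suggests what the feature means
    return heapq.nlargest(k, scored)
```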
Circuits: How Features Compose into Algorithms
Features are the vocabulary of neural network computation. Circuits are the grammar—how features combine and transform through the network to implement algorithms.
What is a Circuit?
A circuit is a subgraph of the neural network that implements a specific computation. It consists of:
- Input features: What information the circuit reads
- Intermediate computations: How attention heads and MLPs process that information
- Output features: What the circuit produces
Circuits can be simple (a single attention head copying a token) or complex (multiple layers collaborating to perform multi-step reasoning).
The Induction Head Circuit
The best-understood circuit in language models is the induction head—a mechanism for in-context learning discovered by Anthropic researchers.
Induction heads implement a simple but powerful pattern: [A][B] ... [A] → [B]. If the model sees the sequence "Harry Potter ... Harry," it predicts "Potter" because it saw "Harry Potter" earlier.
The circuit involves two attention heads working together:
Previous Token Head (Layer N): Attends from each token to the token before it. After this head, each position contains information about the preceding token.
Induction Head (Layer N+1): Attends from the current position to earlier positions that were preceded by a matching token. Because of the previous token head, it can search for "positions preceded by a token matching the current token."
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDUCTION HEAD CIRCUIT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Pattern: [A][B] ... [A] → [B] │
│ │
│ Example: "Harry Potter is a wizard. Harry" → "Potter" │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Input sequence: "Harry Potter is a wizard . Harry" │
│ Positions: 0 1 2 3 4 5 6 │
│ │
│ STEP 1: Previous Token Head (Layer N) │
│ ───────────────────────────────────── │
│ │
│ Each position attends to the previous position │
│ Position 1 ("Potter") ← attends to → Position 0 ("Harry") │
│ │
│ After this head, Position 1 knows "I come after 'Harry'" │
│ │
│ Stored in residual stream: │
│ Pos 1: [Potter embedding] + [info: preceded by "Harry"] │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ STEP 2: Induction Head (Layer N+1) │
│ ────────────────────────────────── │
│ │
│ Position 6 ("Harry") asks: "Where have I seen my token before, │
│ and what came after it?" │
│ │
│ Query (pos 6): "Looking for positions preceded by 'Harry'" │
│ Key (pos 1): "I am preceded by 'Harry'" │
│ ↓ │
│ MATCH! │
│ ↓ │
│ Value (pos 1): "Potter" embedding │
│ │
│ Position 6 receives "Potter" information → predicts "Potter" │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ WHY THIS MATTERS: │
│ │
│ This is a general-purpose in-context learning mechanism. │
│ Any [A][B] pattern in context becomes a "rule" the model follows. │
│ │
│ "The word for cat is gato. The word for dog is" → "perro"? │
│ Model sees pattern: "word for X is Y" → copies Y after X │
│ │
│ Induction heads are key to few-shot learning! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
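Induction heads can be found empirically: feed the model a random token sequence repeated twice and look for heads that attend from each token in the second copy back to the token that followed its first occurrence. The sketch below assumes the TransformerLens API (`run_with_cache`, the `"pattern"` cache key) and uses an arbitrary detection threshold.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

seq_len = 50
rand = torch.randint(1000, 10_000, (1, seq_len))
tokens = torch.cat([rand, rand], dim=1)          # A B C ... A B C ...

_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]         # [head, query_pos, key_pos]
    # For a query at position q in the second copy, an induction head attends to
    # position q - (seq_len - 1): the token right after the previous occurrence.
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    induction_score = diag[:, -seq_len:].mean(dim=-1)   # average over the second copy
    for head, score in enumerate(induction_score.tolist()):
        if score > 0.4:                          # illustrative threshold
            print(f"Head {layer}.{head} looks like an induction head (score {score:.2f})")
```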
More Complex Circuits
Beyond induction heads, researchers have identified circuits for:
Indirect Object Identification: "When Mary and John went to the store, John gave a drink to" → "Mary". The circuit must work out which name is the indirect object by finding the name that is not repeated in the final clause.
Greater-Than Comparison: "The war lasted from 1812 to 18..." → predicts "13" or higher. The circuit must compare the candidate end year against the start year and suppress endings that would make the war end before it began.
Gendered Pronoun Resolution: "The doctor said she..." Understanding when "she" is correct requires resolving coreference.
Factual Recall: "The Eiffel Tower is in..." → "Paris". The model must look up stored knowledge.
Each circuit involves specific attention heads and MLP layers working together, and researchers can often identify which components are responsible.
Attention Patterns: The Information Routing System
Attention is how transformers move information between positions. Understanding attention patterns reveals how information flows through the network.
Types of Attention Heads
Not all attention heads do the same thing. Researchers have identified several common patterns:
Position heads: Attend based on relative position (e.g., always attend to the previous token, or to the first token).
Syntax heads: Attend based on syntactic relationships (e.g., verbs attend to subjects).
Copy heads: Copy information from attended positions to the current position.
Induction heads: Implement the [A][B]...[A] → [B] pattern described above.
Retrieval heads: Search for specific content in the context (like looking up a definition given earlier).
Attention Head Analysis
To understand what an attention head does:
- Visualize attention patterns across many inputs
- Look for consistent patterns (does it always attend to certain positions or content types?)
- Test interventions (what happens if you ablate this head?)
- Examine QKV matrices (what queries does it form? What keys does it attend to? What values does it copy?)
For example, if you find that a head consistently attends from pronouns to their antecedents, you've identified a coreference resolution head.
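The first two steps can be done programmatically. The sketch below again assumes the TransformerLens API; the layer and head indices are arbitrary examples, and in practice you would browse many heads (CircuitsVis renders these patterns interactively).

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The doctor said she would arrive soon")
str_tokens = model.to_str_tokens("The doctor said she would arrive soon")

_, cache = model.run_with_cache(tokens)

layer, head = 5, 7                                 # arbitrary head to inspect
pattern = cache["pattern", layer][0, head]         # [query_pos, key_pos], post-softmax

# For each query token, show which earlier token it attends to most strongly
for q, q_tok in enumerate(str_tokens):
    k = pattern[q].argmax().item()
    print(f"{q_tok!r:>12} attends most to {str_tokens[k]!r} ({pattern[q, k]:.2f})")
```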
The Residual Stream View
A useful mental model for transformer computation is the residual stream. Instead of thinking of transformers as layer-by-layer processing, think of a stream of information that flows through the network.
The Residual Stream Mental Model
At each position, there's a residual stream—a vector that accumulates information through the network:
- Initial state: Token embedding
- After each layer: Residual stream += attention_output + mlp_output
- Final state: Used for prediction
Each attention head and MLP can read from and write to the residual stream. They don't directly communicate with each other—they communicate through this shared stream.
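In code, the residual stream view looks like the schematic below. This is not a real model, just a sketch of the accumulation pattern (LayerNorm and per-position details are omitted for clarity).

```python
def transformer_forward(token_embedding, layers, unembed):
    residual = token_embedding                           # initial state: the token embedding
    for layer in layers:
        residual = residual + layer.attention(residual)  # attention reads the stream, writes back
        residual = residual + layer.mlp(residual)        # MLP reads the stream, writes back
    return unembed(residual)                             # final state drives the prediction
```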
This view clarifies:
- Attention heads read from the residual stream (via QKV projections) and write back to it (via output projection)
- MLPs read from the residual stream and write processed information back
- Information persists across layers in the residual stream
- Later components can use earlier outputs because they're accumulated in the stream
Implications
The residual stream view explains several phenomena:
Skip connections matter: Without residuals, information would have to pass through every layer. Residuals let information skip ahead.
Layer order is flexible: Components don't strictly depend on adjacent layers. An attention head in layer 10 can use information written by layer 2.
Distributed computation: A "computation" might involve components spread across many layers, all contributing to the same residual dimensions.
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESIDUAL STREAM VISUALIZATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Token: "The" │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ RESIDUAL STREAM (accumulates information through layers) │ │
│ │ │ │
│ │ ══════════════════════════════════════════════════════════════════ │ │
│ │ │ │
│ │ Initial: embed("The") = [0.2, -0.1, 0.5, ...] │ │
│ │ │ │
│ │ │ │ │
│ │ │◄────── Layer 0 Attention writes: +[0.1, 0.0, -0.1, ...] │ │
│ │ │◄────── Layer 0 MLP writes: +[0.0, 0.2, 0.1, ...] │ │
│ │ ▼ │ │
│ │ │ │
│ │ After L0: [0.3, 0.1, 0.5, ...] │ │
│ │ │ │
│ │ │ │ │
│ │ │◄────── Layer 1 Attention reads stream, writes updates │ │
│ │ │◄────── Layer 1 MLP reads stream, writes updates │ │
│ │ ▼ │ │
│ │ │ │
│ │ After L1: [0.4, -0.1, 0.6, ...] │ │
│ │ │ │
│ │ │ │ │
│ │ : (more layers) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ │ │
│ │ Final: [0.8, 0.3, -0.2, ...] → Unembedding → Next token prediction │ │
│ │ │ │
│ │ ══════════════════════════════════════════════════════════════════ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ KEY INSIGHTS: │
│ │
│ 1. Information PERSISTS: What layer 2 writes is still there for layer 10 │
│ │
│ 2. PARALLEL computation: Attention and MLP both read from same stream │
│ │
│ 3. SELECTIVE reading/writing: Each component uses projections to │
│ interact with specific "subspaces" of the residual stream │
│ │
│ 4. SKIP connections implicit: Early layers can influence final output │
│ without passing through middle layers │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Activation Patching: Causal Intervention
To move beyond correlation to causation, researchers use activation patching—replacing activations from one input with activations from another and observing the effect.
The Patching Method
Given two inputs that produce different outputs:
- Clean input: "The Eiffel Tower is in" → "Paris"
- Corrupted input: "The Colosseum is in" → "Rome"
To find which components cause the different outputs:
- Run the model on both inputs, saving all activations
- Run the model on the clean input, but patch a specific activation with its corrupted version
- Observe how the output changes
If patching component X causes the output to flip from "Paris" to "Rome," that component is causally responsible for the difference.
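Here is a hedged sketch of one such experiment (resample patching at the final token position), assuming the TransformerLens API for hooks and caching; the layer choice is illustrative, and a real experiment would sweep over layers and positions.

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_tokens = model.to_tokens("The Eiffel Tower is in")
corrupt_tokens = model.to_tokens("The Colosseum is in")

# Step 1: run the corrupted prompt and cache all of its activations
_, corrupt_cache = model.run_with_cache(corrupt_tokens)

def patch_final_position(resid, hook):
    # Overwrite the clean run's residual stream at the last token with the corrupted
    # run's value at its last token (the prompts may tokenize to different lengths).
    resid[:, -1, :] = corrupt_cache[hook.name][:, -1, :]
    return resid

# Step 2: re-run the clean prompt with the patch applied at one layer
layer = 8                                                # illustrative choice
hook_name = utils.get_act_name("resid_pre", layer)
patched_logits = model.run_with_hooks(
    clean_tokens,
    fwd_hooks=[(hook_name, patch_final_position)],
)

# Step 3: observe the effect on the competing answers
paris = model.to_single_token(" Paris")
rome = model.to_single_token(" Rome")
logit_diff = patched_logits[0, -1, paris] - patched_logits[0, -1, rome]
print(f"Paris-vs-Rome logit difference after patching layer {layer}: {logit_diff.item():.2f}")
```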
Activation Patching in Practice
Researchers use patching to:
Identify important components: Which attention heads matter for a specific task?
Trace information flow: Where does the model "look up" the answer?
Verify circuit hypotheses: If we think head 5.3 does X, does patching it disrupt X?
Localize knowledge: Where is the fact "Eiffel Tower → Paris" stored?
Patching comes in several flavors:
- Zero patching: Replace activation with zeros
- Mean patching: Replace with average activation across inputs
- Noise patching: Add noise to activations
- Resample patching: Replace with activation from a different input
The Logit Lens
A simpler, complementary technique is the logit lens: apply the unembedding matrix (which normally converts the final layer's output into next-token logits) to intermediate layers as well.
This reveals what the model would predict if it stopped at layer N. By comparing predictions across layers, we see how the model's "belief" evolves:
- Layer 0: Random-ish predictions
- Layer 5: Starting to narrow down
- Layer 10: Confident prediction emerging
- Layer 12: Final answer
The logit lens shows when in the network a prediction is formed, though not how.
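A minimal logit-lens sketch, again assuming the TransformerLens cache names used above: take the residual stream after each layer, pass it through the final LayerNorm and the unembedding, and decode the top token at each depth.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1, :]     # residual stream at the last position
    logits = model.unembed(model.ln_final(resid))    # project into vocabulary space
    top_id = logits.argmax(dim=-1).item()
    print(f"After layer {layer:2d}: model would predict {model.tokenizer.decode([top_id])!r}")
```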
Practical Applications
Mechanistic interpretability isn't just academic—it has practical applications:
Model Debugging
When a model fails, interpretability can identify why:
- Which features activated inappropriately?
- Which attention patterns went wrong?
- Where did the circuit break down?
For example, if a model incorrectly says "The capital of Australia is Sydney," interpretability might reveal:
- A "major city" feature activated strongly
- The "capital lookup" attention head attended to Sydney mentions
- The factual recall circuit was overridden by frequency information
Targeted Model Editing
If we can identify where knowledge is stored, we can edit it:
- ROME (Rank-One Model Editing) modifies specific factual associations
- MEMIT extends this to multiple facts
- These rely on interpretability to locate what to edit
Safety Analysis
Interpretability can detect concerning behaviors:
- Features related to deception or manipulation
- Circuits that bypass safety training
- Hidden capabilities not evident from behavior
Anthropic's research on "sleeper agents" (models trained to behave differently in certain contexts) used interpretability to understand how such behaviors are implemented.
Prompt Engineering Insights
Understanding how models process prompts can improve prompt design:
- Which instructions activate helpfulness features?
- How do chain-of-thought prompts change computation?
- Why do certain prompt formats work better?
Current Limitations and Challenges
Mechanistic interpretability is promising but faces significant challenges:
Scale
Current techniques work best on small models (GPT-2 scale). Frontier models have:
- 100x more parameters
- More complex emergent behaviors
- Denser feature superposition
Scaling interpretability to frontier models remains an open challenge.
Completeness
We understand individual circuits, but not how they compose into complete model behavior. A full understanding would require:
- A complete feature dictionary
- All circuits identified
- How circuits interact
We're far from this for any model.
Automation
Current interpretability is labor-intensive:
- Manually examining features
- Hand-tracing circuits
- Interpreting attention patterns
Automated interpretability tools are improving but limited.
Validation
How do we know our interpretations are correct? A feature we call "Python code" might actually be "ASCII text with brackets." Validation methods include:
- Causal intervention (does manipulating the feature produce expected effects?)
- Prediction (can we predict feature activation on new inputs?)
- Consistency (do multiple analysis methods agree?)
The Path Forward
Despite limitations, the field is progressing rapidly:
Better SAEs: Improved training techniques yield cleaner features with less superposition.
Automated Circuit Discovery: Tools to automatically identify circuits, reducing manual effort.
Scaling Laws for Interpretability: Understanding how interpretability difficulty scales with model size.
Integration with Training: Using interpretability insights during training, not just after.
Safety Applications: Detecting deception, ensuring robustness, validating alignment.
The goal is ambitious: fully understand how neural networks compute, with the same completeness that we understand traditional algorithms. We're far from there, but progress is accelerating.
Tools and Resources
For practitioners interested in exploring mechanistic interpretability:
Libraries
TransformerLens: A library for mechanistic interpretability of transformers. Provides hooks to access intermediate activations, attention patterns, and more.
SAE Lens: Tools for training and analyzing sparse autoencoders on language models.
CircuitsVis: Visualization tools for attention patterns and circuits.
pyvene: A library for performing interventions on neural network activations.
Key Papers
- "A Mathematical Framework for Transformer Circuits" (Elhage et al.)
- "In-context Learning and Induction Heads" (Olsson et al.)
- "Toy Models of Superposition" (Elhage et al.)
- "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (Anthropic)
- "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (Anthropic)
Communities
- EleutherAI Discord (interpretability channel)
- Alignment Forum
- LessWrong (AI alignment posts)