Mechanistic Interpretability: Understanding What's Really Happening Inside LLMs
A comprehensive introduction to mechanistic interpretability—the science of reverse-engineering neural networks to understand how they actually compute. From attention patterns to circuits to features, discover what's really happening inside language models.
Why Interpretability Matters
Large language models are simultaneously the most capable and least understood software systems humanity has built. We deploy them to write code, give medical advice, and make consequential decisions—yet we don't truly know how they work. We know the architecture, the training process, the loss function. But we don't know what a model "knows," how it reasons, or why it occasionally fails spectacularly.
This matters for several reasons:
Safety: If we can't understand what a model is computing, we can't verify it's safe. A model might learn to be deceptive, to pursue unintended goals, or to fail catastrophically in subtle edge cases—and we'd have no way to detect this from behavior alone.
Debugging: When models fail, interpretability can tell us why. Instead of blindly trying different prompts or training approaches, we could diagnose the actual computational failure.
Improvement: Understanding how models succeed and fail can guide architecture improvements, training refinements, and prompting strategies.
Trust: For high-stakes applications, we need to know when a model is genuinely reasoning versus pattern-matching in ways that might break.
Mechanistic interpretability is the science of opening the black box—reverse-engineering neural networks to understand the actual algorithms they implement.
What is Mechanistic Interpretability?
Mechanistic interpretability aims to understand neural networks at a mechanistic level: identifying the specific computations performed by individual components (neurons, attention heads, layers) and how these compose into algorithms.
The analogy is to reverse-engineering software. Given a compiled binary, a reverse engineer seeks to understand the algorithms, data structures, and logic the program implements. Similarly, given a trained neural network, a mechanistic interpretability researcher seeks to understand the features detected, the computations performed, and the algorithms executed.
This is distinct from other approaches to understanding ML systems:
Behavioral testing probes what a model does (inputs → outputs) without examining internals. Useful but limited—it can't explain why a model behaves as it does.
Attribution methods identify which inputs influenced outputs (saliency maps, attention visualization). These show what the model attended to but not how it processed that information.
Probing trains classifiers on internal representations to detect what information is encoded. This reveals what a model represents but not how it uses those representations.
Mechanistic interpretability goes deeper: it aims to understand the actual computations—the algorithms implemented by the weights and activations.
┌─────────────────────────────────────────────────────────────────────────────┐
│ INTERPRETABILITY APPROACHES COMPARED │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BEHAVIORAL TESTING │
│ ──────────────────── │
│ Question: "What does the model output for this input?" │
│ Method: Run many inputs, observe outputs │
│ Learns: Input-output mapping, failure modes │
│ Limitation: No insight into internal processing │
│ │
│ Example: Testing GPT on math problems → 85% accuracy on 2-digit addition │
│ (Tells us accuracy, not how it adds) │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ ATTRIBUTION / SALIENCY │
│ ──────────────────────── │
│ Question: "Which inputs influenced this output?" │
│ Method: Gradient-based attribution, attention weights │
│ Learns: Which tokens/features were important │
│ Limitation: Importance ≠ how information was processed │
│ │
│ Example: Attention visualizations showing model focuses on "not" in │
│ "This movie is not good" → "negative" │
│ (Tells us it saw "not", not how it processed negation) │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ PROBING │
│ ──────── │
│ Question: "What information is encoded in these representations?" │
│ Method: Train classifiers on intermediate activations │
│ Learns: What concepts are represented │
│ Limitation: Encoded ≠ used; classifier might extract unused info │
│ │
│ Example: Training POS-tag classifier on layer 6 → 95% accuracy │
│ (Tells us POS is encoded, not if/how model uses it) │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ MECHANISTIC INTERPRETABILITY │
│ ───────────────────────────── │
│ Question: "What algorithm does this model implement?" │
│ Method: Analyze individual components, trace information flow │
│ Learns: Actual computations and algorithms │
│ Limitation: Labor-intensive, may not scale to full models │
│ │
│ Example: Finding that attention head 5.7 copies tokens from positions │
│ matching a learned pattern, implementing "induction heads" │
│ (Tells us the actual mechanism) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The Superposition Hypothesis
Before diving into techniques, we need to understand a key challenge: superposition. Neural networks don't represent one concept per neuron. Instead, they represent many more concepts than they have neurons by encoding concepts as directions in activation space.
Features vs Neurons
Early interpretability work tried to understand what individual neurons detect. While some neurons are interpretable (the famous "cat neuron" in image models), most neurons don't correspond to clean concepts. A single neuron might activate for an apparently random collection of inputs.
The resolution is that models don't store one feature per neuron. They store features as directions in the activation space. A feature might be represented by a vector like [0.3, -0.1, 0.7, ...] across many neurons. Individual neurons participate in many features.
This is called superposition—more features are represented than there are dimensions, by exploiting the geometry of high-dimensional spaces.
Why Superposition Exists
Superposition likely arises because:
- There are more concepts to represent than neurons available. The world has countless concepts; models have limited neurons.
- Most concepts are sparse. Any given input only activates a small subset of possible concepts. "Legal contract language" features aren't needed when processing "cat pictures."
- High-dimensional geometry allows it. In high dimensions, you can fit exponentially many nearly-orthogonal directions. Features that rarely co-occur can share neurons without much interference.
Superposition is efficient but makes interpretation harder. We can't just ask "what does neuron 847 do?"—it participates in many features. We need methods to extract the actual features from the superposition.
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE SUPERPOSITION PROBLEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ NAIVE VIEW: One neuron = One concept │
│ ────────────────────────────────────── │
│ │
│ Neuron 1: "Cat detector" │
│ Neuron 2: "Legal language detector" │
│ Neuron 3: "Python code detector" │
│ ... │
│ │
│ Problem: Models have ~768-12288 dimensions but millions of concepts │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ REALITY: Features as directions in activation space │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Feature "cat" = direction [0.3, 0.1, -0.2, 0.5, ...] │
│ Feature "legal" = direction [-0.1, 0.4, 0.3, 0.2, ...] │
│ Feature "python" = direction [0.2, -0.3, 0.4, -0.1, ...] │
│ │
│ Each neuron participates in MANY features │
│ Neuron 1 = 0.3×"cat" - 0.1×"legal" + 0.2×"python" + ... │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ WHY THIS WORKS: │
│ ─────────────── │
│ │
│ High-dimensional spaces have MANY nearly-orthogonal directions │
│ │
│ In 2D: At most 2 orthogonal directions │
│   In 768D: Can fit ~10^6 directions that are all within ~10° of orthogonal  │
│ │
│ If features rarely co-occur (cat + legal language is rare), │
│ they can share dimensions without interfering │
│ │
│ Sparse activation = efficient use of limited dimensions │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
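The geometric claim in the diagram is easy to check numerically. Below is a minimal NumPy sketch, with illustrative sizes, showing that random unit vectors in a high-dimensional space are nearly orthogonal, which is what lets many feature directions coexist with little interference.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_features: int, dim: int) -> float:
    """Largest |cosine similarity| between any pair of random unit vectors."""
    vecs = rng.normal(size=(n_features, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T          # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)   # ignore each vector's similarity with itself
    return float(np.abs(sims).max())

print("2 dims,   10 directions:", round(max_abs_cosine(10, 2), 3))        # severe overlap
print("768 dims, 1000 directions:", round(max_abs_cosine(1000, 768), 3))  # nearly orthogonal
```

In low dimensions the best pair of directions still overlaps heavily; in 768 dimensions even a thousand random directions interfere only slightly, and sparse co-occurrence reduces the interference further.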
Sparse Autoencoders: Extracting Features from Superposition
If features are directions in activation space, how do we find them? This is where Sparse Autoencoders (SAEs) come in—currently the most promising technique for extracting interpretable features from neural networks.
The SAE Architecture
A sparse autoencoder is trained to:
- Take model activations as input
- Expand them to a much larger hidden layer (e.g., 768 → 32,000)
- Enforce sparsity (only a few hidden units active)
- Reconstruct the original activations
The key insight: the sparsity constraint forces the autoencoder to find a basis where features are separated. Each hidden unit ideally corresponds to one feature.
Why Sparsity Helps
Without sparsity, an autoencoder could use any basis for its hidden layer—including keeping features entangled. The sparsity constraint changes this:
If only a handful of hidden units can be active while the autoencoder must still reconstruct inputs accurately, the cheapest solution is for each hidden unit to align with one underlying feature, since the features themselves occur sparsely. A hidden unit that mixed several features would force extra units to activate just to cancel out the unwanted components, wasting the sparsity budget.
The result is a dictionary of features: each hidden unit in the sparse autoencoder corresponds to one (hopefully interpretable) feature, and the activation of that hidden unit tells you how strongly that feature is present.
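To make the architecture and objective concrete, here is a minimal PyTorch sketch of a sparse autoencoder. The dimensions (768 → 32,000) and the L1 coefficient are illustrative placeholders rather than tuned values, and real implementations (e.g., SAE Lens) add details such as decoder weight normalization that are omitted here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 32_000):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # Encoder: project into the wide feature basis; ReLU zeroes out inactive features
        features = torch.relu(self.encoder(activations))
        # Decoder: reconstruct the original activations from the active features
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # Loss = reconstruction_error + lambda * sparsity_penalty (L1 on feature activations)
    reconstruction_error = (activations - reconstruction).pow(2).sum(dim=-1).mean()
    sparsity_penalty = features.abs().sum(dim=-1).mean()
    return reconstruction_error + l1_coeff * sparsity_penalty

# Usage: train on activations collected from one site in the model (e.g., a residual stream)
sae = SparseAutoencoder()
acts = torch.randn(64, 768)            # stand-in for a batch of real model activations
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```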
What SAEs Find
Recent work applying SAEs to language models has found remarkably interpretable features:
Entity features: Specific people, places, organizations (a "Golden Gate Bridge" feature, an "Eiffel Tower" feature)
Concept features: Abstract concepts (a "deception" feature, an "uncertainty" feature)
Syntax features: Grammatical structures (a "list item" feature, an "if-then" feature)
Style features: Tone and register (a "formal language" feature, a "sarcasm" feature)
Task features: Operations the model performs (an "addition" feature, a "translation" feature)
The Anthropic team famously found a "Golden Gate Bridge" feature in Claude. When artificially amplified, the model became obsessed with the Golden Gate Bridge, inserting references to it in almost every response. This demonstrates that features are not just descriptive—they causally influence model behavior.
┌─────────────────────────────────────────────────────────────────────────────┐
│ SPARSE AUTOENCODER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Model activations │
│ (768 dimensions) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ ENCODER (768 → 32,000) │ │
│ │ │ │
│ │ Linear projection + ReLU │ │
│ │ hidden = ReLU(W_enc @ activations + b_enc) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ SPARSE HIDDEN LAYER (32,000 units) │ │
│ │ │ │
│ │ Only ~50-200 units are non-zero (sparse!) │ │
│ │ │ │
│ │ [0, 0, 0.8, 0, 0, 0, 1.2, 0, ..., 0, 0.3, 0, 0] │ │
│ │ ↑ ↑ ↑ │ │
│ │ "Python" "Function" "Security" │ │
│ │ feature feature feature │ │
│ │ │ │
│ │ Each active unit = a feature present in the input │ │
│ │ Activation magnitude = feature strength │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ DECODER (32,000 → 768) │ │
│ │ │ │
│ │ Linear projection (reconstructs original activations) │ │
│ │ reconstructed = W_dec @ hidden + b_dec │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Reconstructed activations │
│ (768 dimensions) │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ TRAINING OBJECTIVE: │
│ │
│ Loss = reconstruction_error + λ × sparsity_penalty │
│ │
│ reconstruction_error = ||activations - reconstructed||² │
│ sparsity_penalty = Σ|hidden| (L1 norm encourages zeros) │
│ │
│ λ controls trade-off: higher = sparser but worse reconstruction │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ INTERPRETING FEATURES: │
│ │
│ For each hidden unit (feature): │
│ 1. Find inputs that maximally activate it │
│ 2. Look for patterns in those inputs │
│ 3. Name the feature based on the pattern │
│ │
│ Example: Hidden unit 7,432 activates on: │
│ - "The for loop iterates over..." │
│ - "Looping through the array..." │
│ - "for i in range(10):" │
│ → Feature 7,432 = "iteration/looping concept" │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
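The interpretation loop in the diagram can be automated to a first approximation. The sketch below assumes a hypothetical `get_activations(text)` helper that returns a model's activations for a text, plus the `sae` object from the earlier sketch; it simply collects the texts that most strongly activate one feature so a human can look for a shared pattern and name it.

```python
import heapq

def top_examples_for_feature(texts, feature_idx: int, k: int = 10):
    """Return the k texts whose maximum activation of one SAE feature is highest."""
    scored = []
    for text in texts:
        acts = get_activations(text)      # hypothetical helper: [n_tokens, d_model]
        _, feats = sae(acts)              # SAE features per token: [n_tokens, d_hidden]
        score = feats[:, feature_idx].max().item()
        scored.append((score, text))
    # Inspect these by hand: whatever the top examples share suggests what the feature means
    return heapq.nlargest(k, scored)
```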
Circuits: How Features Compose into Algorithms
Features are the vocabulary of neural network computation. Circuits are the grammar—how features combine and transform through the network to implement algorithms.
What is a Circuit?
A circuit is a subgraph of the neural network that implements a specific computation. It consists of:
- Input features: What information the circuit reads
- Intermediate computations: How attention heads and MLPs process that information
- Output features: What the circuit produces
Circuits can be simple (a single attention head copying a token) or complex (multiple layers collaborating to perform multi-step reasoning).
The Induction Head Circuit
The best-understood circuit in language models is the induction head—a mechanism for in-context learning discovered by Anthropic researchers.
Induction heads implement a simple but powerful pattern: [A][B] ... [A] → [B]. If the model sees the sequence "Harry Potter ... Harry," it predicts "Potter" because it saw "Harry Potter" earlier.
The circuit involves two attention heads working together:
Previous Token Head (Layer N): Attends from each token to the token before it. After this head, each position contains information about the preceding token.
Induction Head (Layer N+1): Attends from the current position to earlier positions that were preceded by a matching token. Because of the previous token head, it can search for "positions preceded by a token matching the current token."
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDUCTION HEAD CIRCUIT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Pattern: [A][B] ... [A] → [B] │
│ │
│ Example: "Harry Potter is a wizard. Harry" → "Potter" │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Input sequence: "Harry Potter is a wizard . Harry" │
│ Positions: 0 1 2 3 4 5 6 │
│ │
│ STEP 1: Previous Token Head (Layer N) │
│ ───────────────────────────────────── │
│ │
│ Each position attends to the previous position │
│ Position 1 ("Potter") ← attends to → Position 0 ("Harry") │
│ │
│ After this head, Position 1 knows "I come after 'Harry'" │
│ │
│ Stored in residual stream: │
│ Pos 1: [Potter embedding] + [info: preceded by "Harry"] │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ STEP 2: Induction Head (Layer N+1) │
│ ────────────────────────────────── │
│ │
│ Position 6 ("Harry") asks: "Where have I seen my token before, │
│ and what came after it?" │
│ │
│ Query (pos 6): "Looking for positions preceded by 'Harry'" │
│ Key (pos 1): "I am preceded by 'Harry'" │
│ ↓ │
│ MATCH! │
│ ↓ │
│ Value (pos 1): "Potter" embedding │
│ │
│ Position 6 receives "Potter" information → predicts "Potter" │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ WHY THIS MATTERS: │
│ │
│ This is a general-purpose in-context learning mechanism. │
│ Any [A][B] pattern in context becomes a "rule" the model follows. │
│ │
│ "The word for cat is gato. The word for dog is" → "perro"? │
│ Model sees pattern: "word for X is Y" → copies Y after X │
│ │
│ Induction heads are key to few-shot learning! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
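Induction heads can be found empirically: feed the model a random token sequence repeated twice and look for heads that attend from each token in the second copy back to the token that followed its first occurrence. The sketch below assumes the TransformerLens API (`run_with_cache`, the `"pattern"` cache key) and uses an arbitrary detection threshold.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

seq_len = 50
rand = torch.randint(1000, 10_000, (1, seq_len))
tokens = torch.cat([rand, rand], dim=1)          # A B C ... A B C ...

_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]         # [head, query_pos, key_pos]
    # For a query at position q in the second copy, an induction head attends to
    # position q - (seq_len - 1): the token right after the previous occurrence.
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    induction_score = diag[:, -seq_len:].mean(dim=-1)   # average over the second copy
    for head, score in enumerate(induction_score.tolist()):
        if score > 0.4:                          # illustrative threshold
            print(f"Head {layer}.{head} looks like an induction head (score {score:.2f})")
```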
More Complex Circuits
Beyond induction heads, researchers have identified circuits for:
Indirect Object Identification: "When Mary and John went to the store, John gave a drink to" → "Mary". The circuit must work out which name is the indirect object by finding the name that is not repeated in the final clause.
Greater-Than Comparison: "The war lasted from 1812 to 18..." → predicts "13" or higher. The circuit must compare the candidate end year against the start year and suppress endings that would make the war end before it began.
Gendered Pronoun Resolution: "The doctor said she..." Understanding when "she" is correct requires resolving coreference.
Factual Recall: "The Eiffel Tower is in..." → "Paris". The model must look up stored knowledge.
Each circuit involves specific attention heads and MLP layers working together, and researchers can often identify which components are responsible.
Attention Patterns: The Information Routing System
Attention is how transformers move information between positions. Understanding attention patterns reveals how information flows through the network.
Types of Attention Heads
Not all attention heads do the same thing. Researchers have identified several common patterns:
Position heads: Attend based on relative position (e.g., always attend to the previous token, or to the first token).
Syntax heads: Attend based on syntactic relationships (e.g., verbs attend to subjects).
Copy heads: Copy information from attended positions to the current position.
Induction heads: Implement the [A][B]...[A] → [B] pattern described above.
Retrieval heads: Search for specific content in the context (like looking up a definition given earlier).
Attention Head Analysis
To understand what an attention head does:
- Visualize attention patterns across many inputs
- Look for consistent patterns (does it always attend to certain positions or content types?)
- Test interventions (what happens if you ablate this head?)
- Examine QKV matrices (what queries does it form? What keys does it attend to? What values does it copy?)
For example, if you find that a head consistently attends from pronouns to their antecedents, you've identified a coreference resolution head.
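The first two steps can be done programmatically. The sketch below again assumes the TransformerLens API; the layer and head indices are arbitrary examples, and in practice you would browse many heads (CircuitsVis renders these patterns interactively).

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The doctor said she would arrive soon")
str_tokens = model.to_str_tokens("The doctor said she would arrive soon")

_, cache = model.run_with_cache(tokens)

layer, head = 5, 7                                 # arbitrary head to inspect
pattern = cache["pattern", layer][0, head]         # [query_pos, key_pos], post-softmax

# For each query token, show which earlier token it attends to most strongly
for q, q_tok in enumerate(str_tokens):
    k = pattern[q].argmax().item()
    print(f"{q_tok!r:>12} attends most to {str_tokens[k]!r} ({pattern[q, k]:.2f})")
```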
The Residual Stream View
A useful mental model for transformer computation is the residual stream. Instead of thinking of transformers as layer-by-layer processing, think of a stream of information that flows through the network.
The Residual Stream Mental Model
At each position, there's a residual stream—a vector that accumulates information through the network:
- Initial state: Token embedding
- After each layer: Residual stream += attention_output + mlp_output
- Final state: Used for prediction
Each attention head and MLP can read from and write to the residual stream. They don't directly communicate with each other—they communicate through this shared stream.
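In code, the residual stream view looks like the schematic below. This is not a real model, just a sketch of the accumulation pattern (LayerNorm and per-position details are omitted for clarity).

```python
def transformer_forward(token_embedding, layers, unembed):
    residual = token_embedding                           # initial state: the token embedding
    for layer in layers:
        residual = residual + layer.attention(residual)  # attention reads the stream, writes back
        residual = residual + layer.mlp(residual)        # MLP reads the stream, writes back
    return unembed(residual)                             # final state drives the prediction
```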
This view clarifies:
- Attention heads read from the residual stream (via QKV projections) and write back to it (via output projection)
- MLPs read from the residual stream and write processed information back
- Information persists across layers in the residual stream
- Later components can use earlier outputs because they're accumulated in the stream
Implications
The residual stream view explains several phenomena:
Skip connections matter: Without residuals, information would have to pass through every layer. Residuals let information skip ahead.
Layer order is flexible: Components don't strictly depend on adjacent layers. An attention head in layer 10 can use information written by layer 2.
Distributed computation: A "computation" might involve components spread across many layers, all contributing to the same residual dimensions.
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESIDUAL STREAM VISUALIZATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Token: "The" │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ RESIDUAL STREAM (accumulates information through layers) │ │
│ │ │ │
│ │ ══════════════════════════════════════════════════════════════════ │ │
│ │ │ │
│ │ Initial: embed("The") = [0.2, -0.1, 0.5, ...] │ │
│ │ │ │
│ │ │ │ │
│ │ │◄────── Layer 0 Attention writes: +[0.1, 0.0, -0.1, ...] │ │
│ │ │◄────── Layer 0 MLP writes: +[0.0, 0.2, 0.1, ...] │ │
│ │ ▼ │ │
│ │ │ │
│ │ After L0: [0.3, 0.1, 0.5, ...] │ │
│ │ │ │
│ │ │ │ │
│ │ │◄────── Layer 1 Attention reads stream, writes updates │ │
│ │ │◄────── Layer 1 MLP reads stream, writes updates │ │
│ │ ▼ │ │
│ │ │ │
│ │ After L1: [0.4, -0.1, 0.6, ...] │ │
│ │ │ │
│ │ │ │ │
│ │ : (more layers) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ │ │
│ │ Final: [0.8, 0.3, -0.2, ...] → Unembedding → Next token prediction │ │
│ │ │ │
│ │ ══════════════════════════════════════════════════════════════════ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ KEY INSIGHTS: │
│ │
│ 1. Information PERSISTS: What layer 2 writes is still there for layer 10 │
│ │
│ 2. PARALLEL computation: Attention and MLP both read from same stream │
│ │
│ 3. SELECTIVE reading/writing: Each component uses projections to │
│ interact with specific "subspaces" of the residual stream │
│ │
│ 4. SKIP connections implicit: Early layers can influence final output │
│ without passing through middle layers │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Activation Patching: Causal Intervention
To move beyond correlation to causation, researchers use activation patching—replacing activations from one input with activations from another and observing the effect.
The Patching Method
Given two inputs that produce different outputs:
- Clean input: "The Eiffel Tower is in" → "Paris"
- Corrupted input: "The Colosseum is in" → "Rome"
To find which components cause the different outputs:
- Run the model on both inputs, saving all activations
- Run the model on the clean input, but patch a specific activation with its corrupted version
- Observe how the output changes
If patching component X causes the output to flip from "Paris" to "Rome," that component is causally responsible for the difference.
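Here is a hedged sketch of one such experiment (resample patching at the final token position), assuming the TransformerLens API for hooks and caching; the layer choice is illustrative, and a real experiment would sweep over layers and positions.

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_tokens = model.to_tokens("The Eiffel Tower is in")
corrupt_tokens = model.to_tokens("The Colosseum is in")

# Step 1: run the corrupted prompt and cache all of its activations
_, corrupt_cache = model.run_with_cache(corrupt_tokens)

def patch_final_position(resid, hook):
    # Overwrite the clean run's residual stream at the last token with the corrupted
    # run's value at its last token (the prompts may tokenize to different lengths).
    resid[:, -1, :] = corrupt_cache[hook.name][:, -1, :]
    return resid

# Step 2: re-run the clean prompt with the patch applied at one layer
layer = 8                                                # illustrative choice
hook_name = utils.get_act_name("resid_pre", layer)
patched_logits = model.run_with_hooks(
    clean_tokens,
    fwd_hooks=[(hook_name, patch_final_position)],
)

# Step 3: observe the effect on the competing answers
paris = model.to_single_token(" Paris")
rome = model.to_single_token(" Rome")
logit_diff = patched_logits[0, -1, paris] - patched_logits[0, -1, rome]
print(f"Paris-vs-Rome logit difference after patching layer {layer}: {logit_diff.item():.2f}")
```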
Activation Patching in Practice
Researchers use patching to:
Identify important components: Which attention heads matter for a specific task?
Trace information flow: Where does the model "look up" the answer?
Verify circuit hypotheses: If we think head 5.3 does X, does patching it disrupt X?
Localize knowledge: Where is the fact "Eiffel Tower → Paris" stored?
Patching comes in several flavors:
- Zero patching: Replace activation with zeros
- Mean patching: Replace with average activation across inputs
- Noise patching: Add noise to activations
- Resample patching: Replace with activation from a different input
The Logit Lens
A simpler, complementary technique is the logit lens: apply the unembedding matrix (which normally converts the final layer's output into next-token logits) to intermediate layers as well.
This reveals what the model would predict if it stopped at layer N. By comparing predictions across layers, we see how the model's "belief" evolves:
- Layer 0: Random-ish predictions
- Layer 5: Starting to narrow down
- Layer 10: Confident prediction emerging
- Layer 12: Final answer
The logit lens shows when in the network a prediction is formed, though not how.
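A minimal logit-lens sketch, again assuming the TransformerLens cache names used above: take the residual stream after each layer, pass it through the final LayerNorm and the unembedding, and decode the top token at each depth.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1, :]     # residual stream at the last position
    logits = model.unembed(model.ln_final(resid))    # project into vocabulary space
    top_id = logits.argmax(dim=-1).item()
    print(f"After layer {layer:2d}: model would predict {model.tokenizer.decode([top_id])!r}")
```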
Practical Applications
Mechanistic interpretability isn't just academic—it has practical applications:
Model Debugging
When a model fails, interpretability can identify why:
- Which features activated inappropriately?
- Which attention patterns went wrong?
- Where did the circuit break down?
For example, if a model incorrectly says "The capital of Australia is Sydney," interpretability might reveal:
- A "major city" feature activated strongly
- The "capital lookup" attention head attended to Sydney mentions
- The factual recall circuit was overridden by frequency information
Targeted Model Editing
If we can identify where knowledge is stored, we can edit it:
- ROME (Rank-One Model Editing) modifies specific factual associations
- MEMIT extends this to multiple facts
- These rely on interpretability to locate what to edit
Safety Analysis
Interpretability can detect concerning behaviors:
- Features related to deception or manipulation
- Circuits that bypass safety training
- Hidden capabilities not evident from behavior
Anthropic's research on "sleeper agents" (models trained to behave differently in certain contexts) used interpretability to understand how such behaviors are implemented.
Prompt Engineering Insights
Understanding how models process prompts can improve prompt design:
- Which instructions activate helpfulness features?
- How do chain-of-thought prompts change computation?
- Why do certain prompt formats work better?
Current Limitations and Challenges
Mechanistic interpretability is promising but faces significant challenges:
Scale
Current techniques work best on small models (GPT-2 scale). Frontier models have:
- 100x more parameters
- More complex emergent behaviors
- Denser feature superposition
Scaling interpretability to frontier models remains an open challenge.
Completeness
We understand individual circuits, but not how they compose into complete model behavior. A full understanding would require:
- A complete feature dictionary
- All circuits identified
- How circuits interact
We're far from this for any model.
Automation
Current interpretability is labor-intensive:
- Manually examining features
- Hand-tracing circuits
- Interpreting attention patterns
Automated interpretability tools are improving but limited.
Validation
How do we know our interpretations are correct? A feature we call "Python code" might actually be "ASCII text with brackets." Validation methods include:
- Causal intervention (does manipulating the feature produce expected effects?)
- Prediction (can we predict feature activation on new inputs?)
- Consistency (do multiple analysis methods agree?)
The Path Forward
Despite limitations, the field is progressing rapidly:
Better SAEs: Improved training techniques yield cleaner features with less superposition.
Automated Circuit Discovery: Tools to automatically identify circuits, reducing manual effort.
Scaling Laws for Interpretability: Understanding how interpretability difficulty scales with model size.
Integration with Training: Using interpretability insights during training, not just after.
Safety Applications: Detecting deception, ensuring robustness, validating alignment.
The goal is ambitious: fully understand how neural networks compute, with the same completeness that we understand traditional algorithms. We're far from there, but progress is accelerating.
Tools and Resources
For practitioners interested in exploring mechanistic interpretability:
Libraries
TransformerLens: A library for mechanistic interpretability of transformers. Provides hooks to access intermediate activations, attention patterns, and more.
SAE Lens: Tools for training and analyzing sparse autoencoders on language models.
CircuitsVis: Visualization tools for attention patterns and circuits.
pyvene: A library for performing interventions on neural network activations.
Key Papers
- "A Mathematical Framework for Transformer Circuits" (Elhage et al.)
- "In-context Learning and Induction Heads" (Olsson et al.)
- "Toy Models of Superposition" (Elhage et al.)
- "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (Anthropic)
- "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (Anthropic)
Communities
- EleutherAI Discord (interpretability channel)
- Alignment Forum
- LessWrong (AI alignment posts)