TransformerLens: Introduction
TransformerLens makes transformer internals accessible. It's the microscope for mechanistic interpretability.
Why TransformerLens?
HuggingFace gives you models. TransformerLens lets you see inside them.
from transformer_lens import HookedTransformer
# Load a model with hooks everywhere
model = HookedTransformer.from_pretrained("gpt2-small")
# Run and cache ALL intermediate activations
output, cache = model.run_with_cache("Hello world")
# Access anything
embeddings = cache["embed"]
attention_patterns = cache["pattern", 0] # Layer 0
mlp_activations = cache["mlp_out", 5] # Layer 5
The HookedTransformer
A GPT-style model with hooks at every interesting point:
model = HookedTransformer.from_pretrained("gpt2-small")
print(model.cfg)
# HookedTransformerConfig(
# n_layers=12,
# n_heads=12,
# d_model=768,
# d_head=64,
# d_mlp=3072,
# ...
# )
Available models: GPT-2, GPT-Neo, Pythia, LLaMA, and more.
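The same loading call works for other supported families; a minimal sketch, assuming "pythia-70m" is available in your TransformerLens version:
# Any name from TransformerLens's supported-model registry works here;
# "pythia-70m" is just a small example checkpoint.
pythia = HookedTransformer.from_pretrained("pythia-70m")
print(pythia.cfg.n_layers, pythia.cfg.d_model)  # architecture details differ from GPT-2 small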
Basic Usage
# Forward pass
logits = model("Hello world") # (1, seq_len, vocab_size)
# Get loss
loss = model("Hello world", return_type="loss")
# Get both
logits, loss = model("Hello world", return_type="both")
# Just run (for hooks)
model("Hello world", return_type=None)
Tokenization Helpers
# String to tokens
tokens = model.to_tokens("Hello world") # tensor([[50256, 15496, 995]])
# Tokens to strings
strings = model.to_str_tokens("Hello world") # ['<|endoftext|>', 'Hello', ' world']
# Tokens back to string
text = model.to_string(tokens[0]) # '<|endoftext|>Hello world'
Note: TransformerLens prepends a BOS token by default. Pass prepend_bos=False to disable this.
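To see the effect of this default, compare the two tokenizations (a quick sketch):
model.to_str_tokens("Hello world")                     # ['<|endoftext|>', 'Hello', ' world']
model.to_str_tokens("Hello world", prepend_bos=False)  # ['Hello', ' world']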
Accessing Activations
The cache stores everything:
output, cache = model.run_with_cache("The cat sat on the mat")
# Token embeddings (positional embeddings are stored separately as "hook_pos_embed")
cache["hook_embed"] # (batch, seq, d_model)
# Attention patterns
cache["pattern", layer] # (batch, n_heads, seq_q, seq_k)
# Attention output
cache["attn_out", layer] # (batch, seq, d_model)
# MLP activations
cache["mlp_out", layer] # (batch, seq, d_model)
# Residual stream at any point
cache["resid_pre", layer] # Before attention
cache["resid_mid", layer] # After attention, before MLP
cache["resid_post", layer] # After MLP
Visualizing Attention
import circuitsvis as cv
output, cache = model.run_with_cache("The cat sat on the mat")
# Get attention pattern for layer 0
pattern = cache["pattern", 0] # (batch, n_heads, seq_q, seq_k)
# Visualize
cv.attention.attention_patterns(
tokens=model.to_str_tokens("The cat sat on the mat"),
attention=pattern[0], # First batch element
)
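circuitsvis renders an interactive view inside a notebook. Outside a notebook, a plain matplotlib heatmap of one head is a workable fallback; a sketch, with the head index chosen arbitrarily:
import matplotlib.pyplot as plt

str_tokens = model.to_str_tokens("The cat sat on the mat")
head = 0  # arbitrary head to inspect
plt.imshow(pattern[0, head].detach().cpu(), cmap="Blues")
plt.xticks(range(len(str_tokens)), str_tokens, rotation=90)
plt.yticks(range(len(str_tokens)), str_tokens)
plt.xlabel("Key position")
plt.ylabel("Query position")
plt.title(f"Layer 0, head {head}")
plt.show()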
Direct Parameter Access
# All parameters accessible as attributes
W_E = model.W_E # (vocab, d_model) - token embeddings
W_pos = model.W_pos # (max_seq, d_model) - position embeddings
W_U = model.W_U # (d_model, vocab) - unembedding
# Per-layer weights
W_Q = model.W_Q # (n_layers, n_heads, d_model, d_head)
W_K = model.W_K # (n_layers, n_heads, d_model, d_head)
W_V = model.W_V # (n_layers, n_heads, d_model, d_head)
W_O = model.W_O # (n_layers, n_heads, d_head, d_model)
# MLP weights
W_in = model.W_in # (n_layers, d_model, d_mlp)
W_out = model.W_out # (n_layers, d_mlp, d_model)
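Because the weights are stored per head, it is straightforward to form the low-rank per-head matrices used in circuit-style analysis, such as the QK and OV matrices. A sketch, with the layer and head indices chosen arbitrarily:
layer, head = 0, 7  # arbitrary indices for illustration

# QK circuit: which query/key directions in the residual stream this head matches on
QK = model.W_Q[layer, head] @ model.W_K[layer, head].T   # (d_model, d_model), rank <= d_head

# OV circuit: how the information this head reads gets written back to the residual stream
OV = model.W_V[layer, head] @ model.W_O[layer, head]     # (d_model, d_model), rank <= d_head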
Making Predictions
# Get top predictions at each position
logits = model("The capital of France is")
# Last position predictions
last_logits = logits[0, -1] # (vocab_size,)
top_tokens = last_logits.topk(10)
for i, (val, idx) in enumerate(zip(top_tokens.values, top_tokens.indices)):
    print(f"{i+1}. {model.to_string([idx.item()])!r}: {val:.2f}")
# 1. ' Paris': 18.42
# 2. ' the': 14.21
# 3. ' France': 13.87
# ...
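To turn logits into probabilities, or to check one particular continuation, apply a softmax over the vocabulary dimension; a sketch (the token ' Paris' is just an example):
import torch

probs = torch.softmax(last_logits, dim=-1)      # (vocab_size,)
paris_id = model.to_single_token(" Paris")      # id of the single token ' Paris'
print(f"P(' Paris') = {probs[paris_id].item():.3f}")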
Capstone Connection
TransformerLens for sycophancy analysis:
# Compare activations for honest vs sycophantic prompts
honest_prompt = "Is 2+2=5? Be honest."
sycophantic_prompt = "I believe 2+2=5. Am I right?"
_, cache_honest = model.run_with_cache(honest_prompt)
_, cache_syco = model.run_with_cache(sycophantic_prompt)
# Where do they differ? (The prompts tokenize to different lengths,
# so compare the residual stream at the final token of each.)
for layer in range(model.cfg.n_layers):
    diff = (cache_honest["resid_post", layer][0, -1] - cache_syco["resid_post", layer][0, -1]).norm()
    print(f"Layer {layer}: {diff:.2f}")
This is the starting point for mechanistic analysis of sycophancy.
🎓 Tyla's Exercise
For GPT-2 small, how many total hook points are there? (List the types: embed, attention patterns, residual stream, etc.)
Why does TransformerLens separate W_Q, W_K, W_V, W_O when HuggingFace often combines them? What interpretability benefit does this provide?
The residual stream can be written as $x_{\text{final}} = x_0 + \sum_{l} \text{attn}_l + \sum_{l} \text{mlp}_l$. How does this decomposition help with attribution?
💻 Aaliyah's Exercise
Explore a model's behavior:
def analyze_prompt(model, prompt):
    """
    1. Get logits and cache
    2. Print top 5 predictions for each position
    3. Visualize attention for all layers
    4. Find which head attends most to the first token
    """
    pass
def compare_prompts(model, prompt1, prompt2):
    """
    1. Run both prompts
    2. Find the layer with largest residual stream difference
    3. Find the attention head with most different patterns
    4. Report which tokens cause the biggest prediction difference
    """
    pass
📚 Maneesha's Reflection
TransformerLens makes internals visible. What might we miss by only looking at what we can visualize?
The decision to make all weights easily accessible is a design choice. What are the trade-offs of this level of transparency?
How would you design a tool for interpretability of a model architecture you hadn't seen before?