Indirect Object Identification: A Complete Circuit
The IOI circuit, mapped out in GPT-2 small by Wang et al. (2022), is one of the most thoroughly reverse-engineered circuits in a language model. Let's walk through it.
The IOI Task
Complete sentences like:
"When Mary and John went to the store, John gave a drink to ___"
↓
Mary
The model must:
- Identify the two names (Mary, John)
- Notice which name is repeated (John)
- Predict the non-repeated name (Mary)
Why IOI?
This task is perfect for interpretability:
- Clear ground truth: We know the correct answer
- Easy to measure: Logit difference between Mary and John
- Crisp structure: Grammar is well-defined
- Non-trivial: Requires tracking identity across tokens
The Metric: Logit Difference
def logit_difference(model, prompt, correct, incorrect):
    """
    Positive = model prefers correct answer
    Negative = model prefers incorrect answer
    """
    logits = model(prompt)[0, -1]  # Last position
    correct_idx = model.to_single_token(correct)
    incorrect_idx = model.to_single_token(incorrect)
    return logits[correct_idx] - logits[incorrect_idx]
# Example
prompt = "When John and Mary went to the store, John gave a drink to"
diff = logit_difference(model, prompt, " Mary", " John")
# Result: ~6.0 (model strongly prefers Mary)
The IOI Dataset
We create matched prompts:
prompt_format = "When {name1} and {name2} went to the {place}, {name1} gave a {object} to"
# Example pairs
("When John and Mary went to the store, John gave a drink to", " Mary"),
("When Mary and John went to the store, Mary gave a drink to", " John"),
The correct answer is always the name that appears only once (the indirect object, IO); the repeated name is the subject (S).
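Below is a minimal sketch of how such a dataset could be generated from the template above. The name, place, and object lists are illustrative (not the ones from the original paper); names are chosen to be single tokens so the logit-difference metric stays well defined.
# Sketch: build matched IOI prompts from prompt_format above.
import itertools

names = ["John", "Mary", "Tom", "Sarah"]   # illustrative single-token names
places = ["store", "park"]
objects = ["drink", "book"]

ioi_examples = []
for name1, name2 in itertools.permutations(names, 2):
    for place in places:
        for obj in objects:
            prompt = prompt_format.format(name1=name1, name2=name2, place=place, object=obj)
            # name1 is repeated (the subject S), so the correct answer is name2 (the IO)
            ioi_examples.append((prompt, " " + name2, " " + name1))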
The Circuit Overview
┌──────────────────────────────────────────────────────────┐
│                       IOI CIRCUIT                        │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  1. PREVIOUS TOKEN HEADS (Layers 2, 4)                   │
│     "What token came before me?"                         │
│     Copy the token at position i into position i+1       │
│                                                          │
│  2. DUPLICATE TOKEN HEADS (Layers 0, 3)                  │
│     "Is my token repeated elsewhere?"                    │
│     Find the earlier occurrence of the repeated name     │
│                                                          │
│  3. INDUCTION HEADS (Layers 5, 6)                        │
│     "Have I seen this name before?"                      │
│     Detect the repeated name by composing with the       │
│     Previous Token Heads                                 │
│                                                          │
│  4. S-INHIBITION HEADS (Layers 7, 8)                     │
│     "Which name is repeated?"                            │
│     Tell the Name Movers at END not to attend to S       │
│                                                          │
│  5. NAME MOVER HEADS (Layers 9, 10)                      │
│     "Copy the non-repeated name to the output"           │
│     At END, attend to the IO token and boost its logit   │
│                                                          │
└──────────────────────────────────────────────────────────┘
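For the analyses below it helps to collect these head labels in one place. This is only a partial listing, restricted to the example heads named in this section (the full circuit contains 26 heads across 7 roles):
# Partial listing of IOI circuit heads as (layer, head), restricted to the
# examples discussed in this section; the full circuit has 26 heads.
IOI_HEADS = {
    "duplicate_token": [(0, 1)],
    "s_inhibition": [(7, 3), (7, 9), (8, 6)],
    "name_mover": [(9, 9), (10, 0)],
    "backup_name_mover": [(10, 1), (10, 2)],
    "negative_name_mover": [(10, 7), (11, 10)],
}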
Duplicate Token Heads
These heads identify repeated tokens:
# Head 0.1 has a distinctive pattern: at the position of the
# second "John", it attends back to the first "John".
#
# Attention pattern visualization:
#
#   "When John and Mary went to the store, John gave..."
#         ↑                               ↑
#         └───────────────────────────────┘
#       Head 0.1: query at the second "John", key at the first
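A quick way to check this yourself. The sketch below assumes `model` is a TransformerLens `HookedTransformer` (e.g. GPT-2 small) loaded earlier; token positions are found by string matching rather than hard-coded:
# Sketch: inspect head 0.1's attention from the second "John" back to the first.
prompt = "When John and Mary went to the store, John gave a drink to"
_, cache = model.run_with_cache(prompt)

str_tokens = model.to_str_tokens(prompt)   # includes the BOS token at position 0
john_positions = [i for i, tok in enumerate(str_tokens) if tok == " John"]
s1_pos, s2_pos = john_positions            # first and second occurrences of the subject

# hook_pattern has shape [batch, n_heads, query_pos, key_pos]
pattern = cache["blocks.0.attn.hook_pattern"][0, 1]
print(f"Attention from second 'John' to first 'John': {pattern[s2_pos, s1_pos].item():.2f}")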
S-Inhibition Heads
These suppress the repeated name:
# S-Inhibition heads (e.g., 7.3, 7.9, 8.6):
# 1. Read the "this name is duplicated" signal written at the S2 position
#    by the Duplicate Token and Induction Heads
# 2. At the END position, write a signal meaning roughly
#    "don't attend to John (the repeated name)"
# Downstream effect:
# - The Name Mover heads' queries are steered away from "John"
#   and toward "Mary", so "Mary" ends up with the higher logit
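One rough way to sanity-check this role: knock these heads out and watch the logit difference shrink. The sketch below zero-ablates them at the END position for simplicity (the original analysis uses the gentler mean-ablation), again assuming a TransformerLens `model`:
# Sketch: zero-ablate the S-Inhibition heads at the END position and
# re-measure the logit difference. Zero-ablation is a crude stand-in for
# the mean-ablation used in the original analysis.
S_INHIBITION_HEADS = [(7, 3), (7, 9), (8, 6)]
prompt = "When John and Mary went to the store, John gave a drink to"

def ablate_heads_at_end(heads):
    def hook(z, hook):
        # z: [batch, seq, n_heads, d_head]; zero the listed heads at the last position
        for layer, head in heads:
            if hook.layer() == layer:
                z[:, -1, head, :] = 0.0
        return z
    return hook

hook_names = sorted({f"blocks.{layer}.attn.hook_z" for layer, _ in S_INHIBITION_HEADS})
ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[(name, ablate_heads_at_end(S_INHIBITION_HEADS)) for name in hook_names],
)
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
print("Ablated logit diff:", (ablated_logits[0, -1, mary] - ablated_logits[0, -1, john]).item())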
Name Mover Heads
The final step: copy the correct name:
# Name Mover heads (e.g., 9.9, 10.0):
# 1. At END position, attend to the IO name (Mary)
# 2. Copy "Mary" to output
# Attention pattern:
#
#   "When John and Mary went to the store, John gave a drink to"
#         ↑                                         ↑
#        Mary                                      END
#         └─────────────────────────────────────────┘
#       Head 9.9: query at END, key at the IO name ("Mary"),
#       which it copies into the residual stream at END
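The same attention check as before works here: measure how much heads 9.9 and 10.0 attend from the END position to the IO name (a sketch, assuming the same TransformerLens `model`):
# Sketch: attention from END to the IO name for the Name Mover heads.
prompt = "When John and Mary went to the store, John gave a drink to"
_, cache = model.run_with_cache(prompt)

str_tokens = model.to_str_tokens(prompt)
io_pos = str_tokens.index(" Mary")   # position of the IO name
end_pos = len(str_tokens) - 1        # the END position (the final " to")

for layer, head in [(9, 9), (10, 0)]:
    pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head]
    print(f"Head {layer}.{head}: attention END -> ' Mary' = {pattern[end_pos, io_pos].item():.2f}")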
Backup Name Movers
Why do models have backups?
# Backup Name Movers (e.g., 10.1, 10.2):
# - Do the same job as the primary Name Movers
# - Take over (their contribution grows) when the primary heads are ablated
# - Possibly a side effect of training with dropout
# This redundancy makes the circuit robust!
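A sketch of how to see this redundancy, reusing `ablate_heads_at_end` from the S-Inhibition sketch above: knock out the primary Name Movers and check how much of the logit difference survives. If the backups compensate, the drop is smaller than the primaries' direct contribution would suggest.
# Sketch: ablate the primary Name Movers at END and re-measure the logit diff.
PRIMARY_NAME_MOVERS = [(9, 9), (10, 0)]
prompt = "When John and Mary went to the store, John gave a drink to"

hook_names = sorted({f"blocks.{layer}.attn.hook_z" for layer, _ in PRIMARY_NAME_MOVERS})
ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[(name, ablate_heads_at_end(PRIMARY_NAME_MOVERS)) for name in hook_names],
)
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
print("Logit diff with primary Name Movers ablated:",
      (ablated_logits[0, -1, mary] - ablated_logits[0, -1, john]).item())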
Negative Name Movers
The strangest heads:
# Negative Name Movers (e.g., 10.7, 11.10):
# - Copy IO name but with NEGATIVE sign
# - Reduce confidence in correct answer
# - Why? Possibly for hedging/calibration
Logit Attribution
Which components directly affect the output?
def direct_logit_attribution(model, prompt, answer_tokens):
    """
    Decompose the logit difference into each head's direct contribution.
    answer_tokens = (correct_token_id, incorrect_token_id).
    Requires model.set_use_attn_result(True) so per-head outputs are cached;
    ignores the final LayerNorm scaling (a common simplification).
    """
    _, cache = model.run_with_cache(prompt)
    # The final residual stream is the sum of every component's output, so a
    # head's direct effect is its output projected onto the
    # "correct minus incorrect" unembedding direction.
    logit_diff_direction = model.W_U[:, answer_tokens[0]] - model.W_U[:, answer_tokens[1]]
    logit_diff_contribs = {}
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            # Head's output at the END (final) position: [batch, d_model]
            head_output = cache[f"blocks.{layer}.attn.hook_result"][:, -1, head, :]
            contrib = head_output @ logit_diff_direction
            logit_diff_contribs[(layer, head)] = contrib.item()
    return logit_diff_contribs
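A usage sketch, assuming a single IOI prompt and a TransformerLens `model`; per-head outputs must be enabled for the `hook_result` activations above to exist:
# Usage sketch: rank heads by their direct contribution to the logit difference.
model.set_use_attn_result(True)   # so blocks.*.attn.hook_result gets cached

prompt = "When John and Mary went to the store, John gave a drink to"
answer_tokens = (model.to_single_token(" Mary"), model.to_single_token(" John"))
contribs = direct_logit_attribution(model, prompt, answer_tokens)

ranked = sorted(contribs.items(), key=lambda kv: kv[1], reverse=True)
print("Most positive heads (expect Name Movers):", ranked[:5])
print("Most negative heads (expect Negative Name Movers):", ranked[-5:])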
Activation Patching
Find which activations matter:
def activation_patching(model, clean, corrupted, hook_point, position, answer_tokens):
    """
    1. Run the clean prompt and cache its activations
    2. Run the corrupted prompt, patching in the clean activation
       at (hook_point, position)
    3. Measure recovery of the clean behavior as the logit difference
       between answer_tokens = (correct_token_id, incorrect_token_id)
    """
    _, clean_cache = model.run_with_cache(clean)
    clean_act = clean_cache[hook_point][:, position]

    def patch_hook(act, hook):
        act[:, position] = clean_act
        return act

    patched_logits = model.run_with_hooks(
        corrupted,
        fwd_hooks=[(hook_point, patch_hook)],
    )
    # Logit difference at the final position, computed directly from the patched logits
    return (patched_logits[0, -1, answer_tokens[0]]
            - patched_logits[0, -1, answer_tokens[1]]).item()
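A usage sketch that sweeps the patch over the residual stream at every layer and token position; the corrupted prompt simply swaps which name is repeated, so the clean answer is no longer preferred. (Recomputing the clean cache inside every call is wasteful but keeps the sketch simple.)
# Usage sketch: patch hook_resid_pre at every (layer, position) and record the
# patched logit difference; large values show where the crucial information lives.
clean = "When John and Mary went to the store, John gave a drink to"
corrupted = "When John and Mary went to the store, Mary gave a drink to"
answer_tokens = (model.to_single_token(" Mary"), model.to_single_token(" John"))

n_positions = model.to_tokens(clean).shape[1]
results = {}
for layer in range(model.cfg.n_layers):
    for position in range(n_positions):
        results[(layer, position)] = activation_patching(
            model, clean, corrupted,
            f"blocks.{layer}.hook_resid_pre", position, answer_tokens,
        )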
Capstone Connection
IOI patterns and sycophancy:
The IOI circuit shows how models track and process entity information.
Sycophancy might use similar patterns:
- "User said X" (duplicate token-like)
- "Don't contradict user's position" (inhibition-like)
- "Copy user's sentiment to output" (name mover-like)
# Hypothesis: Sycophancy circuit might parallel IOI
# - Duplicate Token Heads → detect user opinion
# - S-Inhibition Heads → suppress contradicting info
# - Name Mover Heads → copy agreeable response
🎓 Tyla's Exercise
The Duplicate Token Heads in layer 0 can spot an exact repeat on their own, but the Induction Heads cannot work without the Previous Token Heads. Why does the induction route to detecting the repeated name need at least two layers? Argue that a single attention layer cannot implement this kind of prefix matching.
The circuit has 26 heads across 7 categories. If we ablate all non-circuit heads, what percentage of logit difference would you expect to remain?
Explain how the Name Mover heads Q-compose with the S-Inhibition heads: what do the S-Inhibition heads write at the END position, and how does that change where the Name Movers attend?
💻 Aaliyah's Exercise
Replicate IOI analysis:
def analyze_ioi_circuit(model, prompts, answers):
    """
    1. Compute logit difference across prompts
    2. Find top heads by direct logit attribution
    3. Categorize heads into circuit roles
    4. Verify with activation patching
    """
    pass

def visualize_attention_patterns(model, prompt, layer, head):
    """
    1. Run prompt and cache attention patterns
    2. Create attention heatmap
    3. Annotate with token labels
    4. Identify signature patterns (previous token, duplicate, etc.)
    """
    pass
📚 Maneesha's Reflection
The IOI circuit was found through extensive manual analysis. How might we automate circuit discovery?
Different prompts with the same structure might activate different circuits. What does this say about circuit "universality"?
The paper's three criteria are faithfulness, completeness, and minimality. Why are all three important for interpretability claims?