Indirect Object Identification: A Complete Circuit

The IOI circuit, mapped in GPT-2 small by Wang et al. (2022), is the most thoroughly reverse-engineered circuit in a language model. Let's walk through it end to end.


The IOI Task

Complete sentences like:

"When Mary and John went to the store, John gave a drink to ___"
                                                           ↓
                                                         Mary

The model must:

  1. Identify the two names (Mary, John)
  2. Notice which name is repeated (John)
  3. Predict the non-repeated name (Mary)

Why IOI?

This task is perfect for interpretability:

  1. Clear ground truth: We know the correct answer
  2. Easy to measure: Logit difference between Mary and John
  3. Crisp structure: Grammar is well-defined
  4. Non-trivial: Requires tracking identity across tokens

The Metric: Logit Difference

def logit_difference(model, prompt, correct, incorrect):
    """
    Positive = model prefers correct answer
    Negative = model prefers incorrect answer
    """
    logits = model(prompt)[0, -1]  # Last position

    correct_idx = model.to_single_token(correct)
    incorrect_idx = model.to_single_token(incorrect)

    return logits[correct_idx] - logits[incorrect_idx]

# Example
prompt = "When John and Mary went to the store, John gave a drink to"
diff = logit_difference(model, prompt, " Mary", " John")
# Result: ~6.0 (model strongly prefers Mary)

The IOI Dataset

We create matched prompts:

prompt_format = "When {name1} and {name2} went to the {place}, {name1} gave a {object} to"

# Example (prompt, answer) pairs
ioi_pairs = [
    ("When John and Mary went to the store, John gave a drink to", " Mary"),
    ("When Mary and John went to the store, Mary gave a drink to", " John"),
]

The answer is always the name that is not repeated: the indirect object (IO). The repeated name is the subject (S).
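
A minimal sketch of how such matched pairs might be generated from the template above. The name/place/object pools and the make_ioi_pair helper are illustrative, not the original IOI dataset code:

import random

NAMES = ["John", "Mary", "Tom", "Anna"]
PLACES = ["store", "park"]
OBJECTS = ["drink", "book"]

def make_ioi_pair(rng=random):
    # name1 is repeated (the subject S), so name2 (the IO) is the answer
    name1, name2 = rng.sample(NAMES, 2)
    place, obj = rng.choice(PLACES), rng.choice(OBJECTS)
    prompt = prompt_format.format(name1=name1, name2=name2, place=place, object=obj)
    flipped = prompt_format.format(name1=name2, name2=name1, place=place, object=obj)
    # Each prompt comes with its mirror, so positional effects cancel out
    return (prompt, " " + name2), (flipped, " " + name1)

# e.g. make_ioi_pair() might return:
# (("When Tom and Anna went to the park, Tom gave a book to", " Anna"),
#  ("When Anna and Tom went to the park, Anna gave a book to", " Tom"))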


The Circuit Overview

┌────────────────────────────────────────────────────────┐
│                    IOI CIRCUIT                          │
├────────────────────────────────────────────────────────┤
│                                                        │
│  1. PREVIOUS TOKEN HEADS (Layers 2-4)                  │
│     "What token came before me?"                       │
│     Position i copies info to position i+1             │
│                                                        │
│  2. DUPLICATE TOKEN HEADS (Layers 0-3)                 │
│     "Is my token repeated elsewhere?"                  │
│     Find the earlier occurrence of the repeated name   │
│     (Induction heads in Layers 5-6 do the same job,    │
│      composing with the Previous Token heads)          │
│                                                        │
│  3. S-INHIBITION HEADS (Layers 7-8)                    │
│     "Which name is repeated?"                          │
│     Tell the END position which name NOT to copy       │
│                                                        │
│  4. NAME MOVER HEADS (Layers 9-10)                     │
│     "Copy the non-repeated name to the output"         │
│     Attend to the IO token, copy it to END             │
│                                                        │
└────────────────────────────────────────────────────────┘

Duplicate Token Heads

These heads identify repeated tokens:

# Head 0.1 has a distinctive pattern:
# At position of second "John", attends to first "John"

# Attention pattern visualization:
# "When John and Mary went to the store, John gave..."
#       ↑                                ↑
#       └────────────────────────────────┘
#      Head 0.1 attends from the second "John" back to the first
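
A sketch of how to check this pattern yourself, assuming a TransformerLens HookedTransformer (the model name, prompt, and head index are the ones used in this section):

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

prompt = "When John and Mary went to the store, John gave a drink to"
str_tokens = model.to_str_tokens(prompt)
_, cache = model.run_with_cache(prompt)

layer, head = 0, 1  # duplicate token head 0.1
# hook_pattern has shape [batch, head, query_pos, key_pos]
pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head]

# Find both occurrences of " John" (note the leading space in the token)
john_positions = [i for i, tok in enumerate(str_tokens) if tok == " John"]
first_john, second_john = john_positions

print(f"Attention from the second 'John' back to the first: "
      f"{pattern[second_john, first_john].item():.2f}")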

S-Inhibition Heads

These heads tell the END position which name is repeated, so that it can be avoided:

# S-Inhibition heads (e.g., 7.3, 7.9, 8.6, 8.10):
# 1. At the END position, attend to the second occurrence of the subject (S2)
# 2. Read the duplication signal written there by the Duplicate Token and
#    Induction heads, and move it to END: "the subject S is the repeated name"

# Downstream effect:
# - Modifies the queries of the Name Mover heads (Q-composition)
# - So the Name Movers attend to the IO (Mary) rather than to S (John)
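
A rough causal check, as a sketch: zero-ablate the S-Inhibition heads' outputs and see how far the logit difference falls. The paper uses mean-ablation and path patching; zero-ablation is a cruder stand-in. Assumes the model loaded in the sketch above.

S_INHIBITION_HEADS = [(7, 3), (7, 9), (8, 6), (8, 10)]

prompt = "When John and Mary went to the store, John gave a drink to"
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")

def make_zero_head_hook(head):
    def hook_fn(z, hook):
        # z has shape [batch, pos, head, d_head]; zero out this head's output
        z[:, :, head, :] = 0.0
        return z
    return hook_fn

fwd_hooks = [(f"blocks.{layer}.attn.hook_z", make_zero_head_hook(head))
             for layer, head in S_INHIBITION_HEADS]

clean_logits = model(prompt)[0, -1]
ablated_logits = model.run_with_hooks(prompt, fwd_hooks=fwd_hooks)[0, -1]

print("clean logit diff:  ", (clean_logits[mary] - clean_logits[john]).item())
print("ablated logit diff:", (ablated_logits[mary] - ablated_logits[john]).item())
# If these heads matter, the ablated logit diff should be noticeably smaller.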

Name Mover Heads

The final step: copy the correct name:

# Name Mover heads (e.g., 9.6, 9.9, 10.0):
# 1. At the END position, attend to the IO name (Mary)
# 2. Copy the "Mary" direction into the residual stream, boosting its logit

# Attention pattern:
# "When John and Mary went to the store, John gave a drink to"
#                  ↑                                         ↑
#                Mary                                       END
#                  └─────────────────────────────────────────┘
#                    Head 9.9 attends from END back to "Mary"
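
A sketch for measuring this directly, under the same assumptions as before (model is the HookedTransformer loaded earlier; the head list is from the paper):

prompt = "When John and Mary went to the store, John gave a drink to"
str_tokens = model.to_str_tokens(prompt)
_, cache = model.run_with_cache(prompt)

io_pos = str_tokens.index(" Mary")                              # the IO name
s2_pos = len(str_tokens) - 1 - str_tokens[::-1].index(" John")  # second mention of S
end_pos = len(str_tokens) - 1                                   # the final " to"

for layer, head in [(9, 6), (9, 9), (10, 0)]:   # name mover heads
    pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head]
    print(f"head {layer}.{head}: "
          f"END->IO = {pattern[end_pos, io_pos].item():.2f}, "
          f"END->S2 = {pattern[end_pos, s2_pos].item():.2f}")
# Name movers should put much more attention on the IO than on S2.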

Backup Name Movers

Why do models have backups?

# Backup Name Mover heads (e.g., 10.1, 10.2):
# - Do the same job as the primary Name Movers
# - Become much more active when the primary heads are ablated
# - One hypothesis: a side effect of dropout during training

# This redundancy makes the circuit robust!
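
A quick sketch of that behaviour, reusing make_zero_head_hook and the prompt from the S-Inhibition sketch above (zero-ablation again, rather than the paper's mean-ablation):

PRIMARY_NAME_MOVERS = [(9, 6), (9, 9), (10, 0)]

fwd_hooks = [(f"blocks.{layer}.attn.hook_z", make_zero_head_hook(head))
             for layer, head in PRIMARY_NAME_MOVERS]

mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
clean_logits = model(prompt)[0, -1]
ablated_logits = model.run_with_hooks(prompt, fwd_hooks=fwd_hooks)[0, -1]

print("clean:  ", (clean_logits[mary] - clean_logits[john]).item())
print("ablated:", (ablated_logits[mary] - ablated_logits[john]).item())
# If the backup heads compensate, the ablated logit diff stays well above zero.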

Negative Name Movers

The strangest heads:

# Negative Name Movers (e.g., 10.7, 11.10):
# - Copy IO name but with NEGATIVE sign
# - Reduce confidence in correct answer
# - Why? Possibly for hedging/calibration

Logit Attribution

Which components directly affect the output?

def direct_logit_attribution(model, prompt, answer_tokens):
    """
    Decompose the logit difference into per-head contributions.
    Requires model.set_use_attn_result(True) so that hook_result is cached.
    Note: ignores the final LayerNorm scaling (a common simplification).
    """
    _, cache = model.run_with_cache(prompt)

    # The final residual stream is the sum of every component's output, so a
    # head's direct effect is its output projected onto the unembedding
    # direction (correct answer minus incorrect answer).
    logit_diff_direction = (model.W_U[:, answer_tokens[0]] -
                            model.W_U[:, answer_tokens[1]])

    logit_diff_contribs = {}
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            # Head's output at the END position: [batch, d_model]
            head_output = cache[f"blocks.{layer}.attn.hook_result"][:, -1, head, :]
            contrib = head_output @ logit_diff_direction
            logit_diff_contribs[(layer, head)] = contrib.item()

    return logit_diff_contribs
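
A usage sketch: set_use_attn_result must be enabled so hook_result appears in the cache; the prompt and answers are the ones used throughout this section.

model.set_use_attn_result(True)  # needed for hook_result to be cached

prompt = "When John and Mary went to the store, John gave a drink to"
answer_tokens = (model.to_single_token(" Mary"), model.to_single_token(" John"))

contribs = direct_logit_attribution(model, prompt, answer_tokens)
top = sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:10]
for (layer, head), value in top:
    print(f"head {layer}.{head}: {value:+.2f}")
# Name movers (e.g., 9.9) should appear strongly positive,
# negative name movers (e.g., 10.7) strongly negative.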

Activation Patching

Find which activations matter:

def activation_patching(model, clean, corrupted, hook_point, position, answer_tokens):
    """
    1. Cache activations from the clean prompt
    2. Run the corrupted prompt, patching in the clean activation
       at (hook_point, position)
    3. Measure how much of the clean behaviour is recovered
    """
    _, clean_cache = model.run_with_cache(clean)
    clean_act = clean_cache[hook_point][:, position]

    def patch_hook(act, hook):
        act[:, position] = clean_act
        return act

    patched_logits = model.run_with_hooks(
        corrupted,
        fwd_hooks=[(hook_point, patch_hook)]
    )[0, -1]  # logits at the END position

    correct_idx, incorrect_idx = answer_tokens
    return (patched_logits[correct_idx] - patched_logits[incorrect_idx]).item()
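
A usage sketch: sweep the residual stream across layers at the END position, using a corrupted prompt in which the other name is repeated (one simple choice of corruption; the hook names are standard TransformerLens conventions):

clean = "When John and Mary went to the store, John gave a drink to"
corrupted = "When John and Mary went to the store, Mary gave a drink to"
answer_tokens = (model.to_single_token(" Mary"), model.to_single_token(" John"))

for layer in range(model.cfg.n_layers):
    patched_diff = activation_patching(
        model, clean, corrupted,
        hook_point=f"blocks.{layer}.hook_resid_pre",
        position=-1,                    # the END position
        answer_tokens=answer_tokens,
    )
    print(f"layer {layer:2d}: patched logit diff = {patched_diff:+.2f}")
# Layers where patching recovers a large positive logit diff are where the
# "predict Mary, not John" information has already reached the END position.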

Capstone Connection

IOI patterns and sycophancy:

The IOI circuit shows how models track and process entity information.

Sycophancy might use similar patterns:

# Hypothesis: Sycophancy circuit might parallel IOI
# - Duplicate Token Heads → detect user opinion
# - S-Inhibition Heads → suppress contradicting info
# - Name Mover Heads → copy agreeable response

🎓 Tyla's Exercise

  1. The induction-style route to duplicate detection (Previous Token heads → Induction heads) needs at least two layers, while the Duplicate Token heads manage with one. Explain what each route can and cannot compute with a single attention layer.

  2. The circuit has 26 heads across 7 categories. If we ablate all non-circuit heads, what percentage of logit difference would you expect to remain?

  3. Explain how the Name Mover heads compose with the S-Inhibition heads (Q-composition), and how the S-Inhibition heads in turn compose with the Duplicate Token and Induction heads.


💻 Aaliyah's Exercise

Replicate IOI analysis:

def analyze_ioi_circuit(model, prompts, answers):
    """
    1. Compute logit difference across prompts
    2. Find top heads by direct logit attribution
    3. Categorize heads into circuit roles
    4. Verify with activation patching
    """
    pass

def visualize_attention_patterns(model, prompt, layer, head):
    """
    1. Run prompt and cache attention patterns
    2. Create attention heatmap
    3. Annotate with token labels
    4. Identify signature patterns (previous token, duplicate, etc.)
    """
    pass

📚 Maneesha's Reflection

  1. The IOI circuit was found through extensive manual analysis. How might we automate circuit discovery?

  2. Different prompts with the same structure might activate different circuits. What does this say about circuit "universality"?

  3. The paper's three criteria are faithfulness, completeness, and minimality. Why are all three important for interpretability claims?