Path Patching: Tracing Information Flow

Activation patching tells us WHERE information matters. Path patching tells us HOW it flows.


The Limitation of Activation Patching

Activation patching shows which components matter, but not the route information takes:

When we patch Layer 5's residual stream:
- Is Layer 5 computing something important?
- Or just passing through important info from earlier?

We can't tell!

Path patching solves this by examining specific paths through the model.
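A toy illustration of the ambiguity (everything here is invented for the example): two models where layer 1 either relays layer 0's result or genuinely computes, yet activation patching at layer 1 produces the identical, fully-flipped normalized effect in both.

```python
def run(x, layer1, patched_h1=None):
    h0 = 2 * x                                 # layer 0 computes
    h1 = layer1(h0) if patched_h1 is None else patched_h1
    return h1                                  # output reads layer 1

passthrough = lambda h: h       # layer 1 just relays layer 0's result
compute = lambda h: 3 * h       # layer 1 does real work

effects = []
for layer1 in (passthrough, compute):
    clean = run(1.0, layer1)
    corrupted = run(-1.0, layer1)
    # Activation-patch layer 1: insert the corrupted h1 into the clean run
    patched = run(1.0, layer1, patched_h1=layer1(2 * -1.0))
    effects.append((patched - clean) / (corrupted - clean))

print(effects)  # [1.0, 1.0] — patching can't tell the two models apart
```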


What is a Path?

A path is a specific route information takes:

Attention Head 0.1 → Residual Stream → Attention Head 7.3 → Output

This is different from:
Attention Head 0.1 → Residual Stream → MLP 3 → Attention Head 7.3 → Output

Each path can carry different information.
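Paths also multiply quickly. Under a deliberately simplified model where each of a transformer's L layers contributes an attention block and an MLP block that information can either pass through or skip, the number of distinct input-to-output paths is 2^(2L):

```python
def num_paths(n_layers, blocks_per_layer=2):
    """Count input→output paths when each block (attention, MLP)
    can independently be on or off a given path."""
    return 2 ** (blocks_per_layer * n_layers)

print(num_paths(1))   # 4 paths even for a single layer
print(num_paths(12))  # 16,777,216 for a GPT-2-small-sized model
```

This is why path patching is usually applied to a small set of candidate heads rather than exhaustively.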


The Path Patching Algorithm

def path_patching(model, clean, corrupted, sender, receiver):
    """
    Measure the effect of the sender → receiver edge:
    1. Run the corrupted prompt; save the sender's output.
    2. Run the clean prompt, but replace the part of the receiver's
       input that comes from the sender with the corrupted version.
       Everything else, including the sender's direct path to the
       output, stays clean.
    3. Measure the effect on the output.
    (`patch_from_sender` and `measure_effect` are left abstract:
    isolating the sender's contribution requires understanding the
    computational graph.)
    """
    # Step 1: run corrupted, save sender's output
    corrupted_sender_output = None

    def save_sender(act, hook):
        nonlocal corrupted_sender_output
        corrupted_sender_output = act.clone()
        return act

    model.run_with_hooks(corrupted, fwd_hooks=[(sender, save_sender)])

    # Step 2: run clean, but patch the sender → receiver path
    def patch_receiver_input(act, hook):
        # Only patch the part of the input coming from the sender
        return patch_from_sender(act, corrupted_sender_output)

    patched_logits = model.run_with_hooks(
        clean,
        fwd_hooks=[(receiver, patch_receiver_input)]
    )

    # Step 3: compare patched logits against the clean baseline
    return measure_effect(patched_logits)
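Because `patch_from_sender` and `measure_effect` depend on the model's graph, here is a self-contained numpy sketch of the same three-pass logic on a toy two-head linear model (all weights and names invented for illustration). The key property: the patch changes only what flows along the A → B edge, while A's direct contribution to the output stays clean.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_A = rng.normal(size=(d, d))  # "head A" (the sender)
W_B = rng.normal(size=(d, d))  # "head B" (the receiver)

def forward(x, a_out_seen_by_b=None):
    a_out = x @ W_A
    resid = x + a_out                      # A's direct path to the output
    b_in = resid if a_out_seen_by_b is None else x + a_out_seen_by_b
    return resid + b_in @ W_B              # B reads (possibly patched) input

x_clean = rng.normal(size=d)
x_corr = rng.normal(size=d)

# Passes 1 & 2: clean and corrupted runs; cache the sender's outputs
a_clean, a_corr = x_clean @ W_A, x_corr @ W_A

# Pass 3: clean run, but B sees the corrupted A output (A→B edge only)
patched = forward(x_clean, a_out_seen_by_b=a_corr)
clean = forward(x_clean)

# The difference is exactly the corrupted signal routed through B
assert np.allclose(patched - clean, (a_corr - a_clean) @ W_B)
```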

Path Patching in Practice

For attention heads, the path goes through Q, K, V:

def path_patch_attention(model, clean, corrupted, sender_head, receiver_head):
    """
    Patch from sender → each of receiver's {Q, K, V} inputs.
    sender_head and receiver_head are (layer, head) tuples.
    """
    sender_layer, sender_idx = sender_head
    receiver_layer, receiver_idx = receiver_head

    _, clean_cache = model.run_with_cache(clean)
    results = {}

    for component in ['q', 'k', 'v']:
        # Get sender's output and the receiver's component input
        sender_output = clean_cache[f"blocks.{sender_layer}.attn.hook_result"][:, :, sender_idx]
        receiver_input = clean_cache[f"blocks.{receiver_layer}.attn.hook_{component}"][:, :, receiver_idx]

        # Patch and measure (measure_path_effect left abstract)
        effect = measure_path_effect(sender_output, receiver_input)
        results[component] = effect

    return results

The IOI Circuit Paths

Path patching reveals the circuit structure:

Duplicate Token (0.1) ──┐
                        ├── via V composition ──► S-Inhibition (7.3)
Previous Token (0.0) ───┘                              │
                                               via Q composition
                                                       ▼
                                               Name Mover (9.9)
                                                       │
                                                       ▼
                                                    OUTPUT

K-Composition vs V-Composition

Two ways heads can communicate:

K-Composition: "Where to look"

# Head A's output affects Head B's keys
# This changes what B attends to
head_b_keys = head_a_output @ W_K_b
# Head A says "look at position X"

V-Composition: "What to copy"

# Head A's output affects Head B's values
# This changes what B writes
head_b_values = head_a_output @ W_V_b
# Head A says "copy this information"
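The K-composition case can be made concrete with a small numpy sketch (toy matrices, invented for illustration): if head A writes into a subspace orthogonal to the one B's W_K reads from, A cannot move B's keys at all, and therefore cannot change where B attends.

```python
import numpy as np

d_model, d_head = 6, 2
# B's key matrix reads only the first two residual dimensions
W_K_b = np.zeros((d_model, d_head))
W_K_b[:2, :] = np.eye(2)

resid = np.ones(d_model)
keys_before = resid @ W_K_b

# Head A writes to dims 2-3: orthogonal to W_K_b, so no K-composition
a_out_ortho = np.array([0., 0., 1., 1., 0., 0.])
assert np.allclose((resid + a_out_ortho) @ W_K_b, keys_before)

# Head A writes to dims 0-1: lands in W_K_b's read subspace
a_out_k = np.array([1., 1., 0., 0., 0., 0.])
assert not np.allclose((resid + a_out_k) @ W_K_b, keys_before)
```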

Detecting Composition Type

def classify_composition(model, sender, receiver):
    """
    Score whether sender→receiver looks like Q, K, or V composition.
    sender and receiver are (layer, head) tuples.
    """
    W_O = model.W_O[sender[0], sender[1]]  # (d_head, d_model)

    # Project the sender's output through each of the receiver's input
    # matrices. W_Q, W_K, W_V are (d_model, d_head): no transpose needed.
    q_overlap = W_O @ model.W_Q[receiver[0], receiver[1]]
    k_overlap = W_O @ model.W_K[receiver[0], receiver[1]]
    v_overlap = W_O @ model.W_V[receiver[0], receiver[1]]

    return {
        "Q": q_overlap.norm(),
        "K": k_overlap.norm(),
        "V": v_overlap.norm(),
    }
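Raw overlap norms aren't comparable across heads of different scales, so one common normalization divides by the norms of the two factors. A numpy sketch with random toy matrices (dimensions invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 8
W_O = rng.normal(size=(d_head, d_model))   # sender's output matrix
W_K = rng.normal(size=(d_model, d_head))   # receiver's key matrix

# Frobenius norms are submultiplicative, so this score lies in [0, 1]
score = np.linalg.norm(W_O @ W_K) / (np.linalg.norm(W_O) * np.linalg.norm(W_K))
assert 0.0 <= score <= 1.0

# Random matrices compose only weakly; aligned ones compose strongly
aligned = np.linalg.norm(W_O @ W_O.T) / np.linalg.norm(W_O) ** 2
assert score < aligned <= 1.0
```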

Virtual Weights

We can compute the "effective" weight matrix for a path:

def virtual_weight(model, sender, receiver):
    """
    W_virtual = W_O_sender @ W_K_receiver
    This shows what sender writes that receiver reads
    """
    W_O = model.W_O[sender[0], sender[1]]  # (d_head, d_model)
    W_K = model.W_K[receiver[0], receiver[1]]  # (d_model, d_head)

    # Virtual weight shows composition
    W_virtual = W_O @ W_K  # (d_head_sender, d_head_receiver)

    return W_virtual
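The virtual weight also makes the bandwidth limit concrete (this is the point of exercise 2 below): even though sender and receiver communicate through a d_model-dimensional residual stream, the channel between them has rank at most d_head. A numpy check with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 8
W_O = rng.normal(size=(d_head, d_model))  # sender writes rank-≤8 updates
W_K = rng.normal(size=(d_model, d_head))  # receiver reads rank-≤8 slices

W_virtual = W_O @ W_K                     # (d_head, d_head)
# The 64-dim residual stream doesn't help: at most d_head directions survive
assert np.linalg.matrix_rank(W_virtual) <= d_head
```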

Positional Path Patching

Information flows between positions too:

def positional_path_patching(model, clean, corrupted, hook_name, source_pos, dest_pos):
    """
    Patch information from the source position (in the clean run)
    into the destination position (in the corrupted run).
    This tells us: does position X's info flow to position Y?
    """
    _, clean_cache = model.run_with_cache(clean)

    def patch_position(act, hook):
        act[:, dest_pos] = clean_cache[hook.name][:, source_pos]
        return act

    # hook_name picks where to patch, e.g. a residual stream hook
    return model.run_with_hooks(
        corrupted,
        fwd_hooks=[(hook_name, patch_position)]
    )

The Full IOI Path Picture

S1 ("John") position:
  └── Duplicate Token Heads (0.1, 0.10) detect "John" appears again
       └── Write to residual stream at S2 position

S2 ("John") position:
  └── Receives duplicate signal
       └── S-Inhibition Heads (7.3, 7.9) read this
            └── Write "suppress John" to END position

IO ("Mary") position:
  └── Name Mover Heads (9.9, 10.0) read from here
       └── Copy "Mary" info to END position

END position:
  └── Receives "suppress John" + "predict Mary"
       └── Output: "Mary" wins!

Capstone Connection

Path patching for sycophancy:

def trace_sycophancy_paths(model, honest_prompt, sycophantic_prompt):
    """
    1. Find positions where the user's opinion is encoded
    2. Path patch to find how this info reaches the output
    3. Identify heads that transmit the "agree with user" signal
    (find_sentiment_positions, path_patch, and threshold are
    placeholders to be implemented for the capstone)
    """
    # Where does user sentiment enter?
    sentiment_positions = find_sentiment_positions(sycophantic_prompt)

    # Which heads read from these positions?
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            effect = path_patch(
                model, honest_prompt, sycophantic_prompt,
                source_pos=sentiment_positions,
                receiver_head=(layer, head),
                destination_pos=-1  # END position
            )
            if effect > threshold:
                print(f"Head {layer}.{head} transmits user sentiment!")

🎓 Tyla's Exercise

  1. Prove that if Head A and Head B have no K-composition (orthogonal output/key spaces), Head A cannot affect what Head B attends to.

  2. Virtual weights combine O and K (or V) matrices. What's the rank of this product? How does this limit composition?

  3. Path patching requires 3 forward passes (clean, corrupted, patched). Why can't we do it in 2?


💻 Aaliyah's Exercise

Implement path patching:

def full_path_patching(model, clean, corrupted, metric_fn):
    """
    1. For each pair of heads (sender, receiver):
       - Compute path patching effect
       - Classify as Q, K, or V composition
    2. Build graph of significant paths
    3. Visualize the circuit
    """
    pass

def verify_circuit_path(model, path, prompts):
    """
    Given a hypothesized path [head1, head2, head3, ...]:
    1. Ablate all heads NOT in path
    2. Measure if behavior survives
    3. Ablate each head IN path individually
    4. Verify each is necessary
    """
    pass

📚 Maneesha's Reflection

  1. Path patching reveals causal structure, but the space of paths is exponential. How do we search efficiently?

  2. When two paths give similar effects, they might be redundant or synergistic. How would you distinguish these cases?

  3. The IOI paper found 26 heads in the circuit. Is this the "true" circuit, or just one sufficient subset?