Path Patching: Tracing Information Flow
Activation patching tells us WHERE information matters. Path patching tells us HOW it flows.
The Limitation of Activation Patching
Activation patching tells us that a location matters, but not what role it plays. When we patch Layer 5's residual stream and the output changes, two explanations fit:
- Layer 5 is computing something important, or
- Layer 5 is merely passing through important information computed earlier.
Activation patching alone can't distinguish them.
Path patching solves this by examining specific paths through the model.
What is a Path?
A path is a specific route information takes:
Attention Head 0.1 → Residual Stream → Attention Head 7.3 → Output
This is different from:
Attention Head 0.1 → Residual Stream → MLP 3 → Attention Head 7.3 → Output
Each path can carry different information.
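A quick back-of-the-envelope count shows why reasoning at the path level is harder than at the activation level. In a residual architecture, each component can either be on a path or be skipped via the residual stream, so a rough count (a simplification that ignores Q/K/V fan-out and per-position paths) doubles with every component:

```python
# Rough count of input-to-output paths through a residual network:
# each of n components is either on the path or skipped via the
# residual stream, so the count doubles per component.
# (Simplification: ignores Q/K/V fan-out and per-position paths.)
def n_paths(n_components):
    return 2 ** n_components

print([n_paths(n) for n in (1, 2, 12, 24)])
```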
The Path Patching Algorithm
```python
def path_patching(model, clean, corrupted, sender, receiver):
    """
    Patch only the sender -> receiver path:
    1. Run the clean prompt and cache all activations.
    2. Run the corrupted prompt and save the sender's output.
    3. Run the clean prompt again; at the receiver's input, swap the
       sender's clean contribution for its corrupted one, leaving
       every other input to the receiver clean.
    4. Measure the effect on the output.
    """
    _, clean_cache = model.run_with_cache(clean)

    # Step 2: run corrupted, save the sender's output
    corrupted_sender_output = None

    def save_sender(act, hook):
        nonlocal corrupted_sender_output
        corrupted_sender_output = act.clone()
        return act

    model.run_with_hooks(corrupted, fwd_hooks=[(sender, save_sender)])

    # Step 3: run clean, patching only the sender -> receiver edge.
    # Because components add into the residual stream, swapping the
    # sender's contribution is a subtraction plus an addition
    # (this assumes sender's output and receiver's input share a shape,
    # e.g. both are residual-stream sized).
    def patch_receiver_input(act, hook):
        return act - clean_cache[sender] + corrupted_sender_output

    patched_logits = model.run_with_hooks(
        clean,
        fwd_hooks=[(receiver, patch_receiver_input)],
    )
    return measure_effect(patched_logits)  # e.g. logit difference
```
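The swap in step 3 is easiest to see on a toy additive "model" (a hand-built sketch, not a transformer): two components write into a shared residual stream, and path patching corrupts only one component's contribution to the readout.

```python
# Toy demo of the path-patching arithmetic: two components write into a
# shared residual stream; we corrupt only A's contribution to the readout.
def component_a(x):      # the "sender"
    return 2 * x

def component_b(x):      # a bystander component, kept clean
    return 3 * x

def readout(resid):      # the "receiver", reading the residual stream
    return resid + 1

def forward(x):
    resid = x + component_a(x) + component_b(x)
    return readout(resid)

clean, corrupted = 1, 10

# Path patch A -> readout: start from the clean residual stream,
# remove A's clean contribution, add A's corrupted contribution.
clean_resid = clean + component_a(clean) + component_b(clean)
patched_resid = clean_resid - component_a(clean) + component_a(corrupted)
patched_out = readout(patched_resid)

print(forward(clean))    # fully clean output
print(patched_out)       # only the A -> readout path is corrupted
```

Note that the direct path (the input itself) and B's path stay clean; only the single edge from A to the readout is corrupted.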
Path Patching in Practice
For attention heads, the path goes through Q, K, V:
```python
def path_patch_attention(model, clean, corrupted, sender_head, receiver_head):
    """
    Patch the path from sender_head into each of receiver_head's
    Q, K, and V inputs separately.
    """
    sender_layer, sender_idx = sender_head
    receiver_layer, receiver_idx = receiver_head
    _, clean_cache = model.run_with_cache(clean)

    results = {}
    for component in ['q', 'k', 'v']:
        # Sender's per-head contribution to the residual stream
        # (requires model.set_use_attn_result(True) in TransformerLens)
        sender_output = clean_cache[
            f"blocks.{sender_layer}.attn.hook_result"][:, :, sender_idx]
        receiver_input = clean_cache[
            f"blocks.{receiver_layer}.attn.hook_{component}"][:, :, receiver_idx]
        # Patch and measure (measure_path_effect: your chosen metric)
        results[component] = measure_path_effect(sender_output, receiver_input)
    return results
```
The IOI Circuit Paths
Path patching reveals the circuit structure:
```
Duplicate Token (0.1) ──┐
                        │
Previous Token (0.0) ───┤
                        │ via V-composition
                        ▼
             S-Inhibition (7.3)
                        │ via Q-composition
                        ▼
              Name Mover (9.9)
                        │
                        ▼
                     OUTPUT
```
K-Composition vs V-Composition
Heads can communicate through any of the receiver's three inputs; the two most prominent are:

K-Composition: "Where to look"

```python
# Head A's output (via the residual stream) feeds Head B's keys,
# which changes what B attends to.
head_b_keys = head_a_output @ W_K_b
# Head A says: "look at position X"
```

V-Composition: "What to copy"

```python
# Head A's output feeds Head B's values,
# which changes what B writes when it attends.
head_b_values = head_a_output @ W_V_b
# Head A says: "copy this information"
```

Q-composition is analogous: Head A's output feeds Head B's queries, changing where B looks from a given position.
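A minimal numerical sketch of K-composition, using toy vectors rather than real model weights (the identity key map and fixed query are illustrative assumptions): head A's write at one position boosts head B's key there, pulling B's attention toward that position.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
rng = np.random.default_rng(0)
resid = rng.normal(size=(3, d)) * 0.1   # residual stream at 3 positions
W_K_b = np.eye(d)                        # head B's key map (toy: identity)
query = np.ones(d)                       # head B's query (toy: fixed)

# Head A writes a large vector into the residual stream at position 1
resid_with_a = resid.copy()
resid_with_a[1] += 5 * np.ones(d)

attn_without = softmax((resid @ W_K_b) @ query)
attn_with = softmax((resid_with_a @ W_K_b) @ query)
print(attn_without.round(3))   # roughly uniform
print(attn_with.round(3))      # concentrated on position 1
```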
Detecting Composition Type
```python
def classify_composition(model, sender, receiver):
    """
    Score whether sender -> receiver looks like Q, K, or V composition.
    """
    sender_output = model.W_O[sender[0], sender[1]]  # (d_head, d_model)

    # Each product maps the sender's head space into the receiver's head
    # space: (d_head, d_model) @ (d_model, d_head) -> (d_head, d_head)
    q_overlap = sender_output @ model.W_Q[receiver[0], receiver[1]]
    k_overlap = sender_output @ model.W_K[receiver[0], receiver[1]]
    v_overlap = sender_output @ model.W_V[receiver[0], receiver[1]]

    return {
        "Q": q_overlap.norm(),
        "K": k_overlap.norm(),
        "V": v_overlap.norm(),
    }
```
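The raw norms above aren't comparable across heads of different scales. The composition score from "A Mathematical Framework for Transformer Circuits" normalizes by the factors' norms. A sketch with random stand-in matrices (not real model weights):

```python
import numpy as np

def composition_score(W_O, W_in):
    # ||W_O @ W_in||_F / (||W_O||_F * ||W_in||_F), always in [0, 1]
    return np.linalg.norm(W_O @ W_in) / (
        np.linalg.norm(W_O) * np.linalg.norm(W_in))

rng = np.random.default_rng(0)
d_head, d_model = 4, 64
W_O = rng.normal(size=(d_head, d_model))   # sender: head space -> residual
W_K_random = rng.normal(size=(d_model, d_head))
W_K_aligned = W_O.T                        # reads exactly what sender writes

print(composition_score(W_O, W_K_random))  # low: near-orthogonal subspaces
print(composition_score(W_O, W_K_aligned)) # high: aligned subspaces
```

Random matrices in a high-dimensional residual stream compose weakly, so a score well above this baseline is evidence of deliberate composition.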
Virtual Weights
We can compute the "effective" weight matrix for a path:
```python
def virtual_weight(model, sender, receiver):
    """
    W_virtual = W_O_sender @ W_K_receiver:
    what the sender writes that the receiver's keys read.
    """
    W_O = model.W_O[sender[0], sender[1]]      # (d_head, d_model)
    W_K = model.W_K[receiver[0], receiver[1]]  # (d_model, d_head)
    W_virtual = W_O @ W_K                      # (d_head_sender, d_head_receiver)
    return W_virtual
```
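Virtual weights are low-rank by construction: every factor passes through a head's d_head-dimensional space, so the product's rank is at most d_head. This can be checked numerically, here on the residual-to-residual OV map with random stand-in matrices (not real weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 4
W_V = rng.normal(size=(d_model, d_head))   # residual -> head space
W_O = rng.normal(size=(d_head, d_model))   # head space -> residual

# The OV circuit maps residual -> residual but factors through d_head dims,
# so this (d_model, d_model) matrix has rank at most d_head.
W_OV = W_V @ W_O
print(W_OV.shape, np.linalg.matrix_rank(W_OV))
```

The same bottleneck applies to the rank question in the exercises: rank(W_O @ W_K) ≤ d_head, which sharply limits how much information one head can route to another.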
Positional Path Patching
Information flows between positions too:
```python
def positional_path_patching(model, clean, corrupted, source_pos, dest_pos,
                             hook_name):
    """
    Patch clean information from source_pos into dest_pos during the
    corrupted run, at the activation named by hook_name.
    """
    _, clean_cache = model.run_with_cache(clean)

    def patch_position(act, hook):
        act[:, dest_pos] = clean_cache[hook.name][:, source_pos]
        return act

    # This tells us: does position source_pos's info flow to dest_pos?
    return model.run_with_hooks(
        corrupted, fwd_hooks=[(hook_name, patch_position)]
    )
```
The Full IOI Path Picture
```
S1 ("John") position:
└── Duplicate Token Heads (0.1, 0.10) detect that "John" appears again
    └── Write to the residual stream at the S2 position

S2 ("John") position:
└── Receives the duplicate signal
    └── S-Inhibition Heads (7.3, 7.9) read this
        └── Write "suppress John" to the END position

IO ("Mary") position:
└── Name Mover Heads (9.9, 10.0) read from here
    └── Copy "Mary" info to the END position

END position:
└── Receives "suppress John" + "predict Mary"
    └── Output: "Mary" wins!
```
Capstone Connection
Path patching for sycophancy:
```python
def trace_sycophancy_paths(model, honest_prompt, sycophantic_prompt,
                           threshold=0.1):
    """
    1. Find positions where the user's opinion is encoded.
    2. Path patch to find how this info reaches the output.
    3. Identify heads that transmit the "agree with user" signal.
    """
    # Where does user sentiment enter?
    # (find_sentiment_positions: a helper you define for your prompts)
    sentiment_positions = find_sentiment_positions(sycophantic_prompt)

    # Which heads read from these positions?
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            effect = path_patch(
                source_pos=sentiment_positions,
                receiver_head=(layer, head),
                destination_pos=-1,  # END position
            )
            if effect > threshold:
                print(f"Head {layer}.{head} transmits user sentiment!")
```
🎓 Tyla's Exercise
1. Prove that if Head A and Head B have no K-composition (orthogonal output/key subspaces), Head A cannot affect what Head B attends to.
2. Virtual weights combine O and K (or V) matrices. What is the rank of this product? How does this limit composition?
3. Path patching requires three forward passes (clean, corrupted, patched). Why can't we do it in two?
💻 Aaliyah's Exercise
Implement path patching:
```python
def full_path_patching(model, clean, corrupted, metric_fn):
    """
    1. For each pair of heads (sender, receiver):
       - Compute the path patching effect
       - Classify it as Q, K, or V composition
    2. Build a graph of significant paths
    3. Visualize the circuit
    """
    pass


def verify_circuit_path(model, path, prompts):
    """
    Given a hypothesized path [head1, head2, head3, ...]:
    1. Ablate all heads NOT in the path
    2. Measure whether the behavior survives
    3. Ablate each head IN the path individually
    4. Verify each is necessary
    """
    pass
```
📚 Maneesha's Reflection
1. Path patching reveals causal structure, but the space of paths is exponential. How do we search it efficiently?
2. When two paths give similar effects, they might be redundant or synergistic. How would you distinguish these cases?
3. The IOI paper found 26 heads in the circuit. Is this the "true" circuit, or just one sufficient subset?