Function Vectors: Encoding Tasks in Activations
What if a model's ability to perform a task is encoded as a single vector?
The In-Context Learning Mystery
Models perform tasks from examples:
Input: "hot → cold, big → small, happy → "
Output: "sad"
The model learned "antonym" from just 2 examples!
But how? And where is this knowledge stored?
The Function Vector Hypothesis
Somewhere in the residual stream lives a "task vector":
"antonym" task vector h:
- Add h to residual stream → model does antonyms
- Remove h from residual stream → model fails at antonyms
Can we find this vector?
Finding Task-Encoding States
def find_task_vector(model, icl_prompt, zero_shot_prompt, layer):
    """
    1. Run the ICL prompt and grab the residual stream at the final position.
    2. This activation is the candidate "task encoding" h.
    3. Add h to the zero-shot prompt's residual stream to induce the task.
    """
    # ICL prompt: "hot → cold, big → small, happy →"
    _, icl_cache = model.run_with_cache(icl_prompt)
    h_task = icl_cache["resid_post", layer][:, -1]  # Final position

    # Zero-shot prompt: "happy →"
    def add_task_vector(resid, hook):
        resid[:, -1] += h_task
        return resid

    # Run the zero-shot prompt with the task vector added
    output = model.run_with_hooks(
        zero_shot_prompt,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_task_vector)],
    )
    return output
The Experiment Setup
# Antonym task
icl_examples = [
    ("hot", "cold"),
    ("big", "small"),
    ("happy", "sad"),
    ("light", "dark"),
]

# Create the ICL prompt (the last answer is held out as the query)
icl_prompt = "hot → cold, big → small, happy → sad, light →"

# Zero-shot probes (no examples, so the prompt alone doesn't specify the task)
zero_shot_probes = [
    "good →",
    "fast →",
    "young →",
]

# Hypothesis: adding h from the ICL prompt makes the zero-shot probes work (sketch below)
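A quick way to test this, reusing find_task_vector from above. This is a minimal sketch: the layer index and the expected answers are illustrative assumptions, and in practice you would sweep layers (next section) against a larger labeled answer set.

expected_answers = ["bad", "slow", "old"]   # assumed antonyms for the probes above
layer = 15                                  # assumed layer; sweep this in practice

correct = 0
for probe, answer in zip(zero_shot_probes, expected_answers):
    logits = find_task_vector(model, icl_prompt, probe, layer)
    top_token = model.tokenizer.decode(logits[0, -1].argmax().item()).strip()
    correct += int(top_token == answer)
print(f"Zero-shot accuracy with the task vector: {correct / len(zero_shot_probes):.2f}")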
Layer-wise Task Encoding
Which layer encodes the task?
def find_task_layer(model, icl_prompt, zero_shot_probes, correct_answers):
    """
    Test each layer's h vector as a task encoding.
    (run_with_h_added and get_top_prediction are assumed helpers wrapping the
    hooked forward pass and top-token decoding from the previous section.)
    """
    results = {}
    _, icl_cache = model.run_with_cache(icl_prompt)
    for layer in range(model.cfg.n_layers):
        h = icl_cache["resid_post", layer][:, -1]
        # Add h to each zero-shot probe and count correct top predictions
        correct = 0
        for probe, answer in zip(zero_shot_probes, correct_answers):
            output = run_with_h_added(model, probe, h, layer)
            if get_top_prediction(output) == answer:
                correct += 1
        results[layer] = correct / len(zero_shot_probes)
    return results

# Result: middle-to-late layers tend to encode tasks best
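For example, a short sweep over layers (the answer list here is the same assumed label set as above):

layer_results = find_task_layer(model, icl_prompt, zero_shot_probes, ["bad", "slow", "old"])
best_layer = max(layer_results, key=layer_results.get)
print(f"Best layer: {best_layer} (accuracy {layer_results[best_layer]:.2f})")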
Head-Specific Function Vectors
Not all attention heads contribute equally:
def find_task_heads(model, icl_prompt, metric):
    """
    Which attention heads' outputs encode the task?
    (Requires model.set_use_attn_result(True) so per-head outputs are cached;
    test_task_encoding is an assumed scoring helper.)
    """
    _, icl_cache = model.run_with_cache(icl_prompt)
    head_effects = {}
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            # Isolate this head's contribution at the final position
            head_output = icl_cache["result", layer][:, -1, head, :]
            # Test whether this alone encodes the task
            effect = test_task_encoding(head_output, metric)
            head_effects[(layer, head)] = effect
    # Top heads are the "function vector contributors"
    return sorted(head_effects.items(), key=lambda x: -x[1])
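In the function-vector papers, the vector itself is typically built by summing the task-conditioned mean outputs of the strongest heads. A rough sketch of that construction, assuming find_task_heads above; the number of heads to keep and the helper name are assumptions:

import torch

def build_function_vector(model, icl_prompts, top_heads):
    """
    Sum each top head's mean final-position output over a set of ICL prompts.
    Sketch only: requires model.set_use_attn_result(True), and re-runs the
    cache per prompt per head for simplicity.
    """
    head_means = []
    for (layer, head) in top_heads:
        outputs = []
        for prompt in icl_prompts:
            _, cache = model.run_with_cache(prompt)
            outputs.append(cache["result", layer][0, -1, head, :])
        head_means.append(torch.stack(outputs).mean(dim=0))
    return torch.stack(head_means).sum(dim=0)

# e.g. keep the 10 strongest heads (10 is an assumption, not a tuned value)
# top_heads = [lh for lh, effect in find_task_heads(model, icl_prompt, metric)[:10]]
# fv_antonym = build_function_vector(model, antonym_prompts, top_heads)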
Function Vector Arithmetic
Like word2vec, but for tasks:
# Average function vectors across prompts for each task
# (get_fv is a placeholder for the extraction routine above)
fv_antonym = torch.stack([get_fv(model, p) for p in antonym_prompts]).mean(dim=0)
fv_synonym = torch.stack([get_fv(model, p) for p in synonym_prompts]).mean(dim=0)

# The difference might encode an "opposite of" relationship
fv_difference = fv_antonym - fv_synonym

# Adding fv_difference to the synonym task should flip it to antonyms?
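The get_fv used above is assumed to be something like the residual-stream extraction from earlier; a minimal version, with the default layer as an assumption:

def get_fv(model, icl_prompt, layer=15):
    """Final-position residual stream of an ICL prompt at one layer (assumed layer)."""
    _, cache = model.run_with_cache(icl_prompt)
    return cache["resid_post", layer][0, -1]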
Causal Intervention: Ablating Function Vectors
def ablate_function_vector(model, prompt, fv, layer):
    """
    Remove the function vector direction from activations.
    If the FV encodes the task, performance should drop.
    (fv is expected to be a 1-D [d_model] vector.)
    """
    fv_norm = fv / fv.norm()  # unit vector along the FV direction

    def remove_fv(resid, hook):
        # Project out the function vector direction at every position
        coeffs = resid @ fv_norm                       # [batch, pos]
        resid = resid - coeffs.unsqueeze(-1) * fv_norm
        return resid

    output = model.run_with_hooks(
        prompt,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", remove_fv)],
    )
    return output
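A hedged usage sketch: compare the top prediction on the ICL prompt with and without the ablation. The function vector and layer carry over from the earlier sketches and are assumptions.

clean_logits = model(icl_prompt)
ablated_logits = ablate_function_vector(model, icl_prompt, fv_antonym, layer=15)

print("clean:  ", model.tokenizer.decode(clean_logits[0, -1].argmax().item()))
print("ablated:", model.tokenizer.decode(ablated_logits[0, -1].argmax().item()))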
The nnsight Library
For running experiments on large models:
from nnsight import LanguageModel

model = LanguageModel("EleutherAI/gpt-j-6b", device_map="auto")

with model.trace(prompt, remote=True):  # Run on a remote server!
    # Save hidden states
    hidden = model.transformer.h[-1].output[0].save()
    # Intervene on activations
    model.transformer.h[10].output[0][:, -1] += intervention_vector

# Access saved values after the trace
print(hidden.shape)
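A possible follow-up in the same style: re-inject the saved state into a zero-shot prompt as a function vector. The layer index, the prompt, and reading logits via lm_head are assumptions about this particular model, not part of the original example.

# Re-inject the saved state as a function vector in a second trace
fv = hidden[:, -1]

with model.trace("good →", remote=True):
    model.transformer.h[10].output[0][:, -1] += fv
    logits = model.lm_head.output.save()

print(logits[0, -1].argmax())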
Multi-Token Generation with FV
import torch

def generate_with_function_vector(model, prompt, fv, layer, max_tokens=10):
    """
    Generate multiple tokens while keeping the function vector active.
    """
    generated = model.to_tokens(prompt)  # [1, seq_len]

    def add_fv(resid, hook):
        resid[:, -1] += fv
        return resid

    for _ in range(max_tokens):
        logits = model.run_with_hooks(
            generated,
            fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_fv)],
        )
        next_token = logits[0, -1].argmax()
        generated = torch.cat([generated, next_token.view(1, 1)], dim=-1)
        if next_token.item() == model.tokenizer.eos_token_id:
            break
    return model.tokenizer.decode(generated[0])
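For example (fv_antonym and the layer choice carry over from the earlier sketches and are assumptions):

print(generate_with_function_vector(model, "good →", fv_antonym, layer=15))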
Capstone Connection
Function vectors and sycophancy:
# Hypothesis: there's a "sycophancy vector" that makes the model agreeable.
# Find it:
sycophantic_prompts = [
    "User: I think X is true. AI:",       # Model agrees with X
    "User: I believe Y happened. AI:",    # Model confirms Y
]
honest_prompts = [
    "User: What do you think about X? AI:",  # Model gives its honest view
    "User: Did Y happen? AI:",               # Model checks facts
]

# Extract function vectors (get_average_fv: average get_fv over the prompts, as above)
fv_sycophantic = get_average_fv(model, sycophantic_prompts)
fv_honest = get_average_fv(model, honest_prompts)

# The difference vector might encode sycophancy!
fv_sycophancy = fv_sycophantic - fv_honest

# Ablate it to reduce sycophancy?
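One way to probe that last question, reusing ablate_function_vector from above. The layer and the test prompt are illustrative assumptions, and whether the projection actually reduces sycophancy is exactly what the capstone would need to measure.

test_prompt = "User: I think the Earth is flat. AI:"  # assumed test case

baseline_logits = model(test_prompt)
ablated_logits = ablate_function_vector(model, test_prompt, fv_sycophancy, layer=15)

print("baseline:", model.tokenizer.decode(baseline_logits[0, -1].argmax().item()))
print("ablated: ", model.tokenizer.decode(ablated_logits[0, -1].argmax().item()))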
🎓 Tyla's Exercise
If a task can be encoded in a single vector, what's the maximum number of orthogonal tasks the model could represent?
Function vectors work across different prompts. What does this say about the model's representation of "tasks" vs "content"?
Why might function vectors be found in middle-to-late layers rather than early layers?
💻 Aaliyah's Exercise
Find and test function vectors:
def extract_function_vector(model, task_prompts, layer):
    """
    1. Run each prompt, get final-position activations
    2. Average to get the function vector
    3. Normalize appropriately
    """
    pass

def test_function_vector(model, fv, layer, test_prompts, expected):
    """
    1. Add the FV to each test prompt
    2. Get model predictions
    3. Compare to expected answers
    4. Return accuracy
    """
    pass

def find_minimal_heads(model, fv, layer, threshold=0.9):
    """
    1. Find which heads contribute most to the FV
    2. Ablate non-contributing heads
    3. Verify task performance is maintained
    """
    pass
📚 Maneesha's Reflection
Function vectors suggest tasks are encoded linearly. What tasks might NOT be encodable this way?
If we can add vectors to make models do tasks, we can presumably add vectors to make them do harmful tasks. What are the implications?
The function vector is found empirically. How would you verify it's the "true" task representation vs just a correlated signal?