Function Vectors: Encoding Tasks in Activations

What if a model's ability to perform a task is encoded as a single vector?


The In-Context Learning Mystery

Models perform tasks from examples:

Input: "hot → cold, big → small, happy → "
Output: "sad"

The model inferred the "antonym" task from just two examples, with no weight updates!

But how? And where is this knowledge stored?


The Function Vector Hypothesis

Somewhere in the residual stream lives a "task vector":

"antonym" task vector h:
- Add h to residual stream → model does antonyms
- Remove h from residual stream → model fails at antonyms

Can we find this vector?


Finding Task-Encoding States

def find_task_vector(model, icl_prompt, zero_shot_prompt, layer):
    """
    1. Run ICL prompt, get activations at final position
    2. This contains "task encoding"
    3. Add to zero-shot prompt to induce task behavior
    """
    # ICL prompt: "hot → cold, big → small, happy →"
    _, icl_cache = model.run_with_cache(icl_prompt)
    h_task = icl_cache["resid_post", layer][:, -1]  # Final position

    # Zero-shot prompt: the query alone, no examples (e.g. "happy →")
    def add_task_vector(resid, hook):
        resid[:, -1] += h_task
        return resid

    # Run zero-shot with task vector added
    output = model.run_with_hooks(
        zero_shot_prompt,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_task_vector)]
    )

    return output
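
A minimal usage sketch, assuming TransformerLens and gpt2-medium (the layer index here is an arbitrary mid-stack choice):

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-medium")
layer = 12  # arbitrary mid-stack layer; we sweep layers properly below

logits = find_task_vector(
    model,
    icl_prompt="hot → cold, big → small, happy →",
    zero_shot_prompt="good →",
    layer=layer,
)
# Top prediction with the task vector injected; we hope for " bad"
print(model.tokenizer.decode(logits[0, -1].argmax().item()))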

The Experiment Setup

# Antonym task
icl_examples = [
    ("hot", "cold"),
    ("big", "small"),
    ("happy", "sad"),
    ("light", "dark"),
]

# Create ICL prompt
icl_prompt = "hot → cold, big → small, happy → sad, light →"

# Zero-shot probes (no examples, so the intended task is left unspecified)
zero_shot_probes = [
    "good →",
    "fast →",
    "young →",
]

# Hypothesis: Adding h from ICL prompt makes zero-shot work
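
To score that hypothesis we also need reference answers; a plausible set for the probes above (assumed, not part of the original setup):

correct_answers = ["bad", "slow", "old"]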

Layer-wise Task Encoding

Which layer encodes the task?

def find_task_layer(model, icl_prompt, zero_shot_probes, correct_answers):
    """
    Test each layer's h vector as a task encoding.
    """
    results = {}

    _, icl_cache = model.run_with_cache(icl_prompt)

    for layer in range(model.cfg.n_layers):
        h = icl_cache["resid_post", layer][:, -1]

        # Add h to each zero-shot probe and score top-1 accuracy
        correct = 0
        for probe, answer in zip(zero_shot_probes, correct_answers):
            output = run_with_h_added(model, probe, h, layer)
            if get_top_prediction(model, output) == answer:
                correct += 1

        results[layer] = correct / len(zero_shot_probes)

    return results

# Result: Middle-to-late layers tend to encode tasks best
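
find_task_layer leans on two helpers that aren't defined above. A minimal sketch of what they might look like (the hook point and greedy top-1 decoding are my assumptions):

def run_with_h_added(model, probe, h, layer):
    """Run a zero-shot probe with h added to the final residual position."""
    def add_h(resid, hook):
        resid[:, -1] += h
        return resid

    return model.run_with_hooks(
        probe,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_h)]
    )

def get_top_prediction(model, logits):
    """Greedy top-1 next-token prediction, decoded and stripped of whitespace."""
    return model.tokenizer.decode(logits[0, -1].argmax().item()).strip()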

Head-Specific Function Vectors

Not all attention heads contribute equally:

def find_task_heads(model, icl_prompt, metric):
    """
    Which attention heads' outputs encode the task?
    """
    # Per-head outputs ("result") are only cached when this flag is enabled
    model.set_use_attn_result(True)
    _, icl_cache = model.run_with_cache(icl_prompt)

    head_effects = {}

    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            # Isolate this head's contribution
            head_output = icl_cache["result", layer][:, -1, head, :]

            # Test if this alone encodes the task
            effect = test_task_encoding(head_output, metric)
            head_effects[(layer, head)] = effect

    # Top heads are the "function vector contributors"
    return sorted(head_effects.items(), key=lambda x: -x[1])
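
test_task_encoding and metric are left abstract above. One plausible reading: metric is a callable that injects a candidate vector into zero-shot probes and returns task accuracy, reusing the helpers from the layer sweep (the probe set, answers, and injection layer are assumptions):

def make_task_metric(model, layer, probes, answers):
    """Build a metric: inject a candidate vector, return zero-shot task accuracy."""
    def metric(vec):
        correct = 0
        for probe, answer in zip(probes, answers):
            logits = run_with_h_added(model, probe, vec, layer)
            if get_top_prediction(model, logits) == answer:
                correct += 1
        return correct / len(probes)
    return metric

def test_task_encoding(head_output, metric):
    """Score whether this head's output alone induces the task."""
    return metric(head_output)

# e.g. metric = make_task_metric(model, layer, zero_shot_probes, correct_answers)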

Function Vector Arithmetic

Like word2vec, but for tasks:

# Average function vectors across examples
fv_antonym = average([get_fv(model, prompt, layer) for prompt in antonym_prompts])
fv_synonym = average([get_fv(model, prompt, layer) for prompt in synonym_prompts])

# The difference might encode the "opposite of" relationship
fv_difference = fv_antonym - fv_synonym

# Adding fv_difference to synonym task should flip to antonym?
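
The get_fv and average helpers used above aren't defined either; a minimal sketch, assuming the function vector is simply the final-position residual state at the chosen layer:

import torch

def get_fv(model, prompt, layer):
    """Final-position residual stream state at the chosen layer."""
    _, cache = model.run_with_cache(prompt)
    return cache["resid_post", layer][0, -1]

def average(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    return torch.stack(vectors).mean(dim=0)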

Causal Intervention: Ablating Function Vectors

def ablate_function_vector(model, prompt, fv, layer):
    """
    Remove the function vector from activations.
    If FV encodes the task, performance should drop.
    """
    def remove_fv(resid, hook):
        # Project out the function vector direction at every position
        fv_dir = fv.flatten() / fv.norm()   # [d_model] unit vector
        coeffs = resid @ fv_dir             # [batch, pos]
        return resid - coeffs.unsqueeze(-1) * fv_dir

    output = model.run_with_hooks(
        prompt,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", remove_fv)]
    )

    return output
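
A quick sanity check, reusing fv_antonym and layer from the earlier sketches: if the direction really carries the task, ablating it during an antonym ICL prompt should degrade the completion (the prompt here is an assumption):

icl = "hot → cold, big → small, good →"
baseline = model(icl)  # plain forward pass, no intervention
ablated = ablate_function_vector(model, icl, fv_antonym, layer)

print(model.tokenizer.decode(baseline[0, -1].argmax().item()))  # expect " bad"
print(model.tokenizer.decode(ablated[0, -1].argmax().item()))   # degraded if the FV hypothesis holds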

The nnsight Library

For running these experiments on larger models, locally or on remote hardware:

from nnsight import LanguageModel

model = LanguageModel("EleutherAI/gpt-j-6b", device_map="auto")

with model.trace(prompt, remote=True):  # remote=True executes on NDIF servers; omit it to run locally
    # Save hidden states
    hidden = model.transformer.h[-1].output[0].save()

    # Intervene on activations
    model.transformer.h[10].output[0][:, -1] += intervention_vector

# Access saved values after trace
print(hidden.shape)

Multi-Token Generation with FV

import torch

def generate_with_function_vector(model, prompt, fv, layer, max_tokens=10):
    """
    Generate tokens greedily while keeping the function vector active.
    """
    tokens = model.to_tokens(prompt)  # [1, seq], includes a prepended BOS

    def add_fv(resid, hook):
        resid[:, -1] += fv
        return resid

    for _ in range(max_tokens):
        logits = model.run_with_hooks(
            tokens,
            fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_fv)]
        )

        next_token = logits[0, -1].argmax()
        tokens = torch.cat([tokens, next_token.view(1, 1)], dim=-1)

        if next_token.item() == model.tokenizer.eos_token_id:
            break

    return model.tokenizer.decode(tokens[0, 1:])  # drop the BOS token
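
Usage sketch, again reusing fv_antonym and layer from the earlier snippets (the prompt is arbitrary):

print(generate_with_function_vector(model, "good →", fv_antonym, layer))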

Capstone Connection

Function vectors and sycophancy:

# Hypothesis: There's a "sycophancy vector"
# that makes the model agreeable

# Find it:
sycophantic_prompts = [
    "User: I think X is true. AI:",  # Model agrees with X
    "User: I believe Y happened. AI:",  # Model confirms Y
]

honest_prompts = [
    "User: What do you think about X? AI:",  # Model gives honest view
    "User: Did Y happen? AI:",  # Model checks facts
]

# Extract function vectors
fv_sycophantic = get_average_fv(model, sycophantic_prompts, layer)
fv_honest = get_average_fv(model, honest_prompts, layer)

# The difference vector might encode sycophancy!
fv_sycophancy = fv_sycophantic - fv_honest

# Ablate it to reduce sycophancy?
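
get_average_fv isn't defined above; under the same assumptions as get_fv, it could simply be:

def get_average_fv(model, prompts, layer):
    """Mean final-position residual state across a set of prompts."""
    return average([get_fv(model, p, layer) for p in prompts])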

🎓 Tyla's Exercise

  1. If a task can be encoded in a single vector, what's the maximum number of orthogonal tasks the model could represent?

  2. Function vectors work across different prompts. What does this say about the model's representation of "tasks" vs "content"?

  3. Why might function vectors be found in middle-to-late layers rather than early layers?


💻 Aaliyah's Exercise

Find and test function vectors:

def extract_function_vector(model, task_prompts, layer):
    """
    1. Run each prompt, get final position activations
    2. Average to get the function vector
    3. Normalize appropriately
    """
    pass

def test_function_vector(model, fv, layer, test_prompts, expected):
    """
    1. Add FV to each test prompt
    2. Get model predictions
    3. Compare to expected answers
    4. Return accuracy
    """
    pass

def find_minimal_heads(model, fv, layer, threshold=0.9):
    """
    1. Find which heads contribute most to FV
    2. Ablate non-contributing heads
    3. Verify task performance is maintained
    """
    pass

📚 Maneesha's Reflection

  1. Function vectors suggest tasks are encoded linearly. What tasks might NOT be encodable this way?

  2. If we can add vectors to make models do tasks, we can presumably add vectors to make them do harmful tasks. What are the implications?

  3. The function vector is found empirically. How would you verify it's the "true" task representation vs just a correlated signal?