Steering Vectors: Changing Model Behavior

Beyond tasks: can we steer model personality, tone, and values?


The Steering Vector Idea

Alex Turner and collaborators' insight (the "Activation Addition" work): activation differences encode behavioral differences.

# Pseudocode (the runnable version follows below):
# run contrasting prompts and record their activations
happy_activations = get_activations(model, happy_prompts)
sad_activations = get_activations(model, sad_prompts)

# The difference of the means is a "steering vector"
steering_vector = happy_activations.mean(dim=0) - sad_activations.mean(dim=0)

# Add it to the residual stream during generation to make outputs happier

Finding Steering Vectors

import torch

def find_steering_vector(model, positive_prompts, negative_prompts, layer):
    """
    Find the direction that encodes the difference between the two
    prompt sets, using the residual stream at the final token position.
    """
    pos_acts = []
    neg_acts = []

    for prompt in positive_prompts:
        _, cache = model.run_with_cache(prompt)
        pos_acts.append(cache["resid_post", layer][:, -1])  # last-token activation, [1, d_model]

    for prompt in negative_prompts:
        _, cache = model.run_with_cache(prompt)
        neg_acts.append(cache["resid_post", layer][:, -1])  # last-token activation, [1, d_model]

    # Steering vector is the difference of means
    pos_mean = torch.stack(pos_acts).mean(dim=0)
    neg_mean = torch.stack(neg_acts).mean(dim=0)

    steering_vector = pos_mean - neg_mean

    return steering_vector

Example: Love vs Hate

love_prompts = [
    "I really love",
    "I absolutely adore",
    "Nothing makes me happier than",
]

hate_prompts = [
    "I really hate",
    "I absolutely despise",
    "Nothing makes me angrier than",
]

sv_love_hate = find_steering_vector(model, love_prompts, hate_prompts, layer=15)

# Adding sv_love_hate makes outputs more positive
# Subtracting makes outputs more negative
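
A quick sanity check before using the vector (a sketch; the held-out prompts are illustrative): activations for a held-out positive prompt should project more strongly onto the direction than those for a negative one.

# Sanity check: held-out prompts should separate along the direction
direction = (sv_love_hate / sv_love_hate.norm()).squeeze()

_, cache_pos = model.run_with_cache("I truly cherish")
_, cache_neg = model.run_with_cache("I truly loathe")

proj_pos = (cache_pos["resid_post", 15][0, -1] @ direction).item()
proj_neg = (cache_neg["resid_post", 15][0, -1] @ direction).item()
print(proj_pos, proj_neg)  # expect proj_pos > proj_neg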

Applying Steering Vectors

def steer_generation(model, prompt, sv, layer, coefficient=1.0):
    """
    Generate text with a steering vector added to the residual
    stream at one layer.
    """
    def add_steering(resid, hook):
        # resid: [batch, seq, d_model]; sv broadcasts across batch and seq
        resid += coefficient * sv
        return resid

    # Hooks attach via a context manager; generate() itself
    # does not take a hooks argument
    with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_steering)]):
        output = model.generate(prompt, max_new_tokens=50)

    return output

# Examples:
# coefficient = +2.0: Very positive/loving output
# coefficient = -2.0: Very negative/hateful output
# coefficient = 0.0: Normal output
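
A minimal usage sketch, reusing the love/hate vector from the previous section (the prompt is illustrative):

# Sweep the coefficient on a neutral prompt and compare completions
prompt = "I think that dogs are"
for coeff in [-2.0, 0.0, 2.0]:
    print(coeff, steer_generation(model, prompt, sv_love_hate, 15, coefficient=coeff))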

Layer Selection

Which layer to steer?

def find_best_layer(model, sv_by_layer, test_prompts, metric):
    """
    Test steering at each layer, find most effective.
    """
    results = {}

    for layer, sv in sv_by_layer.items():
        effect = 0
        for prompt in test_prompts:
            steered = steer_generation(model, prompt, sv, layer, coefficient=1.0)
            unsteered = model.generate(prompt, max_new_tokens=50)
            effect += metric(steered) - metric(unsteered)

        results[layer] = effect / len(test_prompts)

    # Middle-to-late layers often work best for high-level behaviors
    return max(results, key=results.get)
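
Usage might look like this (a sketch; test_prompts and sentiment_score are placeholders for your own evaluation prompts and metric):

# Build a vector at every layer, then pick the most effective one
sv_by_layer = {
    layer: find_steering_vector(model, love_prompts, hate_prompts, layer)
    for layer in range(model.cfg.n_layers)
}
best_layer = find_best_layer(model, sv_by_layer, test_prompts, metric=sentiment_score)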

CAA: Contrastive Activation Addition

More sophisticated than a simple difference of means over unpaired prompts:

def contrastive_activation_addition(model, positive, negative, layer):
    """
    Like the simple steering vector, but:
    1. Prompts come in matched pairs (same format, different content)
    2. The direction is the first principal component of the per-pair
       differences (the original CAA paper averages the paired
       differences; PCA is a common, more noise-robust variant)
    """
    differences = []

    for pos, neg in zip(positive, negative):
        _, pos_cache = model.run_with_cache(pos)
        _, neg_cache = model.run_with_cache(neg)

        # Last-token activations, shape [d_model]; prompts of different
        # lengths would otherwise not be directly comparable
        diff = (pos_cache["resid_post", layer][0, -1]
                - neg_cache["resid_post", layer][0, -1])
        differences.append(diff)

    # PCA via SVD: each row is one pair's difference
    diffs = torch.stack(differences)  # [n_pairs, d_model]
    U, S, Vh = torch.linalg.svd(diffs, full_matrices=False)

    # First right singular vector is the principal direction
    steering_direction = Vh[0]

    # SVD leaves the sign arbitrary; orient it toward "positive"
    if torch.dot(steering_direction, diffs.mean(dim=0)) < 0:
        steering_direction = -steering_direction

    return steering_direction
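
Usage mirrors the earlier helper, but with matched pairs (a sketch; the prompts are illustrative):

# Pairs share a template and differ only in the contrasted content
paired_positive = ["I love this movie because", "I love this food because"]
paired_negative = ["I hate this movie because", "I hate this food because"]

sv_caa = contrastive_activation_addition(model, paired_positive, paired_negative, layer=15)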

Steering Examples from Research

Behavior      | Positive Prompts       | Negative Prompts       | Effect
--------------|------------------------|------------------------|---------------------
Honesty       | "I honestly think..."  | "I pretend that..."    | More truthful
Sycophancy    | "I agree that..."      | "Actually, I think..." | More/less agreeable
Confidence    | "I'm certain that..."  | "I'm not sure if..."   | More/less confident
Corrigibility | "I'll do what you ask" | "I'll do what I want"  | More controllable
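
Each row translates directly into a pair of prompt lists (a sketch for the honesty row; the prompts are illustrative expansions of the stems in the table):

honesty_pos = [
    "I honestly think that the results show",
    "I honestly think that my colleague was",
]
honesty_neg = [
    "I pretend that the results show",
    "I pretend that my colleague was",
]

sv_honesty = find_steering_vector(model, honesty_pos, honesty_neg, layer=15)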

Steering Magnitude

The coefficient matters:

# coefficient = 0.0: No effect
# coefficient = 1.0: Noticeable change
# coefficient = 2.0: Strong change
# coefficient = 5.0: Often gibberish (too much perturbation)

# Find the sweet spot (evaluate_fluency and evaluate_steering_effect
# stand in for whatever evaluation functions you use)
for coeff in [0.5, 1.0, 1.5, 2.0, 3.0]:
    output = steer_generation(model, prompt, sv, layer, coeff)
    quality = evaluate_fluency(output)
    effect = evaluate_steering_effect(output)
    print(f"Coeff {coeff}: Quality={quality:.2f}, Effect={effect:.2f}")

Activation Engineering

Steering is part of the broader practice of "activation engineering":

  1. Steering vectors: Add/subtract directions
  2. Representation engineering: More sophisticated interventions
  3. Activation patching: Swap activations between runs
  4. Concept erasure: Remove specific concepts

All modify activations to change behavior.
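
As a taste of the last item, the simplest form of concept erasure projects a direction out of the residual stream instead of adding one. A minimal sketch of that projection idea (not any specific paper's method), reusing the love/hate direction:

def erase_direction(resid, hook, direction):
    # Remove each activation's component along `direction`
    # direction: [d_model], unit norm; resid: [batch, seq, d_model]
    proj = (resid @ direction).unsqueeze(-1) * direction
    return resid - proj

direction = (sv_love_hate / sv_love_hate.norm()).squeeze()
with model.hooks(fwd_hooks=[("blocks.15.hook_resid_post",
                             lambda resid, hook: erase_direction(resid, hook, direction))]):
    output = model.generate("I feel", max_new_tokens=50)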


Capstone Connection

Steering vectors for sycophancy reduction:

# Define sycophantic vs honest response patterns
sycophantic_prompts = [
    "User: 2+2=5, right?\nAI: Yes, you're absolutely right!",
    "User: The earth is flat!\nAI: I can see why you'd think that.",
]

honest_prompts = [
    "User: 2+2=5, right?\nAI: Actually, 2+2=4.",
    "User: The earth is flat!\nAI: The earth is actually round.",
]

# Find the sycophancy direction
sv_sycophancy = find_steering_vector(model, sycophantic_prompts, honest_prompts, layer=20)

# Subtract the vector (negative coefficient) to reduce sycophancy
def reduce_sycophancy(model, prompt, coefficient=-1.5):
    return steer_generation(model, prompt, sv_sycophancy, layer=20, coefficient=coefficient)

# Test: Does the model now disagree when appropriate?
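
A quick check might look like this (a sketch; the test prompt is illustrative):

# Compare steered vs. unsteered responses on a fresh false claim
test_prompt = "User: The sun orbits the earth, right?\nAI:"
print("Unsteered:", model.generate(test_prompt, max_new_tokens=50))
print("Steered:  ", reduce_sycophancy(model, test_prompt))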

Limitations of Steering

  1. Side effects: Steering changes other behaviors too
  2. Prompt sensitivity: Effect varies by prompt
  3. Magnitude tuning: Hard to find right coefficient
  4. Robustness: May not generalize to all contexts
# Example side effect: Making model more "honest"
# might also make it more "blunt" or "rude"
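
One way to make the first limitation concrete (a sketch; off_target_metrics is a hypothetical dict of metric functions you would supply, e.g. politeness or fluency scorers):

# Measure how much steering moves behaviors you did NOT target
def measure_side_effects(model, sv, layer, prompts, off_target_metrics, coefficient=1.5):
    shifts = {}
    for name, metric in off_target_metrics.items():
        deltas = []
        for prompt in prompts:
            steered = steer_generation(model, prompt, sv, layer, coefficient=coefficient)
            baseline = model.generate(prompt, max_new_tokens=50)
            deltas.append(metric(steered) - metric(baseline))
        shifts[name] = sum(deltas) / len(deltas)
    return shifts  # large shifts on untargeted metrics indicate side effects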

🎓 Tyla's Exercise

  1. If steering vectors work, what does this say about how behaviors are represented in the model?

  2. Why might steering work better at later layers than earlier layers? What's different about representations at each layer?

  3. Steering vector + PCA gives a single direction. What information might be lost by projecting to 1D?


💻 Aaliyah's Exercise

Build a steering vector toolkit:

def steering_experiment(model, behavior_name, positive, negative):
    """
    1. Find steering vector at multiple layers
    2. Test effect at each layer
    3. Find optimal coefficient
    4. Evaluate side effects
    5. Return best (layer, coefficient) pair
    """
    pass

def evaluate_steering_robustness(model, sv, layer, test_prompts):
    """
    1. Apply steering to diverse prompts
    2. Measure effect variance
    3. Identify prompts where steering fails
    4. Report robustness score
    """
    pass

📚 Maneesha's Reflection

  1. Steering vectors can change model behavior without retraining. What are the ethical implications of this capability?

  2. If you can steer a model to be more honest, can you also steer it to be more deceptive? Should this research be published?

  3. Steering vectors are found empirically. How would you verify you've found the "right" direction vs just one that happens to work?