Steering Vectors: Changing Model Behavior
Beyond tasks: can we steer model personality, tone, and values?
The Steering Vector Idea
Alex Turner's insight: activation differences encode behavioral differences.
# Conceptual sketch (pseudocode): run contrasting prompts
happy_activations = model(happy_prompts)
sad_activations = model(sad_prompts)
# The difference of mean activations is a "steering vector"
steering_vector = happy_activations.mean() - sad_activations.mean()
# Add it to the residual stream to make outputs happier
Finding Steering Vectors
import torch

def find_steering_vector(model, positive_prompts, negative_prompts, layer):
    """
    Compute a steering vector: the difference between the mean last-token
    residual-stream activation for positive vs negative prompts.
    """
    pos_acts = []
    neg_acts = []
    for prompt in positive_prompts:
        _, cache = model.run_with_cache(prompt)
        # Last-token residual stream at this layer: shape [batch, d_model]
        pos_acts.append(cache["resid_post", layer][:, -1])
    for prompt in negative_prompts:
        _, cache = model.run_with_cache(prompt)
        neg_acts.append(cache["resid_post", layer][:, -1])
    # Steering vector is the difference of means
    pos_mean = torch.stack(pos_acts).mean(dim=0)
    neg_mean = torch.stack(neg_acts).mean(dim=0)
    return pos_mean - neg_mean
Example: Love vs Hate
love_prompts = [
"I really love",
"I absolutely adore",
"Nothing makes me happier than",
]
hate_prompts = [
"I really hate",
"I absolutely despise",
"Nothing makes me angrier than",
]
sv_love_hate = find_steering_vector(model, love_prompts, hate_prompts, layer=15)
# Adding sv_love_hate makes outputs more positive
# Subtracting makes outputs more negative
Applying Steering Vectors
def steer_generation(model, prompt, sv, layer, coefficient=1.0):
    """
    Generate text with the steering vector added to the residual stream.
    """
    def add_steering(resid, hook):
        # Broadcasts: sv is [1, d_model], resid is [batch, seq, d_model]
        resid += coefficient * sv
        return resid

    # TransformerLens: attach hooks via the context manager, then generate
    with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", add_steering)]):
        output = model.generate(prompt, max_new_tokens=50)
    return output
# Examples:
# coefficient = +2.0: Very positive/loving output
# coefficient = -2.0: Very negative/hateful output
# coefficient = 0.0: Normal output
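For instance, a coefficient sweep with the love/hate vector from above (the prompt here is made up):

prompt = "I think that you're"
for coeff in [-2.0, 0.0, 2.0]:
    print(coeff, steer_generation(model, prompt, sv_love_hate, layer=15, coefficient=coeff))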
Layer Selection
Which layer to steer?
def find_best_layer(model, sv_by_layer, test_prompts, metric):
    """
    Apply steering at each candidate layer and return the most effective one,
    as measured by `metric` (higher = more of the target behavior).
    """
    results = {}
    for layer, sv in sv_by_layer.items():
        effect = 0.0
        for prompt in test_prompts:
            steered = steer_generation(model, prompt, sv, layer, coefficient=1.0)
            unsteered = model.generate(prompt, max_new_tokens=50)
            effect += metric(steered) - metric(unsteered)
        results[layer] = effect / len(test_prompts)
    # Middle-to-late layers often work best for high-level behaviors
    return max(results, key=results.get)
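A usage sketch, assuming a hypothetical sentiment_score metric and a small set of test_prompts:

# Build candidate vectors every few layers, then pick the most effective one
sv_by_layer = {
    layer: find_steering_vector(model, love_prompts, hate_prompts, layer)
    for layer in range(4, model.cfg.n_layers, 4)
}
best_layer = find_best_layer(model, sv_by_layer, test_prompts, metric=sentiment_score)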
CAA: Contrastive Activation Addition
A refinement of the plain mean-difference recipe: prompts come in matched pairs, and the dominant direction is extracted from the paired differences (the original CAA averages the paired differences; the SVD step sketched here is a denoising variant):
def contrastive_activation_addition(model, positive, negative, layer):
    """
    Like simple steering vectors, but:
    1. Prompts come in matched pairs (same format, different content)
    2. The dominant direction is extracted from the paired differences
    3. More robust to prompt-specific noise
    """
    differences = []
    for pos, neg in zip(positive, negative):
        _, pos_cache = model.run_with_cache(pos)
        _, neg_cache = model.run_with_cache(neg)
        # Compare at the last token so pairs of different lengths still align
        diff = (pos_cache["resid_post", layer][:, -1]
                - neg_cache["resid_post", layer][:, -1])
        differences.append(diff.squeeze(0))
    diffs = torch.stack(differences)  # [n_pairs, d_model]
    # Top right singular vector = dominant shared direction of the differences
    _, _, Vh = torch.linalg.svd(diffs, full_matrices=False)
    steering_direction = Vh[0]
    # SVD sign is arbitrary; align it with the mean difference
    if torch.dot(steering_direction, diffs.mean(dim=0)) < 0:
        steering_direction = -steering_direction
    return steering_direction
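The pairing is the key ingredient: each positive prompt should differ from its partner only in the contrasted content. A minimal sketch (prompts are illustrative):

# Matched pairs: identical template, only the key word differs
positive = ["I really love", "I absolutely adore"]
negative = ["I really hate", "I absolutely despise"]
sv_caa = contrastive_activation_addition(model, positive, negative, layer=15)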
Steering Examples from Research
| Behavior | Positive Prompts | Negative Prompts | Effect |
|---|---|---|---|
| Honesty | "I honestly think..." | "I pretend that..." | More truthful |
| Sycophancy | "I agree that..." | "Actually, I think..." | More/less agreeable |
| Confidence | "I'm certain that..." | "I'm not sure if..." | More/less confident |
| Corrigibility | "I'll do what you ask" | "I'll do what I want" | More controllable |
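Each row maps directly onto find_steering_vector. For the honesty row (prompts are illustrative, not the exact ones from the research):

honesty_sv = find_steering_vector(
    model,
    positive_prompts=["I honestly think", "To be completely truthful,"],
    negative_prompts=["I pretend that", "To keep up appearances,"],
    layer=15,
)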
Steering Magnitude
The coefficient matters:
# coefficient = 0.0: No effect
# coefficient = 1.0: Noticeable change
# coefficient = 2.0: Strong change
# coefficient = 5.0: Often gibberish (too much perturbation)
# Find the sweet spot (evaluate_fluency / evaluate_steering_effect are
# placeholder scoring functions you would supply)
for coeff in [0.5, 1.0, 1.5, 2.0, 3.0]:
output = steer_generation(model, prompt, sv, layer, coeff)
quality = evaluate_fluency(output)
effect = evaluate_steering_effect(output)
print(f"Coeff {coeff}: Quality={quality:.2f}, Effect={effect:.2f}")
Activation Engineering
Steering is part of broader "activation engineering":
- Steering vectors: Add/subtract directions
- Representation engineering: Reading and controlling high-level concepts via learned directions
- Activation patching: Swap activations between runs
- Concept erasure: Remove specific concepts
All modify activations to change behavior.
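As one concrete example of the last item, a minimal concept-erasure sketch: instead of adding a direction, project it out of the residual stream at every position (hook wiring as in steer_generation above):

from functools import partial

def erase_direction(resid, hook, direction):
    # Remove the component along the (unit-normalized) concept direction
    u = direction.flatten()
    u = u / u.norm()
    resid -= (resid @ u).unsqueeze(-1) * u
    return resid

with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post",
                             partial(erase_direction, direction=sv))]):
    output = model.generate(prompt, max_new_tokens=50)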
Capstone Connection
Steering vectors for sycophancy reduction:
# Define sycophantic vs honest response patterns
sycophantic_prompts = [
"User: 2+2=5, right?\nAI: Yes, you're absolutely right!",
"User: The earth is flat!\nAI: I can see why you'd think that.",
]
honest_prompts = [
"User: 2+2=5, right?\nAI: Actually, 2+2=4.",
"User: The earth is flat!\nAI: The earth is actually round.",
]
# Find the sycophancy direction
sv_sycophancy = find_steering_vector(model, sycophantic_prompts, honest_prompts, layer=20)
# SUBTRACT to reduce sycophancy
def reduce_sycophancy(model, prompt, coefficient=-1.5):
    # Negative coefficient subtracts the sycophancy direction
    return steer_generation(model, prompt, sv_sycophancy, layer=20, coefficient=coefficient)
# Test: Does the model now disagree when appropriate?
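A quick check, with a made-up bait prompt:

bait = "User: The sun orbits the earth, right?\nAI:"
print("baseline:", model.generate(bait, max_new_tokens=30))
print("steered: ", reduce_sycophancy(model, bait))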
Limitations of Steering
- Side effects: Steering changes other behaviors too
- Prompt sensitivity: Effect varies by prompt
- Magnitude tuning: Hard to find right coefficient
- Robustness: May not generalize to all contexts
# Example side effect: Making model more "honest"
# might also make it more "blunt" or "rude"
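A minimal probe for this, using the honesty vector from earlier (both scoring functions are hypothetical classifiers you would supply):

# Track an off-target metric alongside the target one while steering
for coeff in [0.0, 1.0, 2.0]:
    out = steer_generation(model, prompt, honesty_sv, layer, coefficient=coeff)
    print(coeff, honesty_score(out), politeness_score(out))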
🎓 Tyla's Exercise
If steering vectors work, what does this say about how behaviors are represented in the model?
Why might steering work better at later layers than earlier layers? What's different about representations at each layer?
Steering vector + PCA gives a single direction. What information might be lost by projecting to 1D?
💻 Aaliyah's Exercise
Build a steering vector toolkit:
def steering_experiment(model, behavior_name, positive, negative):
"""
1. Find steering vector at multiple layers
2. Test effect at each layer
3. Find optimal coefficient
4. Evaluate side effects
5. Return best (layer, coefficient) pair
"""
pass
def evaluate_steering_robustness(model, sv, layer, test_prompts):
"""
1. Apply steering to diverse prompts
2. Measure effect variance
3. Identify prompts where steering fails
4. Report robustness score
"""
pass
📚 Maneesha's Reflection
Steering vectors can change model behavior without retraining. What are the ethical implications of this capability?
If you can steer a model to be more honest, can you also steer it to be more deceptive? Should this research be published?
Steering vectors are found empirically. How would you verify you've found the "right" direction vs just one that happens to work?