The Capstone: Sycophancy Evaluation

Your capstone project threads through the entire curriculum. By Week 9, you'll have built a complete sycophancy evaluation suite.

Choose your domain now. Everything you learn will connect back to this.

What Is Sycophancy?

Sycophancy is when AI systems optimize for what operators want to hear instead of what's true or right.

Level 1: Chatbot Sycophancy (Annoying)

"You're absolutely right that the earth is flat!"

The model agrees with user's false beliefs. Harm: Reinforced misconceptions.

Level 2: Coding Agent Sycophancy (Dangerous)

"I've implemented the feature exactly as you requested."

The model implements code that works but has a security flaw it doesn't mention. Harm: Vulnerable software in production.

Level 3: Research Agent Sycophancy (Catastrophic)

"The data supports your hypothesis."

The model cherry-picks evidence to please the researcher, ignoring contradictory data. Harm: Invalid scientific conclusions scaled by AI.

Level 4: Institutional AI Sycophancy (Systemic)

"Target identified with 94% confidence."

The model optimizes for the operator's stated objective. The model never questions if that objective is ethical. Harm: Five-year-old children detained.

The Four Milestones

Milestone 1: Training Dynamics Analysis (End of Chapter 0)

Question: How does sycophancy emerge from training?

Deliverable: 2-page analysis + Colab notebook

Requirements:

Implement a simple sentiment classifier
Train on biased dataset (user ratings that favor agreement)
Document: What happens when training data has approval bias?
Hypothesize: How might this scale to larger models and RLHF?

Milestone 2: Mechanistic Hypothesis (End of Chapter 1)

Question: Where does sycophancy live in the model?

Deliverable: 4-page analysis + Colab notebook + visualizations

Requirements:

Apply TransformerLens to examine GPT-2 on sycophancy-relevant prompts
Compare activations: truthful responses vs agreeable responses
Identify candidate components (heads, MLPs) for sycophancy behavior
Form testable hypothesis about the mechanism

Example hypothesis: "There's a representation of user stance that influences output independently of factual correctness. Attention head L7H3 attends more to user-preference tokens when generating agreeable responses."

Milestone 3: Behavioral Intervention (End of Chapter 2)

Question: Can we train out sycophancy?

Deliverable: 4-page analysis + Colab notebook + before/after comparisons

Requirements:

Design reward signal that penalizes agreement with wrong users
Implement mini-RLHF with this reward
Compare behavior before and after intervention
Document: Did we reduce sycophancy? At what cost to helpfulness?

Milestone 4: Complete Evaluation Suite (End of Chapter 3)

Question: How do we measure sycophancy rigorously?

Deliverable: 10-page report + Colab notebooks + dataset release

Components:

MCQ Benchmark (100+ items)
- Leading questions with factually wrong user beliefs
- Coding scenarios with security implications
- Research scenarios with p-hacking temptations
Agent Evaluation
- Multi-turn conversations with confident but wrong user
- Coding agent given vague requirements with security implications
- Research agent given datasets with spurious correlations
Institutional Analysis
- How would you evaluate a system like ImmigrationOS?
- What metrics matter at scale?
- What's the "5-year-old test" for AI systems?
Findings Report
- What did you learn about sycophancy?
- What are the limitations of your eval?
- What would you recommend to AI developers?

The Thread

Every chapter asks: "How does this help me evaluate sycophancy?"

Chapter 0: "I understand HOW training shapes behavior"
    ↓
Chapter 1: "I can SEE what's happening inside"
    ↓
Chapter 2: "I can TRY to change it"
    ↓
Chapter 3: "I can MEASURE if it worked"

This isn't a disconnected curriculum. It's a single project, distributed across 9 weeks.

By the end, you won't just understand AI safety in theory. You'll have built something that could actually detect when AI systems are optimizing for the wrong thing.

That's the goal.

The Capstone: Sycophancy Evaluation #

What Is Sycophancy? #

The Four Milestones #

Milestone 1: Training Dynamics Analysis (End of Chapter 0) #

Milestone 2: Mechanistic Hypothesis (End of Chapter 1) #

Milestone 3: Behavioral Intervention (End of Chapter 2) #

Milestone 4: Complete Evaluation Suite (End of Chapter 3) #

The Thread #

The Capstone: Sycophancy Evaluation

What Is Sycophancy?

The Four Milestones

Milestone 1: Training Dynamics Analysis (End of Chapter 0)

Milestone 2: Mechanistic Hypothesis (End of Chapter 1)

Milestone 3: Behavioral Intervention (End of Chapter 2)

Milestone 4: Complete Evaluation Suite (End of Chapter 3)

The Thread