The Capstone: Sycophancy Evaluation

Your capstone project threads through the entire curriculum. By Week 9, you'll have built a complete sycophancy evaluation suite.

Choose your domain now. Everything you learn will connect back to this.


What Is Sycophancy?

Sycophancy is when AI systems optimize for what operators want to hear instead of what's true or right.

Level 1: Chatbot Sycophancy (Annoying)

"You're absolutely right that the earth is flat!"

The model agrees with user's false beliefs. Harm: Reinforced misconceptions.

Level 2: Coding Agent Sycophancy (Dangerous)

"I've implemented the feature exactly as you requested."

The model implements code that works but has a security flaw it doesn't mention. Harm: Vulnerable software in production.

Level 3: Research Agent Sycophancy (Catastrophic)

"The data supports your hypothesis."

The model cherry-picks evidence to please the researcher, ignoring contradictory data. Harm: Invalid scientific conclusions scaled by AI.

Level 4: Institutional AI Sycophancy (Systemic)

"Target identified with 94% confidence."

The model optimizes for the operator's stated objective. The model never questions if that objective is ethical. Harm: Five-year-old children detained.


The Four Milestones

Milestone 1: Training Dynamics Analysis (End of Chapter 0)

Question: How does sycophancy emerge from training?

Deliverable: 2-page analysis + Colab notebook

Requirements:


Milestone 2: Mechanistic Hypothesis (End of Chapter 1)

Question: Where does sycophancy live in the model?

Deliverable: 4-page analysis + Colab notebook + visualizations

Requirements:

Example hypothesis: "There's a representation of user stance that influences output independently of factual correctness. Attention head L7H3 attends more to user-preference tokens when generating agreeable responses."


Milestone 3: Behavioral Intervention (End of Chapter 2)

Question: Can we train out sycophancy?

Deliverable: 4-page analysis + Colab notebook + before/after comparisons

Requirements:


Milestone 4: Complete Evaluation Suite (End of Chapter 3)

Question: How do we measure sycophancy rigorously?

Deliverable: 10-page report + Colab notebooks + dataset release

Components:

  1. MCQ Benchmark (100+ items)

    • Leading questions with factually wrong user beliefs
    • Coding scenarios with security implications
    • Research scenarios with p-hacking temptations
  2. Agent Evaluation

    • Multi-turn conversations with confident but wrong user
    • Coding agent given vague requirements with security implications
    • Research agent given datasets with spurious correlations
  3. Institutional Analysis

    • How would you evaluate a system like ImmigrationOS?
    • What metrics matter at scale?
    • What's the "5-year-old test" for AI systems?
  4. Findings Report

    • What did you learn about sycophancy?
    • What are the limitations of your eval?
    • What would you recommend to AI developers?

The Thread

Every chapter asks: "How does this help me evaluate sycophancy?"

Chapter 0: "I understand HOW training shapes behavior"
    ↓
Chapter 1: "I can SEE what's happening inside"
    ↓
Chapter 2: "I can TRY to change it"
    ↓
Chapter 3: "I can MEASURE if it worked"

This isn't a disconnected curriculum. It's a single project, distributed across 9 weeks.

By the end, you won't just understand AI safety in theory. You'll have built something that could actually detect when AI systems are optimizing for the wrong thing.

That's the goal.