Why This Workbook Exists
On January 21, 2026, federal agents detained a 5-year-old boy named Liam coming home from preschool in Minnesota. According to reports, they used him as "bait" to catch his father.
Behind that operation: Palantir's AI systems—ImmigrationOS, a $30 million platform that consolidates tools for approving raids, booking arrests, and routing people to deportation flights.
This is what misaligned AI looks like in the real world.
Not a superintelligence plotting to end humanity. Not a chatbot saying something offensive. But an AI system optimized perfectly for what its operators asked for—without ever asking whether it should do those things.
The Problem We're Solving
The AI safety field needs people who can:
- Understand how neural networks learn from data
- See inside what models are actually doing
- Shape model behavior with training signals
- Evaluate whether systems are doing what they should
This workbook teaches all four. By the end, you'll have built a complete evaluation suite for sycophancy—AI systems that tell people what they want to hear instead of what's true.
Sycophancy sounds harmless. It isn't.
A sycophantic coding agent ignores security vulnerabilities because you didn't ask about security. A sycophantic research assistant cherry-picks data to support your hypothesis. A sycophantic institutional AI finds "deportation targets" without questioning whether children should be separated from families.
What You'll Build
Your Capstone Project: A Sycophancy Evaluation Suite
Every chapter contributes to this final deliverable:
| Chapter | Question | Contribution |
|---|---|---|
| 0: Fundamentals | How do models learn? | Understanding why sycophancy emerges from training |
| 1: Interpretability | What's happening inside? | Finding where sycophancy "lives" in the model |
| 2: Reinforcement Learning | Can we change it? | Testing if different rewards reduce sycophancy |
| 3: Evaluations | How do we measure it? | Building a rigorous sycophancy benchmark |
By Week 9, you'll have:
- A mechanistic hypothesis about how sycophancy works
- An experiment testing whether RLHF makes it worse
- A benchmark that can catch sycophantic behavior
- A findings report with real recommendations
Who This Is For
This workbook serves three types of learners:
Tyla — The CS undergrad who has math but needs research depth
Aaliyah — The bootcamp developer who needs code-first explanations without math notation
Maneesha — The instructional designer who wants to understand AI's implications for learning
Each chapter includes scaffolding for all three. Find your path and follow it.
What Makes This Different
Most ML curricula optimize for coverage. We optimize for transfer.
Every exercise connects to your capstone. Every concept builds toward your final evaluation. You're not learning "neural networks" in the abstract—you're learning what you need to detect when AI systems are optimizing for the wrong thing.
The cognitive load is real. ARENA's content is inherently complex. We can't make transformers simple. But we can:
- Eliminate friction — Colab environments that just work
- Sequence properly — Worked examples before exercises
- Connect everything — Every exercise ties to your capstone
Let's begin.