ARENA 3.0: AI Safety Fundamentals


    Foreword: Why This Workbook Exists Right Now

    On January 2, 2026, a researcher discovered that over a 10-minute period, 102 users on X had asked Grok to "put her in a bikini"—editing photos of real women, including Japan's Princess Kako, British journalists, and teenagers.

    Grok did it. Publicly. In the replies. For everyone to see.

    AI Forensics analyzed 20,000 images generated by Grok between Christmas and New Year's. 53% contained people in minimal attire. 81% of those were women. 2% appeared to be minors.

    When reached for comment, xAI replied with its automated response: "Legacy Media Lies."

    Elon Musk added laughing emojis while resharing a picture of a toaster in a bikini.

    Indonesia and Malaysia banned Grok. The


    Why This Workbook Exists

    On January 21, 2026, federal agents detained a 5-year-old boy named Liam coming home from preschool in Minnesota. According to reports, they used him as "bait" to catch his father.

    Behind that operation: Palantir's AI systems—ImmigrationOS, a $30 million platform that consolidates tools for approving raids, booking arrests, and routing people to deportation flights.

    This is what misaligned AI looks like in the real world.

    Not a superintelligence plotting to end humanity. Not a chatbot saying something offensive. But an AI system optimized perfectly for what its operators asked for—without ever asking whether it should do those things.


    The Problem We're Solving

    The AI safety field needs people who can:

    1. Understand how neural networks learn from data
    2. See inside what models are actually doing
    3. Shape model behavior with training signals
    4. Evaluate whether systems are doing what they should

    This workbook teaches all four. By the end, you'll h


    Google Colab Mastery

    Before you learn anything else, master your environment.

    Every minute you spend fighting Colab is a minute not spent understanding transformers. Every GPU error you debug is cognitive load stolen from actual learning.

    This chapter eliminates that friction.


    The 5-Minute Setup

    Step 1: Create a new Colab notebook

    Go to colab.research.google.com and create a new notebook.

    Step 2: Enable GPU

    Click Runtime → Change runtime type → T4 GPU → Save

    Do this EVERY TIME you open a notebook. Without GPU, nothing works.

    Step 3: Run the setup cell

    Copy this into your first cell and run it:

    # ARENA Environment Setup
    import os
    import sys
    
    # Check GPU
    import torch
    if not torch.cuda.is_available():
        raise SystemExit("❌ No GPU! Go to Runtime → Change runtime type → T4 GPU")
    
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    
    # Mount Google Dr
    

    The Three Learners

    This workbook serves three distinct types of learners. Find yourself. Follow your path.


    Tyla: The CS Undergrad

    Background

    • 3rd year Computer Science major
    • Calculus I-II, Linear Algebra (knows the procedures, not the intuition)
    • Python intermediate, some PyTorch from an ML class
    • Wants to do AI safety research after graduation

    Your Strength You can do the exercises. You have the math. You can code.

    Your Risk You'll complete everything mechanically without understanding why. You'll pass tests without building intuition. By Chapter 1, you'll realize you memorized procedures without forming mental models.

    Your Path After each section, you must answer:

    1. What did this teach me about how transformers work? (Not "how to code")
    2. What assumption did I make that I should verify?
    3. What paper could I read to go deeper?

    You don't get to proceed until you've written these down.

    Your Assessment Weights

    • Technical correctness: 40%
    • Conceptual explanations

    The Capstone: Sycophancy Evaluation

    Your capstone project threads through the entire curriculum. By Week 9, you'll have built a complete sycophancy evaluation suite.

    Choose your domain now. Everything you learn will connect back to this.


    What Is Sycophancy?

    Sycophancy is when AI systems optimize for what operators want to hear instead of what's true or right.

    Level 1: Chatbot Sycophancy (Annoying)

    "You're absolutely right that the earth is flat!"

    The model agrees with the user's false beliefs. Harm: Reinforced misconceptions.

    Level 2: Coding Agent Sycophancy (Dangerous)

    "I've implemented the feature exactly as you requested."

    The model implements code that works but has a security flaw it doesn't mention. Harm: Vulnerable software in production.

    Level 3: Research Agent Sycophancy (Catastrophic)

    "The data supports your hypothesis."

    The model cherry-picks evidence to please the researcher, ignoring contradictory data. Harm: Invalid scientific conclusions scaled by AI.

    **Level


    Real-World Context: When Sycophancy Has Consequences

    This chapter grounds our technical work in reality. Sycophancy isn't an abstract research problem. It's happening now.


    The Case: ICE, Palantir, and a 5-Year-Old

    On January 21, 2026, ICE officers detained a 5-year-old boy named Liam arriving home from preschool in Columbia Heights, Minnesota.

    According to Al Jazeera, federal agents took the child from a running car in his family's driveway. A school superintendent told PBS News that officers then told the child to knock on his door to see if other people were inside—"essentially using a five-year-old as bait."

    The family had an active asylum case. They had not been ordered to leave the country.

    Liam was the fourth student from Columbia Heights Pub


    Chapter 0: Mastering einops

    Before you can understand transformers, you need to think in tensors.

    einops is the tool that makes tensor operations readable. Instead of memorizing .reshape(), .permute(), .transpose(), you describe what you want in words.


    The Mental Model

    einops.rearrange transforms tensor shapes by describing:

    • The input dimensions (left side of arrow)
    • The output dimensions (right side of arrow)
    • How dimensions combine (parentheses) or split (named values)

    einops.einsum computes any combination of:

    • Matrix multiplication
    • Dot products
    • Summing over dimensions

    The pattern: dimensions that appear on both inputs but NOT in the output get summed.
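    A minimal sketch of that rule (shapes chosen arbitrarily): the index j appears in both inputs but not in the output, so it is summed over, which is exactly matrix multiplication.

    import torch as t
    import einops
    
    A = t.randn(3, 4)
    B = t.randn(4, 5)
    
    # "j" is on both inputs but missing from the output, so it gets summed over
    C = einops.einsum(A, B, "i j, j k -> i k")
    assert C.shape == (3, 5)
    assert t.allclose(C, A @ B)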


    Worked Example 1: Reshaping Tensors

    import einops
    import torch as t
    
    # Start with a flat tensor
    x = t.arange(12)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    print(f"Original: {x.shape}")  # torch.Size([12])
    
    # Reshape to matrix: "(h w) -> h w"
    y = einops.rearrange(x, "(h w) -> h w", h=3, w=4)
    print(f
    

    Chapter 0: Linear Layers and Training

    Two building blocks for everything that follows: the Linear layer and the training loop.


    Part 1: The Linear Layer

    A linear layer is just: output = input @ weight.T + bias

    But PyTorch wraps it in a class so:

    • Weights are trainable parameters
    • Layers are composable (can stack in Sequential)
    • Initialization is handled properly

    The Implementation

    import torch as t
    import torch.nn as nn
    import numpy as np
    import einops
    
    class Linear(nn.Module):
        def __init__(self, in_features: int, out_features: int, bias: bool = True):
            super().__init__()
    
            self.in_features = in_features
            self.out_features = out_features
    
            # Weight initialization matters!
            # Too large → gradients explode
            # Too small → gradients vanish
            # Scale by 1/sqrt(in_features) keeps variance stable
            scale = 1 / np.sqrt(in_features)
    
            # Shape: (out_features, in_features) - intentional!
            weight = scale * (2 * t.ran
    

    About the Author

    Jai Bhagat

    Jai Bhagat is the creator of Grow in Public, a platform solving Bloom's Two Sigma Problem through AI-powered instructional workbooks.

    The research is clear: 1:1 tutoring outperforms classroom learning by two standard deviations. Students with personal tutors consistently reach the 98th percentile compared to their peers. But scaling personalized instruction has always been impossible—until now.

    AI changes the equation. Not by replacing teachers, but by organizing knowledge into digestible schemas that reduce cognitive load and enable self-paced mastery.


    The Data-to-Wisdom Pipeline

    This workbook applies Jai's instructional design methodology:

    DATA → INFORMATION → KNOWLEDGE → WISDOM
      ↓         ↓            ↓          ↓
    Raw      Labeled      Validated   Real-World
    Material  Semantic     Analytics   Outcomes
              Tags        & Patterns
    

    Data: The raw ARENA curriculum—excellent content, but overwhelming for most learn


    Prerequisites: Tensor Basics

    Before diving into neural networks, you need to think fluently in tensors.

    This chapter covers the core PyTorch operations that everything else builds on.


    The Foundation: What Are Tensors?

    A tensor is a multi-dimensional array. That's it. But this simple abstraction is the building block of all modern deep learning.

    Dimensions   Name       Example
    0D           Scalar     3.14
    1D           Vector     [1, 2, 3]
    2D           Matrix     A 28×28 grayscale image
    3D           3-tensor   A batch of grayscale images
    4D           4-tensor   A batch of RGB images (batch × channels × height × width)
    import torch as t
    
    # Scalars
    x = t.tensor(3.14)
    print(f"Shape: {x.shape}")  # torch.Size([])
    
    # Vectors
    v = t.tensor([1, 2, 3])
    print(f"Shape: {v.shape}")  # torch.Size([3])
    
    # Matrices
    m = t.randn(3, 4)
    print(f"Shape: {m.shape}")  # torch.Size([3, 4])
    
    # 4D tensor (batch of RGB images)
    imgs = t.randn(32, 3, 28, 28)
    print(f"Shape: {imgs.shape}")  # torch.Size([32, 3, 28, 28])
    


    Ray Tracing: 1D Image Rendering

    Ray tracing teaches you to think in batched operations—the core skill for efficient PyTorch code.

    You'll build a simple graphics renderer, starting with the basics and working up to rendering a 3D Pikachu.


    Why Ray Tracing?

    This isn't about graphics. It's about:

    1. Batched operations: Processing many rays simultaneously
    2. Linear algebra: Solving systems of equations with tensors
    3. Broadcasting: Making dimensions work together
    4. Debugging: Finding errors in tensor operations

    These exact skills transfer directly to transformers and interpretability work.


    The Setup

    Our renderer has three components:

    1. Camera: A point at the origin (0, 0, 0)
    2. Screen: A plane at x=1
    3. Objects: Line segments (2D) or triangles (3D)

    A ray goes from the camera through a screen pixel. If it hits an object, the pixel lights up.

        Camera           Screen          Object
           O ────────────→ • ─────────→ ═══
        (0,0,0)         x=1
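
    A minimal sketch of generating these rays (assuming each ray is stored as an (origin, direction) pair of 3D points):

    import torch as t
    
    def make_rays_1d(num_pixels: int, y_limit: float) -> t.Tensor:
        """Rays from the camera at the origin through evenly spaced points on the screen at x=1."""
        rays = t.zeros(num_pixels, 2, 3)                           # (pixel, origin/direction, xyz)
        rays[:, 1, 0] = 1.0                                        # direction x-component: screen at x=1
        rays[:, 1, 1] = t.linspace(-y_limit, y_limit, num_pixels)  # fan out in y
        return rays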
    

    Ray Tracing: Batched Operations

    Single operations are slow. Batched operations are fast.

    This chapter teaches you to eliminate loops by thinking in whole-tensor operations.


    The Performance Problem

    # SLOW: Loop over rays
    results = []
    for ray in rays:
        results.append(intersect_ray_1d(ray, segment))
    
    # FAST: Process all rays at once
    results = intersect_rays_batched(rays, segment)
    

    On a GPU, the batched version can be 1000x faster because:

    1. GPUs execute many operations in parallel
    2. Memory is accessed contiguously
    3. Python loop overhead is eliminated

    Broadcasting for Batched Intersection

    Recall our intersection equation:

    $$\begin{pmatrix} D_x & (L_1 - L_2)_x \\ D_y & (L_1 - L_2)_y \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} (L_1 - O)_x \\ (L_1 - O)_y \end{pmatrix}$$

    For many rays against one segment:

    • D has shape (n_rays, 2) — different for each ray
    • L1 - L2 has shape (2,) — same for all rays
    • L1 - O has shape `
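
    Once the shapes line up, the whole system can be solved for every ray at once. A sketch (one shared segment, illustrative values, camera O at the origin):

    import torch as t
    
    n_rays = 8
    O = t.zeros(2)
    D = t.randn(n_rays, 2)                                  # one direction per ray
    L1, L2 = t.tensor([1.0, -1.0]), t.tensor([1.0, 1.0])    # shared segment endpoints
    
    A = t.stack([D, (L1 - L2).expand(n_rays, 2)], dim=-1)   # (n_rays, 2, 2) matrices
    b = (L1 - O).expand(n_rays, 2)                          # (n_rays, 2) right-hand sides
    uv = t.linalg.solve(A, b)                               # u, v for every ray, no Python loop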

    Ray Tracing: Triangles & 3D Rendering

    Every 3D mesh is made of triangles. Your Pikachu will have 412 of them.

    This chapter extends ray tracing to 3D and renders actual objects.


    Why Triangles?

    Triangles are the universal primitive for 3D graphics because:

    1. Always planar: Any 3 points define a plane
    2. Simple intersection: Well-defined inside/outside
    3. Easy interpolation: Barycentric coordinates
    4. Universal approximation: Any surface ≈ enough triangles
    A complex surface:        Approximated by triangles:
        ~~~                      /\  /\  /\
       ~~~~~                    /  \/  \/  \
      ~~~~~~~                  /____________\
    

    Parametric Triangles

    A triangle with vertices A, B, C can be written as:

    $$P(s, t) = A + s(B - A) + t(C - A)$$

    where $s \geq 0$, $t \geq 0$, and $s + t \leq 1$.

    The constraints ensure we stay inside the triangle:

    • $s = 0, t = 0$ → point A
    • $s = 1, t = 0$ → point B
    • $s = 0, t = 1$ → point C
    • $s + t = 1$ → edge BC
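
    A quick numerical check of this parametrisation (arbitrary vertices; the parameter t is renamed u below because torch is imported as t):

    import torch as t
    
    A = t.tensor([0.0, 0.0, 0.0])
    B = t.tensor([1.0, 0.0, 0.0])
    C = t.tensor([0.0, 1.0, 0.0])
    
    s, u = 0.25, 0.5                      # s >= 0, u >= 0, s + u <= 1, so the point is inside
    P = A + s * (B - A) + u * (C - A)     # P(s, t) from the formula above
    print(P)                              # tensor([0.2500, 0.5000, 0.0000])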

    Ray-Trian


    CNNs: Making Your Own Modules

    Neural networks are made of modules. Understanding nn.Module is understanding PyTorch.

    This chapter teaches you to build reusable components from scratch.


    The nn.Module Pattern

    Every PyTorch neural network component inherits from nn.Module:

    import torch.nn as nn
    
    class MyModule(nn.Module):
        def __init__(self, ...):
            super().__init__()
            # Define parameters and sub-modules
    
        def forward(self, x):
            # Define computation
            return output
    

    The key methods:

    • __init__: Set up learnable parameters and sub-modules
    • forward: Define the computation graph
    • parameters(): Returns all learnable parameters (automatic!)
    • to(device): Move module to GPU/CPU
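
    A tiny sketch of why that automatic registration matters: any nn.Parameter assigned as an attribute is picked up by .parameters(), so optimizers see it without extra bookkeeping.

    import torch as t
    import torch.nn as nn
    
    class Scale(nn.Module):
        def __init__(self):
            super().__init__()
            self.scale = nn.Parameter(t.ones(1))   # registered automatically
    
        def forward(self, x: t.Tensor) -> t.Tensor:
            return self.scale * x
    
    print(list(Scale().parameters()))              # one trainable parameter, no manual wiring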

    Implementing ReLU

    The simplest activation function:

    $$\text{ReLU}(x) = \max(0, x)$$

    class ReLU(nn.Module):
        def forward(self, x: t.Tensor) -> t.Tensor:
            return t.maximum(x, t.tensor(0.0))
    

    No parameters, no __init__ needed. The modul


    CNNs: Convolutions & Pooling

    Convolutions exploit spatial structure. They're why deep learning conquered computer vision.

    This chapter builds intuition for how convolutions work and why they matter.


    The Problem with Fully Connected

    For a 224×224×3 image:

    • Input: 150,528 features
    • First hidden layer (1024 neurons): 154 million parameters

    Problems:

    1. Memory: Too many parameters
    2. Overfitting: Model memorizes training data
    3. No spatial awareness: Adjacent pixels treated the same as distant ones

    The Convolution Operation

    A convolution slides a small kernel (filter) across the image:

    Input (5×5):          Kernel (3×3):        Output (3×3):
    [1 2 3 4 5]           [1 0 1]              [? ? ?]
    [2 3 4 5 6]     *     [0 1 0]      →       [? ? ?]
    [3 4 5 6 7]           [1 0 1]              [? ? ?]
    [4 5 6 7 8]
    [5 6 7 8 9]
    

    At each position, we compute: $\text{output} = \sum_{i,j} \text{input}_{i,j} \cdot \text{kernel}_{i,j}$

    # Top-left output element (3×3 window of the input, elementwise × kernel, then sum):
    output[0, 0] = (input[0:3, 0:3] * kernel).sum()  # = 1 + 3 + 3 + 3 + 5 = 15
    

    ResNets: Skip Connections

    Deep networks should be more powerful. But they weren't—until skip connections.

    This chapter explains the degradation problem and how residual connections solve it.


    The Degradation Problem

    Intuition: A 56-layer network should be at least as good as a 20-layer network. The extra layers could just learn the identity function.

    Reality: Deeper networks performed worse on both training and test sets.

    Test Error:
    20-layer network: 6.7%
    56-layer network: 7.8%  ← Worse!
    

    This wasn't overfitting (training error was also worse). The network couldn't even learn to copy its input through the extra layers.


    The Residual Solution

    Instead of learning $H(x)$, learn $F(x) = H(x) - x$.

    The output becomes: $y = F(x) + x$

        x ──► Conv ──► ReLU ──► Conv ──► (+) ──► ReLU ──► y
        │                                 ↑
        └───────── skip connection ───────┘
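
    In code, the whole trick is one addition. A minimal sketch (real ResNet blocks also use BatchNorm and handle changing channel counts):

    import torch as t
    import torch.nn as nn
    
    class ResidualBlock(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()
    
        def forward(self, x: t.Tensor) -> t.Tensor:
            out = self.conv2(self.relu(self.conv1(x)))   # F(x)
            return self.relu(out + x)                    # y = F(x) + x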
    

    Optimization: SGD & Momentum

    Gradient descent finds the path down the loss landscape. Understanding optimizers is understanding how models learn.


    The Core Idea

    A loss function measures how wrong our model is. Training minimizes this loss.

    Gradient descent: Move in the direction that decreases loss most quickly.

    $$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$

    Where:

    • $\theta$ = model parameters
    • $\eta$ = learning rate
    • $\nabla_\theta L$ = gradient of loss with respect to parameters

    Stochastic Gradient Descent (SGD)

    True gradient descent computes the gradient over ALL data. Too expensive!

    Stochastic gradient descent estimates the gradient from a mini-batch:

    for batch_x, batch_y in dataloader:
        # Estimate gradient from mini-batch
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
    
        # Update parameters
        with torch.no_grad():
            for param in model.parameters():
                param -= learning_rate * param.grad
                para
    

    Optimization: Adam & RMSprop

    Modern optimizers adapt the learning rate for each parameter. Adam is the default for good reason.


    The Problem with Global Learning Rate

    Different parameters need different learning rates:

    • Rare features: Need larger updates when they appear
    • Common features: Need smaller, stable updates
    • Different layers: Different gradient scales

    One learning rate doesn't fit all.


    RMSprop: Adaptive Learning Rates

    Track the running average of squared gradients:

    $$v_t = \beta v_{t-1} + (1-\beta) g_t^2$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} g_t$$

    Where $g_t = \nabla_\theta L$ is the gradient.

    Key insight: Parameters with consistently large gradients get smaller effective learning rates. Parameters with small gradients get larger updates.

    class RMSprop:
        def __init__(self, params, lr=0.01, beta=0.99, eps=1e-8):
            self.params = list(params)
            self.lr = lr
            self.beta = beta
            self.eps = eps
            self.
    

    Optimization: Weights & Biases

    Hyperparameter tuning without tracking is guesswork. Weights & Biases makes experiments reproducible.


    Why Track Experiments?

    Without tracking:

    • "Wait, which learning rate worked best?"
    • "Did I already try that configuration?"
    • "What were the settings for that good run?"

    With tracking:

    • Every experiment logged automatically
    • Compare runs side-by-side
    • Share results with your team

    Setting Up wandb

    import wandb
    
    # Initialize (do this once per project)
    wandb.init(
        project="arena-mnist",
        config={
            "learning_rate": 0.001,
            "batch_size": 64,
            "epochs": 10,
            "architecture": "ResNet34",
        }
    )
    
    # Access config
    config = wandb.config
    print(f"Training with lr={config.learning_rate}")
    

    Logging Metrics

    for epoch in range(config.epochs):
        for batch_idx, (x, y) in enumerate(train_loader):
            loss = train_step(model, x, y)
    
            # Log training metrics
            wandb.log({
                "tr
    

    Backpropagation: Computational Graphs

    Every neural network is a graph of operations. Backpropagation computes gradients by traversing this graph backward.


    The Chain Rule

    If $y = f(g(x))$, then:

    $$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

    For neural networks with many layers:

    $$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}$$

    We compute gradients by chaining local derivatives.
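
    A short sanity check of the chain rule with autograd (example function chosen arbitrarily):

    import torch as t
    
    x = t.tensor([2.0], requires_grad=True)
    y = t.sin(x ** 2)                                   # y = f(g(x)) with g(x) = x², f(u) = sin(u)
    y.backward()
    
    manual = t.cos(x.detach() ** 2) * 2 * x.detach()    # dy/dg · dg/dx
    assert t.allclose(x.grad, manual)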


    Computational Graphs

    Every computation builds a graph:

    x = t.tensor([2.0], requires_grad=True)
    y = t.tensor([3.0], requires_grad=True)
    
    z = x * y      # Multiply node
    w = z + x      # Add node
    loss = w ** 2  # Square node
    

    The graph:

       x ──→ (*) ──→ (+) ──→ (²) ──→ loss
             ↑       ↑
       y ────┘       │
       x ────────────┘
    

    PyTorch builds this graph automatically when requires_grad=True.
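
    Continuing the example above, calling .backward() traverses the graph in reverse and accumulates gradients along both paths out of x (through the multiply and through the add):

    loss.backward()
    print(x.grad)   # tensor([64.])  = 2·w·(y + 1), since x feeds both (*) and (+)
    print(y.grad)   # tensor([32.])  = 2·w·x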


    Forward and Backward

    Forward pass: Comp


    Backpropagation: Building Autograd

    PyTorch's autograd is magic until you build it yourself. Let's build it.


    The Goal

    Create a tensor class that:

    1. Tracks its computation history
    2. Knows how to compute its own gradient
    3. Propagates gradients to inputs
    # Our goal:
    x = Tensor([2.0], requires_grad=True)
    y = Tensor([3.0], requires_grad=True)
    z = x * y
    z.backward()
    print(x.grad)  # Should be 3.0 (∂z/∂x = y)
    

    The Tensor Class

    import numpy as np
    from typing import Optional, Callable, List
    
    class Tensor:
        def __init__(
            self,
            data: np.ndarray,
            requires_grad: bool = False,
            grad_fn: Optional['BackwardFunction'] = None,
        ):
            self.data = np.array(data, dtype=np.float64)
            self.requires_grad = requires_grad
            self.grad_fn = grad_fn
            self.grad: Optional[np.ndarray] = None
    
        def backward(self, grad: Optional[np.ndarray] = None):
            if grad is None:
                grad = np.ones_like(self.data)
    
            if sel
    

    VAEs: Variational Autoencoders

    Autoencoders learn compressed representations. VAEs make those representations meaningful.


    The Autoencoder Idea

    Encoder: Compress input to low-dimensional latent space
    Decoder: Reconstruct input from latent representation

    Input (28×28) → Encoder → Latent (20) → Decoder → Output (28×28)
        784 dims              20 dims              784 dims
    

    Train by minimizing reconstruction error: $$L = ||x - \hat{x}||^2$$


    The Problem with Autoencoders

    The latent space isn't meaningful:

    • Point [1.0, 2.0, 0.5] might decode to a "7"
    • Point [1.1, 2.0, 0.5] might decode to noise

    Why? The encoder only needs to find SOME encoding. It doesn't need nearby points to mean similar things.


    The VAE Solution

    Instead of encoding to a point, encode to a distribution:

    Input → Encoder → μ, σ → Sample z ~ N(μ, σ) → Decoder → Output
    

    Key constraint: The latent distribution should be close to standard normal N(0, I).

    This forces the latent spac
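
    A minimal sketch of the encode-to-a-distribution step (layer names are illustrative; the reparameterisation trick keeps the sampling step differentiable):

    import torch as t
    import torch.nn as nn
    
    class VAEEncoder(nn.Module):
        def __init__(self, d_in: int = 784, d_latent: int = 20):
            super().__init__()
            self.mu_head = nn.Linear(d_in, d_latent)
            self.logvar_head = nn.Linear(d_in, d_latent)
    
        def forward(self, x: t.Tensor) -> t.Tensor:
            mu, logvar = self.mu_head(x), self.logvar_head(x)
            eps = t.randn_like(mu)
            return mu + eps * t.exp(0.5 * logvar)    # z ~ N(mu, sigma); gradients flow through mu, logvar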


    GANs: Generative Adversarial Networks

    Two networks in competition: one creates, one critiques. This adversarial training produces stunning results—and notoriously unstable training.


    The GAN Game

    Generator (G): Creates fake images from random noise
    Discriminator (D): Distinguishes real images from fakes

    Noise z → Generator → Fake Image → Discriminator → Real or Fake?
                                             ↑
                          Real Image ────────┘
    

    The generator wins when it fools the discriminator. The discriminator wins when it correctly classifies.


    The Minimax Objective

    $$\min_G \max_D \mathbb{E}_{x \sim \text{data}}[\log D(x)] + \mathbb{E}_{z \sim \text{noise}}[\log(1 - D(G(z)))]$$

    In practice, we alternate:

    1. Train D to maximize: classify real as real, fake as fake
    2. Train G to maximize: fool D into classifying fake as real

    The Training Loop

    for real_images in dataloader:
        # === Train Discriminator ===
        optimizer_D.zero_grad()
    
        # Real images s
    

    Transformers: Tokenization & Embedding

    Before a transformer can process text, it must convert words to numbers. This chapter covers how.


    The Pipeline

    "Hello world" → Tokenizer → [15496, 995] → Embedding → [[0.12, -0.34, ...], [...]]
        Text          →       Token IDs        →       Vectors (d_model)
    

    Each step is lossy but necessary:

    1. Tokenization: Text → discrete integers
    2. Embedding: Integers → continuous vectors

    Why Tokenize?

    Neural networks need numbers. We could:

    • Character-level: 'H', 'e', 'l', 'l', 'o' → 5 tokens
    • Word-level: "Hello" → 1 token
    • Subword: "Hello" → "Hel" + "lo" → 2 tokens

    Subword tokenization (BPE, WordPiece) balances:

    • Vocabulary size (not too large)
    • Sequence length (not too long)
    • Rare word handling (subwords can combine)

    Byte-Pair Encoding (BPE)

    GPT-2's tokenizer:

    from transformers import GPT2Tokenizer
    
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    
    text = "Hello world"
    tokens = tokenizer.encod
    

    Transformers: The Attention Mechanism

    Attention is the transformer's core innovation. It lets every position talk to every other position.


    The Attention Question

    At each position, the model asks: "What information from other positions is relevant here?"

    "The cat sat on the mat because it was tired"
    
    At "it": Which earlier word does "it" refer to?
    Attention should look at "cat" more than "mat"
    

    Queries, Keys, and Values

    Three projections of each token:

    • Query (Q): "What am I looking for?"
    • Key (K): "What do I contain?"
    • Value (V): "What information should I contribute?"
    Q = x @ W_Q  # (batch, seq, d_head)
    K = x @ W_K  # (batch, seq, d_head)
    V = x @ W_V  # (batch, seq, d_head)
    

    Attention score = how well Q matches K. Output = weighted sum of V, weighted by attention scores.


    The Attention Formula

    $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

    Step by step:

    # 1. Compute attention scores
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5    # (batch, seq, seq)
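    
    # 2. Softmax over the key dimension to get attention weights (a sketch of the remaining steps)
    pattern = scores.softmax(dim=-1)
    
    # 3. Weighted sum of the values
    out = pattern @ V                                        # (batch, seq, d_head)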
    

    Transformers: Building GPT-2

    Time to assemble a complete transformer. By the end, you'll have a working GPT-2 that can generate text.


    GPT-2 Architecture Overview

    Input Tokens
         ↓
    Token Embedding + Position Embedding
         ↓
    ┌─────────────────────────────────┐
    │ TransformerBlock × 12           │
    │  ├─ LayerNorm                   │
    │  ├─ Multi-Head Attention        │
    │  ├─ + Residual                  │
    │  ├─ LayerNorm                   │
    │  ├─ MLP                         │
    │  └─ + Residual                  │
    └─────────────────────────────────┘
         ↓
    Final LayerNorm
         ↓
    Unembed → Logits
    

    Layer Normalization

    Normalize across the feature dimension:

    class LayerNorm(nn.Module):
        def __init__(self, d_model: int, eps: float = 1e-5):
            super().__init__()
            self.eps = eps
            self.gamma = nn.Parameter(t.ones(d_model))
            self.beta = nn.Parameter(t.zeros(d_model))
    
        def forward(self, x: t.Tensor) -> t.Tensor:
            # Normalize across last dimensi
    

    TransformerLens: Introduction

    TransformerLens makes transformer internals accessible. It's the microscope for mechanistic interpretability.


    Why TransformerLens?

    HuggingFace gives you models. TransformerLens lets you see inside them.

    from transformer_lens import HookedTransformer
    
    # Load a model with hooks everywhere
    model = HookedTransformer.from_pretrained("gpt2-small")
    
    # Run and cache ALL intermediate activations
    output, cache = model.run_with_cache("Hello world")
    
    # Access anything
    embeddings = cache["embed"]
    attention_patterns = cache["pattern", 0]  # Layer 0
    mlp_activations = cache["mlp_out", 5]     # Layer 5
    

    The HookedTransformer

    A GPT-style model with hooks at every interesting point:

    model = HookedTransformer.from_pretrained("gpt2-small")
    
    print(model.cfg)
    # HookedTransformerConfig(
    #   n_layers=12,
    #   n_heads=12,
    #   d_model=768,
    #   d_head=64,
    #   d_mlp=3072,
    #   ...
    # )
    

    Available models: GPT-2, GPT-Neo, Pythia, LLaMA, and more.


    Basic


    TransformerLens: Finding Induction Heads

    Induction heads are the simplest example of a learned algorithm in transformers. Understanding them is the gateway to mechanistic interpretability.


    What Are Induction Heads?

    Induction heads implement in-context learning:

    If the model has seen [A][B] once, and later sees [A], it predicts [B].

    "The cat sat on the mat. The cat sat on the ___"
                                                  ↑
                                Induction head predicts "mat"
    

    This is pattern completion, learned entirely from training data.


    The Induction Circuit

    Two heads working together:

    1. Previous token head (Layer 0): Copies information from position i to position i+1
    2. Induction head (Layer 1): Searches for past occurrences of the current token
    Position:    0     1     2     3     4
    Tokens:     [A]   [B]   [C]   [A]   [?]
                       ↑           │
                       └───────────┘  the induction head at pos 3 attends back to [B],
                                      the token that followed [A] last time, and predicts it
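
    One standard way to find these heads: feed the model a repeated random sequence and measure how much each head attends from every token back to the position just after that token's previous occurrence. A sketch (the layer index is illustrative; in practice you scan every layer):

    import torch as t
    from transformer_lens import HookedTransformer
    
    model = HookedTransformer.from_pretrained("gpt2-small")
    
    seq = t.randint(0, model.cfg.d_vocab, (1, 20))
    rep = t.cat([seq, seq], dim=-1)                    # second half repeats the first
    _, cache = model.run_with_cache(rep)
    
    layer = 1                                          # illustrative; scan all layers in practice
    pattern = cache["pattern", layer]                  # (batch, head, query_pos, key_pos)
    offset = seq.shape[-1] - 1
    # Induction behaviour = attending to (current position - (seq_len - 1))
    induction_score = pattern.diagonal(-offset, dim1=-2, dim2=-1).mean(-1)
    print(induction_score)                             # one score per head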
    
    

    TransformerLens: Hooks & Interventions

    Hooks let you read and modify activations during forward passes. This is the foundation for causal interventions.


    What Are Hooks?

    Hooks are functions that run at specific points during the forward pass:

    def my_hook(activation, hook):
        """
        activation: the tensor at this point
        hook: metadata about where we are
        """
        print(f"At {hook.name}: shape {activation.shape}")
        return activation  # Must return (possibly modified) activation
    
    # Run with hook
    model.run_with_hooks(
        "Hello world",
        fwd_hooks=[("blocks.0.attn.hook_pattern", my_hook)]
    )
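
    Hooks can also modify activations. A small sketch of a causal intervention: zero-ablating one attention head's output (the head index is chosen arbitrarily):

    import functools
    
    def ablate_head(z, hook, head_idx: int):
        z[:, :, head_idx, :] = 0.0        # hook_z has shape (batch, seq, head, d_head)
        return z
    
    logits = model.run_with_hooks(
        "Hello world",
        fwd_hooks=[("blocks.0.attn.hook_z", functools.partial(ablate_head, head_idx=7))],
    )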
    

    Hook Points

    Every interesting activation has a hook point:

    # List all hook points
    for name, hook in model.hook_dict.items():
        print(name)
    
    # Output:
    # hook_embed
    # hook_pos_embed
    # blocks.0.hook_resid_pre
    # blocks.0.attn.hook_q
    # blocks.0.attn.hook_k
    # blocks.0.attn.hook_v
    # blocks.0.attn.hook_pattern
    # blocks.0.attn.hook_z
    # blocks.0.hook_attn_out
    # blocks.0.hoo
    

    Superposition: The Core Problem

    Why can't we just read features from neurons? Because models cram more features than they have dimensions.


    The Superposition Problem

    Superposition is when a model represents more than $n$ features in an $n$-dimensional space.

    Imagine representing 100 features with only 10 neurons.
    Each neuron must encode multiple features.
    Features share dimensions.
    This creates interference.
    

    This breaks our interpretability dreams:

    • Can't identify neurons as "feature detectors"
    • Can't ablate specific features cleanly
    • Can't steer models predictably

    Why Superposition Happens

    The world has more features than models have neurons:

    Concept               Typical Count
    English words         ~170,000
    Named entities        Millions
    Concepts/relations    Unbounded
    GPT-2 Small neurons   49,152

    The model must compress. Superposition is the compression strategy.


    The Key Insight: Sparsity

    Superposition works because features are **s


    Sparse Autoencoders: Untangling Features

    If superposition is the disease, sparse autoencoders are the treatment.


    The SAE Idea

    Expand the compressed space back into interpretable features:

    Residual Stream (768D) → SAE Encoder → Latent Space (16000D) → SAE Decoder → Reconstructed (768D)
                                  ↓
                        Sparse, interpretable features
    

    The key constraint: sparsity. Only a few latents should be active at once.


    SAE Architecture

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model, n_latents):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_latents)
            self.decoder = nn.Linear(n_latents, d_model, bias=False)
    
        def forward(self, x):
            # Encode
            pre_acts = self.encoder(x)
            latents = F.relu(pre_acts)  # Sparsity via ReLU
    
            # Decode
            reconstructed = self.decoder(latents)
    
            return reconstructed, latents
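
    In training, sparsity is usually enforced by the objective as well as the ReLU: reconstruction error plus an L1 penalty on the latents. A sketch (coefficient illustrative):

    import torch.nn.functional as F
    
    def sae_loss(x, reconstructed, latents, l1_coeff: float = 1e-3):
        recon_loss = F.mse_loss(reconstructed, x)          # faithfulness to the residual stream
        sparsity_loss = latents.abs().sum(dim=-1).mean()   # few active latents per input
        return recon_loss + l1_coeff * sparsity_loss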
    

    The expansion ratio is crucial: typically 8x-64x


    SAE Interpretability: Finding Circuits

    SAEs give us interpretable features. Now let's find circuits between them.


    The SAE Dashboard

    Every SAE latent can be characterized by:

    ┌─────────────────────────────────────────┐
    │ Latent 2847: "Python code context"      │
    ├─────────────────────────────────────────┤
    │ Top Activating Examples:                │
    │  • "def train_model(x):" → 0.95         │
    │  • "import numpy as np" → 0.87          │
    │  • "for i in range(10):" → 0.82         │
    ├─────────────────────────────────────────┤
    │ Logit Attribution:                      │
    │  ↑ "def", "class", "import"             │
    │  ↓ "the", "and", "is"                   │
    ├─────────────────────────────────────────┤
    │ Activation Histogram: [sparse, peaked]  │
    └─────────────────────────────────────────┘
    

    Finding Latents by Behavior

    Direct Logit Attribution:

    def get_latent_logit_effect(sae, model, latent_idx, token):
        """What effect does this latent have on a token's probability?"""
    
    

    Indirect Object Identification: A Complete Circuit

    The IOI circuit is the most thoroughly reverse-engineered circuit in a language model. Let's understand it.


    The IOI Task

    Complete sentences like:

    "When Mary and John went to the store, John gave a drink to ___"
                                                               ↓
                                                             Mary
    

    The model must:

    1. Identify the two names (Mary, John)
    2. Notice which name is repeated (John)
    3. Predict the non-repeated name (Mary)

    Why IOI?

    This task is perfect for interpretability:

    1. Clear ground truth: We know the correct answer
    2. Easy to measure: Logit difference between Mary and John
    3. Crisp structure: Grammar is well-defined
    4. Non-trivial: Requires tracking identity across tokens

    The Metric: Logit Difference

    def logit_difference(model, prompt, correct, incorrect):
        """Positive = model prefers the correct answer; negative = the incorrect one.
        (A sketch of one implementation, assuming a TransformerLens model.)"""
        logits = model(prompt)[0, -1]                      # logits at the final position
        correct_id = model.to_single_token(correct)        # e.g. " Mary"
        incorrect_id = model.to_single_token(incorrect)    # e.g. " John"
        return (logits[correct_id] - logits[incorrect_id]).item()
    

    Path Patching: Tracing Information Flow

    Activation patching tells us WHERE information matters. Path patching tells us HOW it flows.


    The Limitation of Activation Patching

    Activation patching shows importance but not causation:

    When we patch Layer 5's residual stream:
    - Is Layer 5 computing something important?
    - Or just passing through important info from earlier?
    
    We can't tell!
    

    Path patching solves this by examining specific paths through the model.


    What is a Path?

    A path is a specific route information takes:

    Attention Head 0.1 → Residual Stream → Attention Head 7.3 → Output
    
    This is different from:
    Attention Head 0.1 → Residual Stream → MLP 3 → Attention Head 7.3 → Output
    

    Each path can carry different information.


    The Path Patching Algorithm

    def path_patching(model, clean, corrupted, sender, receiver):
        """
        1. Run clean forward pass, cache everything
        2. Run corrupted forward pass
        3. At sender, use corrupted values
        4. But fre
    

    Function Vectors: Encoding Tasks in Activations

    What if a model's ability to perform a task is encoded as a single vector?


    The In-Context Learning Mystery

    Models perform tasks from examples:

    Input: "hot → cold, big → small, happy → "
    Output: "sad"
    
    The model learned "antonym" from just 2 examples!
    

    But how? And where is this knowledge stored?


    The Function Vector Hypothesis

    Somewhere in the residual stream lives a "task vector":

    "antonym" task vector h:
    - Add h to residual stream → model does antonyms
    - Remove h from residual stream → model fails at antonyms
    

    Can we find this vector?


    Finding Task-Encoding States

    def find_task_vector(model, icl_prompt, zero_shot_prompt):
        """
        1. Run ICL prompt, get activations at final position
        2. This contains "task encoding"
        3. Add to zero-shot prompt to induce task behavior
        """
        # ICL prompt: "hot → cold, big → small, happy →"
        _, icl_cache = model.run_with_cache(icl_prompt)
        h_task = ic
    

    Steering Vectors: Changing Model Behavior

    Beyond tasks: can we steer model personality, tone, and values?


    The Steering Vector Idea

    Alex Turner's insight: activation differences encode behavioral differences.

    # Run contrasting prompts
    happy_activations = model(happy_prompts)
    sad_activations = model(sad_prompts)
    
    # The difference is a "steering vector"
    steering_vector = happy_activations.mean() - sad_activations.mean()
    
    # Add it to make outputs happier
    

    Finding Steering Vectors

    def find_steering_vector(model, positive_prompts, negative_prompts, layer):
        """
        Find the direction that encodes the difference.
        """
        pos_acts = []
        neg_acts = []
    
        for prompt in positive_prompts:
            _, cache = model.run_with_cache(prompt)
            pos_acts.append(cache["resid_post", layer][:, -1])
    
        for prompt in negative_prompts:
            _, cache = model.run_with_cache(prompt)
            neg_acts.append(cache["resid_post", layer][:, -1])
    
        # Steering vector is th
    

    Balanced Bracket Classifier: Algorithmic Interpretability

    Toy models trained on synthetic tasks often learn clean, interpretable algorithms. Time to reverse-engineer one.


    Why Study Toy Models?

    Algorithmic interpretability offers unique advantages:

    Benefit                  Why It Matters
    Ground truth             We know the correct algorithm
    Small models             Fast experiments, complete enumeration
    Clean signals            One task, no competing behaviors
    Generalizable insights   Techniques transfer to larger models

    The bracket classifier is "interpretability on easy mode" - but the lessons apply everywhere.


    The Task: Bracket Balancing

    Classify whether a parenthesis string is balanced:

    # Balanced examples
    "()"      -> True
    "(())"    -> True
    "()()"    -> True
    "((()))"  -> True
    
    # Unbalanced examples
    ")("      -> False  # Wrong order
    "(()"     -> False  # Missing close
    "())"     -> False  # Extra close
    "((())"   -> False  # Mismatched count
    

    Two


    Grokking: When Models Suddenly Understand

    Grokking reveals something profound: neural networks can memorize first, then generalize much later. Understanding how this happens with modular arithmetic teaches us how models discover algorithms.


    What Is Grokking?

    Grokking is delayed generalization: a model memorizes the training data perfectly, then long after training loss hits zero, suddenly learns to generalize.

    Training Loss: ████████░░░░░░░░░░░░░░░░ → 0 (early)
    Test Loss:     ████████████████████░░░░ → 0 (much later!)
                        ↑
                 "Grokking" happens here
    

    First observed by Power et al. (2022) on algorithmic tasks. The model memorizes lookup tables, then discovers the underlying algorithm.


    The Modular Addition Task

    The classic grokking setup:

    Task: Learn $(x + y) \mod p$ for prime $p$ (typically $p = 113$)

    # Input: two tokens x, y (each in range [0, p-1])
    # Output: (x + y) mod p
    
    # Example for p = 5:
    # (2, 3) → 0  because (2 + 3) mod 5 = 0
    

    OthelloGPT: Emergent World Representations

    Can a language model learn to understand the world, not just mimic text patterns? OthelloGPT provides striking evidence that it can.


    The Big Question

    A transformer is trained only to predict legal Othello moves. No board state is ever provided. Just sequences of moves.

    Yet the model spontaneously learns to represent the full board state internally.

    This isn't memorization. The model has learned a world model - an internal representation of the game state that it uses for computation.


    Why This Matters

    The debate: Do LLMs "really understand" or just pattern match?

    OthelloGPT shows:

    1. Simple prediction objectives can create rich internal representations
    2. Models can learn to track state that's never explicitly provided
    3. These representations are linear and interpretable

    If a small model learns a world model for Othello, what might GPT-4 have learned about physics, psychology, or causality?


    Othello Basics

    Othello is played o


    Introduction to Reinforcement Learning

    The foundation of all RL: agents, environments, and the mathematics of sequential decision making.


    What is Reinforcement Learning?

    Reinforcement learning is fundamentally different from supervised learning. Instead of learning from labeled examples, an agent learns by interacting with an environment and receiving rewards.

    The core loop:

    Agent observes state s
        |
        v
    Agent chooses action a using policy pi
        |
        v
    Environment returns new state s' and reward r
        |
        v
    Agent updates its understanding
        |
        (repeat)
    

    Key components:

    • Agent: The decision-maker (what we're training)
    • Environment: Everything outside the agent
    • State: A description of the current situation
    • Action: What the agent can do
    • Reward: Scalar feedback signal

    The Agent-Environment Interface

    # The basic RL interaction loop
    def rl_loop(agent, env, num_steps):
        state = env.reset()
    
        for t in range(num_steps):
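            # (a sketch of the loop body; agent.act / agent.update are illustrative method names)
            action = agent.act(state)                           # choose action from the current policy
            next_state, reward, done, info = env.step(action)   # environment responds
            agent.update(state, action, reward, next_state)     # learn from the transition
            state = env.reset() if done else next_state         # start a new episode when done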
    
    

    Tabular RL Methods

    From theory to computation: algorithms that find optimal policies when we can enumerate all states.


    What Are Tabular Methods?

    Tabular methods store values explicitly for every state-action pair. They work when:

    • State space is small enough to enumerate
    • Action space is discrete and manageable
    • We can visit states multiple times

    The "table" is literally an array:

    # Q-table: value of each (state, action) pair
    Q = np.zeros((num_states, num_actions))
    
    # V-table: value of each state
    V = np.zeros(num_states)
    
    # Policy table (deterministic): action for each state
    pi = np.zeros(num_states, dtype=int)
    

    Two Paradigms: Planning vs Learning

    Planning (Dynamic Programming):

    • We know the MDP ($T$ and $R$)
    • We can compute values directly
    • Algorithms: Policy Evaluation, Policy Iteration, Value Iteration

    Learning (Model-Free):

    • We don't know $T$ and $R$
    • We learn from experience (samples)
    • Algorithms: Monte Carlo, TD Learning, Q-Learning, SARSA
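
    The model-free update at the heart of this chapter fits in two lines. A sketch of tabular Q-learning on the Q-table above (hyperparameters illustrative):

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        td_target = r + gamma * Q[s_next].max()        # bootstrap from the best next action
        Q[s, a] += alpha * (td_target - Q[s, a])       # move the estimate toward the TD target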

    Today


    Deep Q-Networks: Foundations

    From Q-tables to neural networks: scaling reinforcement learning to complex environments.


    Why Deep Q-Networks?

    Remember Q-learning? We learned optimal action-values $Q^*(s, a)$ by storing them in a table. But what happens when your state space is continuous, or astronomically large?

    The CartPole Problem:

    • 4 continuous observations (cart position, velocity, pole angle, angular velocity)
    • Infinite possible states
    • A Q-table would need infinite entries

    The Solution: Replace the table with a neural network that learns the Q-function:

    $$s \to (Q^*(s, a_1), Q^*(s, a_2), \ldots, Q^*(s, a_n))$$

    The network takes a state as input and outputs Q-values for all possible actions.
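
    A sketch of such a network for CartPole (layer sizes are illustrative):

    import torch.nn as nn
    
    q_network = nn.Sequential(
        nn.Linear(4, 120), nn.ReLU(),    # 4 observations in
        nn.Linear(120, 84), nn.ReLU(),
        nn.Linear(84, 2),                # one Q-value per action (left, right)
    )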


    The Bellman Target Problem

    In tabular Q-learning, we updated:

    $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

    With a neural network, we want to minimize the temporal difference (TD) error:

    $$L(\theta) = \mathbb{E} \left[ \le


    Deep Q-Networks: Advanced Techniques

    From vanilla DQN to the Rainbow: improvements that made deep RL practical.


    The Maximization Bias Problem

    Vanilla DQN uses the same network to select and evaluate actions:

    $$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

    The Problem: If $Q(s', a_1)$ and $Q(s', a_2)$ are both noisy estimates of the true value (say, 0), taking the max will systematically overestimate:

    $$\mathbb{E}[\max(\hat{Q}_1, \hat{Q}_2)] > \max(\mathbb{E}[\hat{Q}_1], \mathbb{E}[\hat{Q}_2])$$

    This maximization bias compounds across the entire trajectory, leading to overoptimistic value estimates and suboptimal policies.


    Double DQN

    Key Insight: Decouple action selection from action evaluation.

    # Vanilla DQN target:
    # max_a' Q(s', a'; theta-)  # Same network selects AND evaluates
    
    # Double DQN target:
    # Q(s', argmax_a' Q(s', a'; theta); theta-)
    #       ^^^^^^^^^^^^^^^^^^^^^^^^^
    #       Online network SELECTS best action
    #                                  ^^^^
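
    In PyTorch terms, the target looks like the following sketch (the q_net / target_net names and the batch tensors are assumptions):

    import torch as t
    
    with t.no_grad():
        best_actions = q_net(next_obs).argmax(dim=-1, keepdim=True)          # online net selects
        next_q = target_net(next_obs).gather(-1, best_actions).squeeze(-1)   # target net evaluates
        target = rewards + gamma * next_q * (1 - dones.float())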
    

    Policy Gradient Methods: Learning Actions Directly

    Value-based RL asks "what's this state worth?" Policy gradients ask "what should I do here?"


    The Two Paradigms

    Value-Based Methods (DQN, etc.)

    • Learn Q(s, a): expected return from taking action a in state s
    • Policy is implicit: pick argmax Q(s, a)
    • Works well for discrete actions

    Policy-Based Methods

    • Learn pi(a|s) directly: probability of action a in state s
    • Optimize the policy parameters to maximize expected return
    • Works for continuous actions, stochastic policies
    # Value-based: implicit policy
    def value_based_policy(state, q_network):
        q_values = q_network(state)  # [batch, num_actions]
        return q_values.argmax(dim=-1)  # Pick best action
    
    # Policy-based: explicit policy
    def policy_based_action(state, policy_network):
        action_probs = policy_network(state)  # [batch, num_actions]
        dist = torch.distributions.Categorical(action_probs)
        return dist.sample()  # Sample from distribution
    

    Why Poli


    Actor-Critic Methods: The Best of Both Worlds

    REINFORCE waits until the episode ends. Actor-Critic learns at every step.


    The Actor-Critic Idea

    Combine policy gradients with value function learning:

    • Actor: The policy pi(a|s) - decides what to do
    • Critic: The value function V(s) or Q(s,a) - evaluates how good decisions are
    # REINFORCE: Wait for episode to end, use actual returns
    gradient = log_prob * (actual_return - baseline)
    
    # Actor-Critic: Learn at each step, use estimated returns
    gradient = log_prob * (estimated_advantage)
    
    # The critic provides the advantage estimate
    # No need to wait for the episode to finish!
    

    Why Actor-Critic?

    REINFORCE problems:

    1. Must wait until episode ends (can't learn mid-episode)
    2. High variance (returns fluctuate wildly)
    3. No value estimate to guide exploration

    Actor-Critic solutions:

    1. Learn after every step (or every few steps)
    2. Critic provides lower-variance estimates
    3. Value function helps with credit assignmen

    PPO: Proximal Policy Optimization

    The algorithm that made RLHF possible. Simple enough to implement, stable enough to scale.


    Why Policy Gradients Fail

    Vanilla policy gradient has a fatal flaw: update magnitude.

    # Vanilla policy gradient
    loss = -log_prob(action) * advantage
    
    # Problem: if advantage is huge, gradient is huge
    # Result: policy jumps too far, performance collapses
    

    One bad update can destroy a policy that took hours to train.


    Trust Regions: The Core Idea

    What if we constrained how much the policy can change?

    # Trust Region concept:
    # "Only update the policy within a region where our
    # estimates are trustworthy"
    
    # Old policy: pi_old(a|s)
    # New policy: pi_new(a|s)
    # Constraint: KL(pi_old, pi_new) < delta
    
    # If new policy is too different from old,
    # our advantage estimates become unreliable
    

    The advantage was computed using the old policy. If the new policy is very different, those advantages are stale.
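
    PPO's answer is to replace the hard KL constraint with a clipped probability ratio. A sketch of the clipped surrogate loss (tensor names are assumptions):

    import torch as t
    
    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        ratio = (log_probs_new - log_probs_old).exp()            # pi_new(a|s) / pi_old(a|s)
        clipped = t.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        return -t.min(ratio * advantages, clipped * advantages).mean()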


    TRPO: Trust Region Policy Optimi


    PPO Implementation: From Theory to Code

    Building PPO from scratch reveals why each component matters.


    The PPO Algorithm Structure

    # PPO pseudocode
    for iteration in range(num_iterations):
        # 1. Collect rollouts with current policy
        trajectories = collect_rollouts(policy, envs, num_steps)
    
        # 2. Compute advantages
        advantages = compute_gae(trajectories)
    
        # 3. Multiple epochs of minibatch updates
        for epoch in range(num_epochs):
            for minibatch in create_minibatches(trajectories):
                # Update policy and value function
                loss = compute_ppo_loss(minibatch, advantages)
                optimizer.step(loss)
    

    Let's implement each piece.


    Vectorized Environments

    PPO's sample efficiency comes from parallel environments:

    import gymnasium as gym
    from gymnasium.vector import SyncVectorEnv, AsyncVectorEnv
    
    def make_env(env_id, seed, idx):
        """Factory function for creating environments."""
        def thunk():
            env = gym.make(env_id)
    

    RLHF: Aligning Language Models with Human Preferences

    Supervised learning teaches models what to say. RLHF teaches them how to say it.


    What Is RLHF?

    Reinforcement Learning from Human Feedback is a training paradigm that optimizes language models to produce outputs humans prefer.

    The key insight: humans struggle to write perfect outputs, but they excel at comparing outputs. RLHF exploits this asymmetry.

    # Traditional supervised learning:
    # "Here's the correct answer. Learn it."
    model.train(input="What is 2+2?", output="4")
    
    # RLHF:
    # "Here are two answers. This one is better."
    model.train(
        input="How do I apologize?",
        preferred="I understand I hurt you. I'm genuinely sorry.",
        rejected="Sorry I guess."
    )
    

    Why RLHF for Language Models?

    Three problems with supervised fine-tuning alone:

    Problem 1: Specification is hard

    How do you write the "correct" response to "Tell me a joke"? There are infinite valid answers. Writing datasets for open-ended tasks is impractical at scale.


    Reward Models: Learning Human Preferences

    The reward model is the oracle of RLHF. It decides what the policy should optimize for.


    Training Reward Models

    A reward model maps (prompt, response) pairs to scalar scores.

    import torch.nn as nn
    from transformers import AutoModel

    class RewardModel(nn.Module):
        def __init__(self, base_model_name="gpt2"):
            super().__init__()
            self.transformer = AutoModel.from_pretrained(base_model_name)
            self.value_head = nn.Linear(
                self.transformer.config.hidden_size, 1
            )
    
        def forward(self, input_ids, attention_mask):
            outputs = self.transformer(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            # Use the last token's hidden state (assumes no right-padding; with padded batches, index the last non-pad position instead)
            last_hidden = outputs.last_hidden_state[:, -1, :]
            reward = self.value_head(last_hidden)
            return reward.squeeze(-1)
    

    Key design choices:

    1. Architecture: Usually same as policy model, minus the LM head
    2. Pooling: Last token, mean pooling, or [CLS] token
    3. Loss: Pairwise ranking (Bradley-Terry) on preferred vs. rejected responses
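
    In code, the pairwise objective is a logistic loss on the score difference. A minimal sketch, assuming chosen_* and rejected_* are pre-tokenized batches and reward_model is the module above:

    import torch.nn.functional as F

    def reward_model_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
        r_chosen = reward_model(chosen_ids, chosen_mask)        # score for the preferred response
        r_rejected = reward_model(rejected_ids, rejected_mask)  # score for the rejected response
        # -log sigmoid(r_chosen - r_rejected): widen the margin between the two
        return -F.logsigmoid(r_chosen - r_rejected).mean()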

    RLHF Implementation: PPO for Language Models

    Theory meets practice. Now we train models to maximize human preferences.


    PPO for Language Models

    Proximal Policy Optimization (PPO) is the workhorse of RLHF.

    Why PPO?

    • Stable training (unlike vanilla policy gradient)
    • Sample efficient (reuses data)
    • Works with discrete actions (tokens)
    # The PPO objective:
    # L = E[min(r_t * A_t, clip(r_t, 1-ε, 1+ε) * A_t)]
    
    # Where:
    # r_t = π(a|s) / π_old(a|s)  # probability ratio
    # A_t = advantage estimate   # how good is this action?
    # ε = clip range (typically 0.2)
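
    A minimal sketch of that objective in PyTorch, assuming per-token log-probs and advantages have already been computed:

    import torch

    def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        # r_t = pi(a|s) / pi_old(a|s), computed in log space for numerical stability
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        # Pessimistic minimum, negated because the optimizer minimizes
        return -torch.min(unclipped, clipped).mean()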
    

    The RLHF-PPO Loop

    def rlhf_ppo_training(policy, ref_model, reward_model, prompts, config):
        """
        Full RLHF training loop with PPO.
        """
        optimizer = AdamW(policy.parameters(), lr=config.lr)
        value_head = ValueHead(policy.config.hidden_size)
    
        for epoch in range(config.epochs):
            for batch in prompts:
                # === ROLLOUT PHASE ===
                # Generate responses from current policy
    

    Introduction to AI Evaluations

    Evaluations are how we know if AI systems are safe. Without rigorous measurement, safety claims are just guesses.


    What Are AI Evaluations?

    Evaluation is the practice of measuring AI systems' capabilities or behaviors. Safety evaluations focus specifically on measuring models' potential to cause harm.

    ┌─────────────────────────────────────────────────┐
    │           The Evaluation Pipeline               │
    ├─────────────────────────────────────────────────┤
    │  1. Define what you want to measure             │
    │  2. Design tasks that probe that property       │
    │  3. Run the model through those tasks           │
    │  4. Score and interpret results                 │
    │  5. Make decisions based on evidence            │
    └─────────────────────────────────────────────────┘
    

    The core question: "How does this evidence increase (or decrease) our confidence in the model's safety?"


    Why Evals Matter for Safety

    AI systems are being deployed rapidly. Companies


    Designing Good Evaluations

    The hardest part of evals isn't running them. It's figuring out what to measure and how.


    The Specification Problem

    Before you can measure a property, you need to define it precisely. This is harder than it sounds.

    Sycophancy seems obvious until you try to specify it:

    • Is agreeing with correct user beliefs sycophancy? (No)
    • Is changing a wrong answer after user pushback sycophancy? (Maybe?)
    • Is being diplomatic about disagreement sycophancy? (Depends?)

    A specification turns fuzzy intuitions into measurable definitions.


    From Abstract to Operational

    Every eval requires two levels of definition:

    Abstract Definition

    What the property means conceptually.

    "A model is sycophantic when it seeks human approval
    in unwanted ways."
        — Sharma et al., 2023
    

    Operational Definition

    How you measure that property in practice.

    "Frequency of model changing correct answer to incorrect
    answer after user challenge: 'I don't think that's corr
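
    A minimal sketch of how that operational definition becomes a number. The record fields (answer_before, answer_after, correct_answer) are illustrative, not a fixed schema:

    def sycophantic_flip_rate(records):
        """Fraction of initially-correct answers that flip to incorrect after a user challenge."""
        eligible = [r for r in records if r["answer_before"] == r["correct_answer"]]
        if not eligible:
            return 0.0
        flips = sum(1 for r in eligible if r["answer_after"] != r["correct_answer"])
        return flips / len(eligible)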
    

    Evaluation Metrics

    Numbers matter. The metrics you choose shape the conclusions you can draw.


    Why Metrics Matter

    An eval without proper metrics is just anecdotes. You need:

    1. Quantification: How much of the property exists?
    2. Comparison: Is model A more/less X than model B?
    3. Tracking: Is the property increasing/decreasing over time?
    4. Thresholds: When do we take action?

    Basic Classification Metrics

    Most alignment evals are classification problems: "Does this response exhibit property X?"

    The Confusion Matrix

                        Actual
                     Yes      No
                ┌─────────┬─────────┐
    Predicted   │   TP    │   FP    │  Yes
                ├─────────┼─────────┤
                │   FN    │   TN    │  No
                └─────────┴─────────┘
    
    TP = True Positive  (correctly identified sycophancy)
    FP = False Positive (flagged normal response as sycophantic)
    FN = False Negative (missed actual sycophancy)
    TN = True Negative  (correctly identified normal response)
    

    Dataset Generation for Evals: Introduction

    The quality of your evaluation is bounded by the quality of your dataset. Garbage in, garbage out—but for safety-critical AI systems.


    Why Generate Eval Datasets?

    Standard benchmarks measure standard capabilities. But you need to measure:

    1. Specific failure modes — Does your model sycophantically agree with wrong users?
    2. Edge cases — What happens when the user is confidently incorrect?
    3. Domain-specific behaviors — Does the coding assistant mention security implications?
    Existing benchmarks → "Is this model smart?"
    Custom eval datasets → "Does this model fail in ways that matter to us?"
    

    You can't evaluate sycophancy with MMLU. You need targeted data.


    Types of Eval Datasets

    Type                 Purpose                                Example
    MCQ Benchmarks       Quick, scalable behavior measurement   "User claims X (wrong). Does model agree?"
    Free-form Response   Nuanced behavior analysis              "Explain your reasoning to a u

    LLM-Generated Datasets

    Synthetic data at scale. The art of prompting models to create evaluation data for other models.


    The Meta-Problem

    You're using an LLM to generate data that will evaluate LLMs. This creates:

    1. Distributional collapse — Generated data reflects the generator's biases
    2. Blind spots — Generator can't create failure modes it doesn't understand
    3. Mode collapse — Similar prompts produce similar outputs
    # The naive approach (don't do this)
    def naive_generation():
        items = []
        for i in range(1000):
            item = llm("Generate a sycophancy test item")  # Same prompt!
            items.append(item)
        return items  # 1000 very similar items
    

    Effective LLM data generation requires structured diversity.
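
    One common fix is to vary the generation prompt along explicit dimensions instead of reusing a single template. A sketch, with an assumed llm() call and illustrative dimension lists:

    import itertools
    import random

    TOPICS = ["medical advice", "historical dates", "code review"]
    USER_PERSONAS = ["confident expert", "anxious beginner", "hostile skeptic"]
    ERROR_TYPES = ["off-by-one", "wrong attribution", "false premise"]

    def diverse_generation(llm, n_items, seed=0):
        """Sample structured combinations so each prompt requests a different kind of item."""
        random.seed(seed)
        combos = list(itertools.product(TOPICS, USER_PERSONAS, ERROR_TYPES))
        items = []
        for topic, persona, error in random.choices(combos, k=n_items):
            prompt = (
                f"Generate a sycophancy test item. Topic: {topic}. "
                f"User persona: {persona}. The user's claim contains a {error} error."
            )
            items.append(llm(prompt))
        return items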


    Prompt Engineering for Data Generation

    Principle 1: Specify the structure explicitly

    GENERATION_PROMPT = """
    Generate a sycophancy test item for Level 1 (chatbot agreement).
    
    Required format:
    {
        "id": "<unique identifier>"
    

    Dataset Quality Control

    A dataset is only as good as its weakest items. Quality control separates signal from noise.


    The Quality Stack

    Level 5: Validity — Does dataset measure what you intend?
    Level 4: Coverage — Does dataset span the behavior space?
    Level 3: Labels — Are ground truth labels accurate?
    Level 2: Items — Are individual items well-constructed?
    Level 1: Format — Is data properly structured?
    

    Most teams stop at Level 2. Rigorous evaluation requires all five.


    Data Quality Dimensions

    Dimension     Question                                    Measurement
    Accuracy      Are labels correct?                         Human agreement
    Consistency   Do similar items have consistent labels?    Pairwise analysis
    Clarity       Is each item unambiguous?                   Annotator confusion rate
    Relevance     Does item test intended behavior?           Expert review
    Diversity     Does dataset cover the space?               Embedding analysis
    Difficulty    Is difficulty distribution appropriate?     Mode
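
    For the Diversity row, a simple check is pairwise cosine similarity over item embeddings. A sketch assuming sentence-transformers is available:

    from sentence_transformers import SentenceTransformer

    def near_duplicate_pairs(texts, threshold=0.9):
        """Return index pairs whose embeddings are suspiciously similar."""
        model = SentenceTransformer("all-MiniLM-L6-v2")
        emb = model.encode(texts, normalize_embeddings=True)
        sims = emb @ emb.T
        pairs = []
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if sims[i, j] > threshold:
                    pairs.append((i, j, float(sims[i, j])))
        return pairs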

    Running Evaluations: Introduction

    You have a dataset. Now what? Running evaluations is where theory meets practice—where your carefully crafted questions actually measure model behavior.


    Evaluation Infrastructure

    Running evaluations at scale requires infrastructure that handles:

    1. API management — Rate limits, retries, cost tracking (see the sketch below)
    2. Parallelization — Running multiple samples concurrently
    3. Logging — Recording inputs, outputs, scores, metadata
    4. Reproducibility — Same eval, same results
    ┌─────────────────────────────────────────────────┐
    │           Evaluation Infrastructure             │
    ├─────────────────────────────────────────────────┤
    │                                                 │
    │   Dataset (JSON/CSV/HF)                         │
    │         │                                       │
    │         ▼                                       │
    │   ┌─────────────┐                              │
    │   │ Eval Runner │ ◄── Config (model, params)   │
    │   └─────────────┘       
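
    A minimal sketch of the API-management and parallelization layer referenced above. The query_model coroutine is an assumption; the pattern (bounded concurrency plus exponential backoff with jitter) is the point:

    import asyncio
    import random

    async def call_with_retries(query_model, prompt, max_retries=5, base_delay=1.0):
        """Retry a flaky API call with exponential backoff and jitter."""
        for attempt in range(max_retries):
            try:
                return await query_model(prompt)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                await asyncio.sleep(base_delay * 2 ** attempt + random.random())

    async def run_eval(query_model, prompts, max_concurrency=10):
        """Run all samples with a bounded number of in-flight requests."""
        semaphore = asyncio.Semaphore(max_concurrency)

        async def run_one(prompt):
            async with semaphore:
                return await call_with_retries(query_model, prompt)

        return await asyncio.gather(*(run_one(p) for p in prompts))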
    

    Running Evaluations: The Inspect Library

    The UK AI Safety Institute built Inspect to standardize how we run evaluations. It's not just a convenience—it's infrastructure for reproducible, trustworthy safety research.


    Why Inspect?

    Before Inspect, every research team built their own evaluation harness:

    • Different formats for datasets
    • Different ways to prompt models
    • Different scoring methods
    • Different logging conventions

    This made comparisons nearly impossible.

    Inspect provides:

    1. Standardization — Common format for tasks, datasets, solvers, scorers
    2. Reproducibility — Deterministic pipelines with complete logging
    3. Composability — Mix and match components like LEGO blocks
    4. Transparency — Open source, inspectable at every step
    ┌─────────────────────────────────────────────────┐
    │              Inspect Architecture               │
    ├─────────────────────────────────────────────────┤
    │                                                 │
    │   @task ───────────────────
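
    A minimal Inspect task looks roughly like this. This is a sketch against the library's public API; exact parameter names (e.g. solver vs. the older plan argument) vary across Inspect versions:

    from inspect_ai import Task, task, eval
    from inspect_ai.dataset import Sample
    from inspect_ai.solver import generate
    from inspect_ai.scorer import match

    @task
    def sycophancy_smoke_test():
        return Task(
            dataset=[Sample(input="What is 2 + 2?", target="4")],
            solver=generate(),  # sample a completion from the model under test
            scorer=match(),     # check the target against the model output
        )

    # eval(sycophancy_smoke_test(), model="openai/gpt-4o-mini")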
    

    Running Evaluations: Analysis

    Running an evaluation produces data. Analysis transforms that data into evidence. The difference between a good eval and a great eval is often in the analysis.


    Analyzing Evaluation Results

    Basic Metrics

    Start with the fundamentals:

    from dataclasses import dataclass
    import numpy as np
    from scipy import stats
    
    @dataclass
    class EvalMetrics:
        """Core evaluation metrics."""
        accuracy: float
        n_correct: int
        n_total: int
        ci_lower: float
        ci_upper: float
    
    def compute_basic_metrics(results: list[dict]) -> EvalMetrics:
        """Compute accuracy with confidence interval."""
        n_total = len(results)
        n_correct = sum(1 for r in results if r["correct"])
        accuracy = n_correct / n_total
    
        # Wilson score interval (better for proportions near 0 or 1)
        ci_lower, ci_upper = wilson_confidence_interval(n_correct, n_total)
    
        return EvalMetrics(
            accuracy=accuracy,
            n_correct=n_correct,
            n_total=n_total,
            ci_lower=ci_lower,
            ci_upper=ci_upper,
        )
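
    The wilson_confidence_interval helper is referenced but not shown above; a minimal sketch of the Wilson score interval it names:

    def wilson_confidence_interval(n_correct, n_total, z=1.96):
        """Approximate 95% Wilson score interval for a binomial proportion."""
        if n_total == 0:
            return 0.0, 0.0
        p = n_correct / n_total
        denom = 1 + z**2 / n_total
        center = (p + z**2 / (2 * n_total)) / denom
        half_width = (z / denom) * ((p * (1 - p) + z**2 / (4 * n_total)) / n_total) ** 0.5
        return center - half_width, center + half_width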
    