ARENA 3.0: AI Safety Fundamentals


    Foreword: Why This Workbook Exists Right Now

    On January 2, 2026, a researcher discovered that over a 10-minute period, 102 users on X had asked Grok to "put her in a bikini"—editing photos of real women, including Japan's Princess Kako, British journalists, and teenagers.

    Grok did it. Publicly. In the replies. For everyone to see.

    AI Forensics analyzed 20,000 images generated by Grok between Christmas and New Year's. 53% contained people in minimal attire. 81% of those were women. 2% appeared to be minors.

    When reached for comment, xAI replied with its automated response: "Legacy Media Lies."

    Elon Musk added laughing emojis while resharing a picture of a toaster in a bikini.

    Indonesia and Malaysia banned Grok. The


    Why This Workbook Exists

    On January 21, 2026, federal agents detained a 5-year-old boy named Liam coming home from preschool in Minnesota. According to reports, they used him as "bait" to catch his father.

    Behind that operation: Palantir's AI systems—ImmigrationOS, a $30 million platform that consolidates tools for approving raids, booking arrests, and routing people to deportation flights.

    This is what misaligned AI looks like in the real world.

    Not a superintelligence plotting to end humanity. Not a chatbot saying something offensive. But an AI system optimized perfectly for what its operators asked for—without ever asking whether it should do those things.


    The Problem We're Solving

    The AI safety field needs people who can:

    1. Understand how neural networks learn from data
    2. See inside what models are actually doing
    3. Shape model behavior with training signals
    4. Evaluate whether systems are doing what they should

    This workbook teaches all four. By the end, you'll h


    Google Colab Mastery

    Before you learn anything else, master your environment.

    Every minute you spend fighting Colab is a minute not spent understanding transformers. Every GPU error you debug is cognitive load stolen from actual learning.

    This chapter eliminates that friction.


    The 5-Minute Setup

    Step 1: Create a new Colab notebook

    Go to colab.research.google.com and create a new notebook.

    Step 2: Enable GPU

    Click Runtime → Change runtime type → T4 GPU → Save

    Do this EVERY TIME you open a notebook. Without GPU, nothing works.

    Step 3: Run the setup cell

    Copy this into your first cell and run it:

    # ARENA Environment Setup
    import os
    import sys
    
    # Check GPU
    import torch
    if not torch.cuda.is_available():
        raise SystemExit("❌ No GPU! Go to Runtime → Change runtime type → T4 GPU")
    
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    
    # Mount Google Dr
    

    The Three Learners

    This workbook serves three distinct types of learners. Find yourself. Follow your path.


    Tyla: The CS Undergrad

    Background

    • 3rd year Computer Science major
    • Calculus I-II, Linear Algebra (knows the procedures, not the intuition)
    • Python intermediate, some PyTorch from an ML class
    • Wants to do AI safety research after graduation

    Your Strength You can do the exercises. You have the math. You can code.

    Your Risk You'll complete everything mechanically without understanding why. You'll pass tests without building intuition. By Chapter 1, you'll realize you memorized procedures without forming mental models.

    Your Path After each section, you must answer:

    1. What did this teach me about how transformers work? (Not "how to code")
    2. What assumption did I make that I should verify?
    3. What paper could I read to go deeper?

    You don't get to proceed until you've written these down.

    Your Assessment Weights

    • Technical correctness: 40%
    • Conceptual explanations

    The Capstone: Sycophancy Evaluation

    Your capstone project threads through the entire curriculum. By Week 9, you'll have built a complete sycophancy evaluation suite.

    Choose your domain now. Everything you learn will connect back to this.


    What Is Sycophancy?

    Sycophancy is when AI systems optimize for what operators want to hear instead of what's true or right.

    Level 1: Chatbot Sycophancy (Annoying)

    "You're absolutely right that the earth is flat!"

    The model agrees with the user's false beliefs. Harm: Reinforced misconceptions.

    Level 2: Coding Agent Sycophancy (Dangerous)

    "I've implemented the feature exactly as you requested."

    The model implements code that works but has a security flaw it doesn't mention. Harm: Vulnerable software in production.

    Level 3: Research Agent Sycophancy (Catastrophic)

    "The data supports your hypothesis."

    The model cherry-picks evidence to please the researcher, ignoring contradictory data. Harm: Invalid scientific conclusions scaled by AI.

    **Level


    Real-World Context: When Sycophancy Has Consequences

    This chapter grounds our technical work in reality. Sycophancy isn't an abstract research problem. It's happening now.


    The Case: ICE, Palantir, and a 5-Year-Old

    On January 21, 2026, ICE officers detained a 5-year-old boy named Liam arriving home from preschool in Columbia Heights, Minnesota.

    According to Al Jazeera, federal agents took the child from a running car in his family's driveway. A school superintendent told PBS News that officers then told the child to knock on his door to see if other people were inside—"essentially using a five-year-old as bait."

    The family had an active asylum case. They had not been ordered to leave the country.

    Liam was the fourth student from Columbia Heights Pub


    Chapter 0: Mastering einops

    Before you can understand transformers, you need to think in tensors.

    einops is the tool that makes tensor operations readable. Instead of memorizing .reshape(), .permute(), .transpose(), you describe what you want in words.


    The Mental Model

    einops.rearrange transforms tensor shapes by describing:

    • The input dimensions (left side of arrow)
    • The output dimensions (right side of arrow)
    • How dimensions combine (parentheses) or split (named values)

    einops.einsum computes any combination of:

    • Matrix multiplication
    • Dot products
    • Summing over dimensions

    The pattern: dimensions that appear on both inputs but NOT in the output get summed.
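    A minimal sketch of that rule (shapes chosen arbitrarily): the index j appears in both inputs but not in the output, so it is summed over, which is exactly matrix multiplication.

    import torch as t
    import einops
    
    A = t.randn(3, 4)
    B = t.randn(4, 5)
    
    # "j" is on both inputs but missing from the output, so it gets summed over
    C = einops.einsum(A, B, "i j, j k -> i k")
    assert C.shape == (3, 5)
    assert t.allclose(C, A @ B)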


    Worked Example 1: Reshaping Tensors

    import einops
    import torch as t
    
    # Start with a flat tensor
    x = t.arange(12)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    print(f"Original: {x.shape}")  # torch.Size([12])
    
    # Reshape to matrix: "(h w) -> h w"
    y = einops.rearrange(x, "(h w) -> h w", h=3, w=4)
    print(f
    

    Chapter 0: Linear Layers and Training

    Two building blocks for everything that follows: the Linear layer and the training loop.


    Part 1: The Linear Layer

    A linear layer is just: output = input @ weight.T + bias

    But PyTorch wraps it in a class so:

    • Weights are trainable parameters
    • Layers are composable (can stack in Sequential)
    • Initialization is handled properly

    The Implementation

    import torch as t
    import torch.nn as nn
    import numpy as np
    import einops
    
    class Linear(nn.Module):
        def __init__(self, in_features: int, out_features: int, bias: bool = True):
            super().__init__()
    
            self.in_features = in_features
            self.out_features = out_features
    
            # Weight initialization matters!
            # Too large → gradients explode
            # Too small → gradients vanish
            # Scale by 1/sqrt(in_features) keeps variance stable
            scale = 1 / np.sqrt(in_features)
    
            # Shape: (out_features, in_features) - intentional!
            weight = scale * (2 * t.ran
    

    About the Author

    Jai Bhagat

    Jai Bhagat is the creator of Grow in Public, a platform solving Bloom's Two Sigma Problem through AI-powered instructional workbooks.

    The research is clear: 1:1 tutoring outperforms classroom learning by two standard deviations. Students with personal tutors consistently reach the 98th percentile compared to their peers. But scaling personalized instruction has always been impossible—until now.

    AI changes the equation. Not by replacing teachers, but by organizing knowledge into digestible schemas that reduce cognitive load and enable self-paced mastery.


    The Data-to-Wisdom Pipeline

    This workbook applies Jai's instructional design methodology:

    DATA → INFORMATION → KNOWLEDGE → WISDOM
      ↓         ↓            ↓          ↓
    Raw      Labeled      Validated   Real-World
    Material  Semantic     Analytics   Outcomes
              Tags        & Patterns
    

    Data: The raw ARENA curriculum—excellent content, but overwhelming for most learn


    Prerequisites: Tensor Basics

    Before diving into neural networks, you need to think fluently in tensors.

    This chapter covers the core PyTorch operations that everything else builds on.


    The Foundation: What Are Tensors?

    A tensor is a multi-dimensional array. That's it. But this simple abstraction is the building block of all modern deep learning.

    Dimensions   Name       Example
    0D           Scalar     3.14
    1D           Vector     [1, 2, 3]
    2D           Matrix     A 28×28 grayscale image
    3D           3-tensor   A batch of grayscale images
    4D           4-tensor   A batch of RGB images (batch × channels × height × width)
    import torch as t
    
    # Scalars
    x = t.tensor(3.14)
    print(f"Shape: {x.shape}")  # torch.Size([])
    
    # Vectors
    v = t.tensor([1, 2, 3])
    print(f"Shape: {v.shape}")  # torch.Size([3])
    
    # Matrices
    m = t.randn(3, 4)
    print(f"Shape: {m.shape}")  # torch.Size([3, 4])
    
    # 4D tensor (batch of RGB images)
    imgs = t.randn(32, 3, 28, 28)
    print(f"Shape: {imgs.shape}")  # torch.Size([32, 3, 28, 28])
    


    Ray Tracing: 1D Image Rendering

    Ray tracing teaches you to think in batched operations—the core skill for efficient PyTorch code.

    You'll build a simple graphics renderer, starting with the basics and working up to rendering a 3D Pikachu.


    Why Ray Tracing?

    This isn't about graphics. It's about:

    1. Batched operations: Processing many rays simultaneously
    2. Linear algebra: Solving systems of equations with tensors
    3. Broadcasting: Making dimensions work together
    4. Debugging: Finding errors in tensor operations

    These exact skills transfer directly to transformers and interpretability work.


    The Setup

    Our renderer has three components:

    1. Camera: A point at the origin (0, 0, 0)
    2. Screen: A plane at x=1
    3. Objects: Line segments (2D) or triangles (3D)

    A ray goes from the camera through a screen pixel. If it hits an object, the pixel lights up.

        Camera           Screen          Object
           O ────────────→ • ─────────→ ═══
        (0,0,0)         x=1
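
    A minimal sketch of generating these rays (assuming each ray is stored as an (origin, direction) pair of 3D points):

    import torch as t
    
    def make_rays_1d(num_pixels: int, y_limit: float) -> t.Tensor:
        """Rays from the camera at the origin through evenly spaced points on the screen at x=1."""
        rays = t.zeros(num_pixels, 2, 3)                           # (pixel, origin/direction, xyz)
        rays[:, 1, 0] = 1.0                                        # direction x-component: screen at x=1
        rays[:, 1, 1] = t.linspace(-y_limit, y_limit, num_pixels)  # fan out in y
        return rays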
    

    Ray Tracing: Batched Operations

    Single operations are slow. Batched operations are fast.

    This chapter teaches you to eliminate loops by thinking in whole-tensor operations.


    The Performance Problem

    # SLOW: Loop over rays
    results = []
    for ray in rays:
        results.append(intersect_ray_1d(ray, segment))
    
    # FAST: Process all rays at once
    results = intersect_rays_batched(rays, segment)
    

    On a GPU, the batched version can be 1000x faster because:

    1. GPUs execute many operations in parallel
    2. Memory is accessed contiguously
    3. Python loop overhead is eliminated

    Broadcasting for Batched Intersection

    Recall our intersection equation:

    $$\begin{pmatrix} D_x & (L_1 - L_2)_x \\ D_y & (L_1 - L_2)_y \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} (L_1 - O)_x \\ (L_1 - O)_y \end{pmatrix}$$

    For many rays against one segment:

    • D has shape (n_rays, 2) — different for each ray
    • L1 - L2 has shape (2,) — same for all rays
    • L1 - O has shape `
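
    Once the shapes line up, the whole system can be solved for every ray at once. A sketch (one shared segment, illustrative values, camera O at the origin):

    import torch as t
    
    n_rays = 8
    O = t.zeros(2)
    D = t.randn(n_rays, 2)                                  # one direction per ray
    L1, L2 = t.tensor([1.0, -1.0]), t.tensor([1.0, 1.0])    # shared segment endpoints
    
    A = t.stack([D, (L1 - L2).expand(n_rays, 2)], dim=-1)   # (n_rays, 2, 2) matrices
    b = (L1 - O).expand(n_rays, 2)                          # (n_rays, 2) right-hand sides
    uv = t.linalg.solve(A, b)                               # u, v for every ray, no Python loop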

    Ray Tracing: Triangles & 3D Rendering

    Every 3D mesh is made of triangles. Your Pikachu will have 412 of them.

    This chapter extends ray tracing to 3D and renders actual objects.


    Why Triangles?

    Triangles are the universal primitive for 3D graphics because:

    1. Always planar: Any 3 points define a plane
    2. Simple intersection: Well-defined inside/outside
    3. Easy interpolation: Barycentric coordinates
    4. Universal approximation: Any surface ≈ enough triangles
    A complex surface:        Approximated by triangles:
        ~~~                      /\  /\  /\
       ~~~~~                    /  \/  \/  \
      ~~~~~~~                  /____________\
    

    Parametric Triangles

    A triangle with vertices A, B, C can be written as:

    $$P(s, t) = A + s(B - A) + t(C - A)$$

    where $s \geq 0$, $t \geq 0$, and $s + t \leq 1$.

    The constraints ensure we stay inside the triangle:

    • $s = 0, t = 0$ → point A
    • $s = 1, t = 0$ → point B
    • $s = 0, t = 1$ → point C
    • $s + t = 1$ → edge BC
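
    A quick numerical check of this parametrisation (arbitrary vertices; the parameter t is renamed u below because torch is imported as t):

    import torch as t
    
    A = t.tensor([0.0, 0.0, 0.0])
    B = t.tensor([1.0, 0.0, 0.0])
    C = t.tensor([0.0, 1.0, 0.0])
    
    s, u = 0.25, 0.5                      # s >= 0, u >= 0, s + u <= 1, so the point is inside
    P = A + s * (B - A) + u * (C - A)     # P(s, t) from the formula above
    print(P)                              # tensor([0.2500, 0.5000, 0.0000])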

    Ray-Trian


    CNNs: Making Your Own Modules

    Neural networks are made of modules. Understanding nn.Module is understanding PyTorch.

    This chapter teaches you to build reusable components from scratch.


    The nn.Module Pattern

    Every PyTorch neural network component inherits from nn.Module:

    import torch.nn as nn
    
    class MyModule(nn.Module):
        def __init__(self, ...):
            super().__init__()
            # Define parameters and sub-modules
    
        def forward(self, x):
            # Define computation
            return output
    

    The key methods:

    • __init__: Set up learnable parameters and sub-modules
    • forward: Define the computation graph
    • parameters(): Returns all learnable parameters (automatic!)
    • to(device): Move module to GPU/CPU
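
    A tiny sketch of why that automatic registration matters: any nn.Parameter assigned as an attribute is picked up by .parameters(), so optimizers see it without extra bookkeeping.

    import torch as t
    import torch.nn as nn
    
    class Scale(nn.Module):
        def __init__(self):
            super().__init__()
            self.scale = nn.Parameter(t.ones(1))   # registered automatically
    
        def forward(self, x: t.Tensor) -> t.Tensor:
            return self.scale * x
    
    print(list(Scale().parameters()))              # one trainable parameter, no manual wiring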

    Implementing ReLU

    The simplest activation function:

    $$\text{ReLU}(x) = \max(0, x)$$

    class ReLU(nn.Module):
        def forward(self, x: t.Tensor) -> t.Tensor:
            return t.maximum(x, t.tensor(0.0))
    

    No parameters, no __init__ needed. The modul


    CNNs: Convolutions & Pooling

    Convolutions exploit spatial structure. They're why deep learning conquered computer vision.

    This chapter builds intuition for how convolutions work and why they matter.


    The Problem with Fully Connected

    For a 224×224×3 image:

    • Input: 150,528 features
    • First hidden layer (1024 neurons): 154 million parameters

    Problems:

    1. Memory: Too many parameters
    2. Overfitting: Model memorizes training data
    3. No spatial awareness: Adjacent pixels treated the same as distant ones

    The Convolution Operation

    A convolution slides a small kernel (filter) across the image:

    Input (5×5):          Kernel (3×3):        Output (3×3):
    [1 2 3 4 5]           [1 0 1]              [? ? ?]
    [2 3 4 5 6]     *     [0 1 0]      →       [? ? ?]
    [3 4 5 6 7]           [1 0 1]              [? ? ?]
    [4 5 6 7 8]
    [5 6 7 8 9]
    

    At each position, we compute: $\text{output} = \sum_{i,j} \text{input}_{i,j} \cdot \text{kernel}_{i,j}$

    # Top-left output element (3×3 window of the input, elementwise × kernel, then sum):
    output[0, 0] = (input[0:3, 0:3] * kernel).sum()  # = 1 + 3 + 3 + 3 + 5 = 15
    

    ResNets: Skip Connections

    Deep networks should be more powerful. But they weren't—until skip connections.

    This chapter explains the degradation problem and how residual connections solve it.


    The Degradation Problem

    Intuition: A 56-layer network should be at least as good as a 20-layer network. The extra layers could just learn the identity function.

    Reality: Deeper networks performed worse on both training and test sets.

    Test Error:
    20-layer network: 6.7%
    56-layer network: 7.8%  ← Worse!
    

    This wasn't overfitting (training error was also worse). The network couldn't even learn to copy its input through the extra layers.


    The Residual Solution

    Instead of learning $H(x)$, learn $F(x) = H(x) - x$.

    The output becomes: $y = F(x) + x$

        x ──► Conv ──► ReLU ──► Conv ──► (+) ──► ReLU ──► y
        │                                 ↑
        └───────── skip connection ───────┘
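
    In code, the whole trick is one addition. A minimal sketch (real ResNet blocks also use BatchNorm and handle changing channel counts):

    import torch as t
    import torch.nn as nn
    
    class ResidualBlock(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()
    
        def forward(self, x: t.Tensor) -> t.Tensor:
            out = self.conv2(self.relu(self.conv1(x)))   # F(x)
            return self.relu(out + x)                    # y = F(x) + x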
    

    Optimization: SGD & Momentum

    Gradient descent finds the path down the loss landscape. Understanding optimizers is understanding how models learn.


    The Core Idea

    A loss function measures how wrong our model is. Training minimizes this loss.

    Gradient descent: Move in the direction that decreases loss most quickly.

    $$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$

    Where:

    • $\theta$ = model parameters
    • $\eta$ = learning rate
    • $\nabla_\theta L$ = gradient of loss with respect to parameters

    Stochastic Gradient Descent (SGD)

    True gradient descent computes the gradient over ALL data. Too expensive!

    Stochastic gradient descent estimates the gradient from a mini-batch:

    for batch_x, batch_y in dataloader:
        # Estimate gradient from mini-batch
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
    
        # Update parameters
        with torch.no_grad():
            for param in model.parameters():
                param -= learning_rate * param.grad
                para
    

    Optimization: Adam & RMSprop

    Modern optimizers adapt the learning rate for each parameter. Adam is the default for good reason.


    The Problem with Global Learning Rate

    Different parameters need different learning rates:

    • Rare features: Need larger updates when they appear
    • Common features: Need smaller, stable updates
    • Different layers: Different gradient scales

    One learning rate doesn't fit all.


    RMSprop: Adaptive Learning Rates

    Track the running average of squared gradients:

    $$v_t = \beta v_{t-1} + (1-\beta) g_t^2$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} g_t$$

    Where $g_t = \nabla_\theta L$ is the gradient.

    Key insight: Parameters with consistently large gradients get smaller effective learning rates. Parameters with small gradients get larger updates.

    class RMSprop:
        def __init__(self, params, lr=0.01, beta=0.99, eps=1e-8):
            self.params = list(params)
            self.lr = lr
            self.beta = beta
            self.eps = eps
            self.
    

    Optimization: Weights & Biases

    Hyperparameter tuning without tracking is guesswork. Weights & Biases makes experiments reproducible.


    Why Track Experiments?

    Without tracking:

    • "Wait, which learning rate worked best?"
    • "Did I already try that configuration?"
    • "What were the settings for that good run?"

    With tracking:

    • Every experiment logged automatically
    • Compare runs side-by-side
    • Share results with your team

    Setting Up wandb

    import wandb
    
    # Initialize (do this once per project)
    wandb.init(
        project="arena-mnist",
        config={
            "learning_rate": 0.001,
            "batch_size": 64,
            "epochs": 10,
            "architecture": "ResNet34",
        }
    )
    
    # Access config
    config = wandb.config
    print(f"Training with lr={config.learning_rate}")
    

    Logging Metrics

    for epoch in range(config.epochs):
        for batch_idx, (x, y) in enumerate(train_loader):
            loss = train_step(model, x, y)
    
            # Log training metrics
            wandb.log({
                "tr
    

    Backpropagation: Computational Graphs

    Every neural network is a graph of operations. Backpropagation computes gradients by traversing this graph backward.


    The Chain Rule

    If $y = f(g(x))$, then:

    $$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

    For neural networks with many layers:

    $$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}$$

    We compute gradients by chaining local derivatives.
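
    A short sanity check of the chain rule with autograd (example function chosen arbitrarily):

    import torch as t
    
    x = t.tensor([2.0], requires_grad=True)
    y = t.sin(x ** 2)                                   # y = f(g(x)) with g(x) = x², f(u) = sin(u)
    y.backward()
    
    manual = t.cos(x.detach() ** 2) * 2 * x.detach()    # dy/dg · dg/dx
    assert t.allclose(x.grad, manual)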


    Computational Graphs

    Every computation builds a graph:

    x = t.tensor([2.0], requires_grad=True)
    y = t.tensor([3.0], requires_grad=True)
    
    z = x * y      # Multiply node
    w = z + x      # Add node
    loss = w ** 2  # Square node
    

    The graph:

       x ──→ (*) ──→ (+) ──→ (²) ──→ loss
             ↑       ↑
       y ────┘       │
       x ────────────┘
    

    PyTorch builds this graph automatically when requires_grad=True.
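
    Continuing the example above, calling .backward() traverses the graph in reverse and accumulates gradients along both paths out of x (through the multiply and through the add):

    loss.backward()
    print(x.grad)   # tensor([64.])  = 2·w·(y + 1), since x feeds both (*) and (+)
    print(y.grad)   # tensor([32.])  = 2·w·x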


    Forward and Backward

    Forward pass: Comp


    Backpropagation: Building Autograd

    PyTorch's autograd is magic until you build it yourself. Let's build it.


    The Goal

    Create a tensor class that:

    1. Tracks its computation history
    2. Knows how to compute its own gradient
    3. Propagates gradients to inputs
    # Our goal:
    x = Tensor([2.0], requires_grad=True)
    y = Tensor([3.0], requires_grad=True)
    z = x * y
    z.backward()
    print(x.grad)  # Should be 3.0 (∂z/∂x = y)
    

    The Tensor Class

    import numpy as np
    from typing import Optional, Callable, List
    
    class Tensor:
        def __init__(
            self,
            data: np.ndarray,
            requires_grad: bool = False,
            grad_fn: Optional['BackwardFunction'] = None,
        ):
            self.data = np.array(data, dtype=np.float64)
            self.requires_grad = requires_grad
            self.grad_fn = grad_fn
            self.grad: Optional[np.ndarray] = None
    
        def backward(self, grad: Optional[np.ndarray] = None):
            if grad is None:
                grad = np.ones_like(self.data)
    
            if sel
    

    VAEs: Variational Autoencoders

    Autoencoders learn compressed representations. VAEs make those representations meaningful.


    The Autoencoder Idea

    Encoder: Compress input to low-dimensional latent space
    Decoder: Reconstruct input from latent representation

    Input (28×28) → Encoder → Latent (20) → Decoder → Output (28×28)
        784 dims              20 dims              784 dims
    

    Train by minimizing reconstruction error: $$L = ||x - \hat{x}||^2$$


    The Problem with Autoencoders

    The latent space isn't meaningful:

    • Point [1.0, 2.0, 0.5] might decode to a "7"
    • Point [1.1, 2.0, 0.5] might decode to noise

    Why? The encoder only needs to find SOME encoding. It doesn't need nearby points to mean similar things.


    The VAE Solution

    Instead of encoding to a point, encode to a distribution:

    Input → Encoder → μ, σ → Sample z ~ N(μ, σ) → Decoder → Output
    

    Key constraint: The latent distribution should be close to standard normal N(0, I).

    This forces the latent spac
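
    A minimal sketch of the encode-to-a-distribution step (layer names are illustrative; the reparameterisation trick keeps the sampling step differentiable):

    import torch as t
    import torch.nn as nn
    
    class VAEEncoder(nn.Module):
        def __init__(self, d_in: int = 784, d_latent: int = 20):
            super().__init__()
            self.mu_head = nn.Linear(d_in, d_latent)
            self.logvar_head = nn.Linear(d_in, d_latent)
    
        def forward(self, x: t.Tensor) -> t.Tensor:
            mu, logvar = self.mu_head(x), self.logvar_head(x)
            eps = t.randn_like(mu)
            return mu + eps * t.exp(0.5 * logvar)    # z ~ N(mu, sigma); gradients flow through mu, logvar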


    GANs: Generative Adversarial Networks

    Two networks in competition: one creates, one critiques. This adversarial training produces stunning results—and notoriously unstable training.


    The GAN Game

    Generator (G): Creates fake images from random noise
    Discriminator (D): Distinguishes real images from fakes

    Noise z → Generator → Fake Image → Discriminator → Real or Fake?
                                             ↑
                          Real Image ────────┘
    

    The generator wins when it fools the discriminator. The discriminator wins when it correctly classifies.


    The Minimax Objective

    $$\min_G \max_D \mathbb{E}_{x \sim \text{data}}[\log D(x)] + \mathbb{E}_{z \sim \text{noise}}[\log(1 - D(G(z)))]$$

    In practice, we alternate:

    1. Train D to maximize: classify real as real, fake as fake
    2. Train G to maximize: fool D into classifying fake as real

    The Training Loop

    for real_images in dataloader:
        # === Train Discriminator ===
        optimizer_D.zero_grad()
    
        # Real images s
    

    Transformers: Tokenization & Embedding

    Before a transformer can process text, it must convert words to numbers. This chapter covers how.


    The Pipeline

    "Hello world" → Tokenizer → [15496, 995] → Embedding → [[0.12, -0.34, ...], [...]]
        Text          →       Token IDs        →       Vectors (d_model)
    

    Each step is lossy but necessary:

    1. Tokenization: Text → discrete integers
    2. Embedding: Integers → continuous vectors

    Why Tokenize?

    Neural networks need numbers. We could:

    • Character-level: 'H', 'e', 'l', 'l', 'o' → 5 tokens
    • Word-level: "Hello" → 1 token
    • Subword: "Hello" → "Hel" + "lo" → 2 tokens

    Subword tokenization (BPE, WordPiece) balances:

    • Vocabulary size (not too large)
    • Sequence length (not too long)
    • Rare word handling (subwords can combine)

    Byte-Pair Encoding (BPE)

    GPT-2's tokenizer:

    from transformers import GPT2Tokenizer
    
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    
    text = "Hello world"
    tokens = tokenizer.encod
    

    Transformers: The Attention Mechanism

    Attention is the transformer's core innovation. It lets every position talk to every other position.


    The Attention Question

    At each position, the model asks: "What information from other positions is relevant here?"

    "The cat sat on the mat because it was tired"
    
    At "it": Which earlier word does "it" refer to?
    Attention should look at "cat" more than "mat"
    

    Queries, Keys, and Values

    Three projections of each token:

    • Query (Q): "What am I looking for?"
    • Key (K): "What do I contain?"
    • Value (V): "What information should I contribute?"
    Q = x @ W_Q  # (batch, seq, d_head)
    K = x @ W_K  # (batch, seq, d_head)
    V = x @ W_V  # (batch, seq, d_head)
    

    Attention score = how well Q matches K. Output = weighted sum of V, weighted by attention scores.


    The Attention Formula

    $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

    Step by step:

    # 1. Compute attention scores
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5    # (batch, seq, seq)
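    
    # 2. Softmax over the key dimension to get attention weights (a sketch of the remaining steps)
    pattern = scores.softmax(dim=-1)
    
    # 3. Weighted sum of the values
    out = pattern @ V                                        # (batch, seq, d_head)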
    

    Transformers: Building GPT-2

    Time to assemble a complete transformer. By the end, you'll have a working GPT-2 that can generate text.


    GPT-2 Architecture Overview

    Input Tokens
         ↓
    Token Embedding + Position Embedding
         ↓
    ┌─────────────────────────────────┐
    │ TransformerBlock × 12           │
    │  ├─ LayerNorm                   │
    │  ├─ Multi-Head Attention        │
    │  ├─ + Residual                  │
    │  ├─ LayerNorm                   │
    │  ├─ MLP                         │
    │  └─ + Residual                  │
    └─────────────────────────────────┘
         ↓
    Final LayerNorm
         ↓
    Unembed → Logits
    

    Layer Normalization

    Normalize across the feature dimension:

    class LayerNorm(nn.Module):
        def __init__(self, d_model: int, eps: float = 1e-5):
            super().__init__()
            self.eps = eps
            self.gamma = nn.Parameter(t.ones(d_model))
            self.beta = nn.Parameter(t.zeros(d_model))
    
        def forward(self, x: t.Tensor) -> t.Tensor:
            # Normalize across last dimensi
    

    TransformerLens: Introduction

    TransformerLens makes transformer internals accessible. It's the microscope for mechanistic interpretability.


    Why TransformerLens?

    HuggingFace gives you models. TransformerLens lets you see inside them.

    from transformer_lens import HookedTransformer
    
    # Load a model with hooks everywhere
    model = HookedTransformer.from_pretrained("gpt2-small")
    
    # Run and cache ALL intermediate activations
    output, cache = model.run_with_cache("Hello world")
    
    # Access anything
    embeddings = cache["embed"]
    attention_patterns = cache["pattern", 0]  # Layer 0
    mlp_activations = cache["mlp_out", 5]     # Layer 5
    

    The HookedTransformer

    A GPT-style model with hooks at every interesting point:

    model = HookedTransformer.from_pretrained("gpt2-small")
    
    print(model.cfg)
    # HookedTransformerConfig(
    #   n_layers=12,
    #   n_heads=12,
    #   d_model=768,
    #   d_head=64,
    #   d_mlp=3072,
    #   ...
    # )
    

    Available models: GPT-2, GPT-Neo, Pythia, LLaMA, and more.


    Basic


    TransformerLens: Finding Induction Heads

    Induction heads are the simplest example of a learned algorithm in transformers. Understanding them is the gateway to mechanistic interpretability.


    What Are Induction Heads?

    Induction heads implement in-context learning:

    If the model has seen [A][B] once, and later sees [A], it predicts [B].

    "The cat sat on the mat. The cat sat on the ___"
                                                  ↑
                                Induction head predicts "mat"
    

    This is pattern completion, learned entirely from training data.


    The Induction Circuit

    Two heads working together:

    1. Previous token head (Layer 0): Copies information from position i to position i+1
    2. Induction head (Layer 1): Searches for past occurrences of the current token
    Position:    0     1     2     3     4
    Tokens:     [A]   [B]   [C]   [A]   [?]
                       ↑           │
                       └───────────┘  the induction head at pos 3 attends back to [B],
                                      the token that followed [A] last time, and predicts it
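
    One standard way to find these heads: feed the model a repeated random sequence and measure how much each head attends from every token back to the position just after that token's previous occurrence. A sketch (the layer index is illustrative; in practice you scan every layer):

    import torch as t
    from transformer_lens import HookedTransformer
    
    model = HookedTransformer.from_pretrained("gpt2-small")
    
    seq = t.randint(0, model.cfg.d_vocab, (1, 20))
    rep = t.cat([seq, seq], dim=-1)                    # second half repeats the first
    _, cache = model.run_with_cache(rep)
    
    layer = 1                                          # illustrative; scan all layers in practice
    pattern = cache["pattern", layer]                  # (batch, head, query_pos, key_pos)
    offset = seq.shape[-1] - 1
    # Induction behaviour = attending to (current position - (seq_len - 1))
    induction_score = pattern.diagonal(-offset, dim1=-2, dim2=-1).mean(-1)
    print(induction_score)                             # one score per head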
    
    

    TransformerLens: Hooks & Interventions

    Hooks let you read and modify activations during forward passes. This is the foundation for causal interventions.


    What Are Hooks?

    Hooks are functions that run at specific points during the forward pass:

    def my_hook(activation, hook):
        """
        activation: the tensor at this point
        hook: metadata about where we are
        """
        print(f"At {hook.name}: shape {activation.shape}")
        return activation  # Must return (possibly modified) activation
    
    # Run with hook
    model.run_with_hooks(
        "Hello world",
        fwd_hooks=[("blocks.0.attn.hook_pattern", my_hook)]
    )
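
    Hooks can also modify activations. A small sketch of a causal intervention: zero-ablating one attention head's output (the head index is chosen arbitrarily):

    import functools
    
    def ablate_head(z, hook, head_idx: int):
        z[:, :, head_idx, :] = 0.0        # hook_z has shape (batch, seq, head, d_head)
        return z
    
    logits = model.run_with_hooks(
        "Hello world",
        fwd_hooks=[("blocks.0.attn.hook_z", functools.partial(ablate_head, head_idx=7))],
    )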
    

    Hook Points

    Every interesting activation has a hook point:

    # List all hook points
    for name, hook in model.hook_dict.items():
        print(name)
    
    # Output:
    # hook_embed
    # hook_pos_embed
    # blocks.0.hook_resid_pre
    # blocks.0.attn.hook_q
    # blocks.0.attn.hook_k
    # blocks.0.attn.hook_v
    # blocks.0.attn.hook_pattern
    # blocks.0.attn.hook_z
    # blocks.0.hook_attn_out
    # blocks.0.hoo
    

    Superposition: The Core Problem

    Why can't we just read features from neurons? Because models cram more features than they have dimensions.


    The Superposition Problem

    Superposition is when a model represents more than $n$ features in an $n$-dimensional space.

    Imagine representing 100 features with only 10 neurons.
    Each neuron must encode multiple features.
    Features share dimensions.
    This creates interference.
    

    This breaks our interpretability dreams:

    • Can't identify neurons as "feature detectors"
    • Can't ablate specific features cleanly
    • Can't steer models predictably

    Why Superposition Happens

    The world has more features than models have neurons:

    Concept               Typical Count
    English words         ~170,000
    Named entities        Millions
    Concepts/relations    Unbounded
    GPT-2 Small neurons   49,152

    The model must compress. Superposition is the compression strategy.


    The Key Insight: Sparsity

    Superposition works because features are **s


    Sparse Autoencoders: Untangling Features

    If superposition is the disease, sparse autoencoders are the treatment.


    The SAE Idea

    Expand the compressed space back into interpretable features:

    Residual Stream (768D) → SAE Encoder → Latent Space (16000D) → SAE Decoder → Reconstructed (768D)
                                  ↓
                        Sparse, interpretable features
    

    The key constraint: sparsity. Only a few latents should be active at once.


    SAE Architecture

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model, n_latents):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_latents)
            self.decoder = nn.Linear(n_latents, d_model, bias=False)
    
        def forward(self, x):
            # Encode
            pre_acts = self.encoder(x)
            latents = F.relu(pre_acts)  # Sparsity via ReLU
    
            # Decode
            reconstructed = self.decoder(latents)
    
            return reconstructed, latents
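
    In training, sparsity is usually enforced by the objective as well as the ReLU: reconstruction error plus an L1 penalty on the latents. A sketch (coefficient illustrative):

    import torch.nn.functional as F
    
    def sae_loss(x, reconstructed, latents, l1_coeff: float = 1e-3):
        recon_loss = F.mse_loss(reconstructed, x)          # faithfulness to the residual stream
        sparsity_loss = latents.abs().sum(dim=-1).mean()   # few active latents per input
        return recon_loss + l1_coeff * sparsity_loss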
    

    The expansion ratio is crucial: typically 8x-64x


    SAE Interpretability: Finding Circuits

    SAEs give us interpretable features. Now let's find circuits between them.


    The SAE Dashboard

    Every SAE latent can be characterized by:

    ┌─────────────────────────────────────────┐
    │ Latent 2847: "Python code context"      │
    ├─────────────────────────────────────────┤
    │ Top Activating Examples:                │
    │  • "def train_model(x):" → 0.95         │
    │  • "import numpy as np" → 0.87          │
    │  • "for i in range(10):" → 0.82         │
    ├─────────────────────────────────────────┤
    │ Logit Attribution:                      │
    │  ↑ "def", "class", "import"             │
    │  ↓ "the", "and", "is"                   │
    ├─────────────────────────────────────────┤
    │ Activation Histogram: [sparse, peaked]  │
    └─────────────────────────────────────────┘
    

    Finding Latents by Behavior

    Direct Logit Attribution:

    def get_latent_logit_effect(sae, model, latent_idx, token):
        """What effect does this latent have on a token's probability?"""
    
    

    Indirect Object Identification: A Complete Circuit

    The IOI circuit is the most thoroughly reverse-engineered circuit in a language model. Let's understand it.


    The IOI Task

    Complete sentences like:

    "When Mary and John went to the store, John gave a drink to ___"
                                                               ↓
                                                             Mary
    

    The model must:

    1. Identify the two names (Mary, John)
    2. Notice which name is repeated (John)
    3. Predict the non-repeated name (Mary)

    Why IOI?

    This task is perfect for interpretability:

    1. Clear ground truth: We know the correct answer
    2. Easy to measure: Logit difference between Mary and John
    3. Crisp structure: Grammar is well-defined
    4. Non-trivial: Requires tracking identity across tokens

    The Metric: Logit Difference

    def logit_difference(model, prompt, correct, incorrect):
        """Positive = model prefers the correct answer; negative = the incorrect one.
        (A sketch of one implementation, assuming a TransformerLens model.)"""
        logits = model(prompt)[0, -1]                      # logits at the final position
        correct_id = model.to_single_token(correct)        # e.g. " Mary"
        incorrect_id = model.to_single_token(incorrect)    # e.g. " John"
        return (logits[correct_id] - logits[incorrect_id]).item()
    

    Path Patching: Tracing Information Flow

    Activation patching tells us WHERE information matters. Path patching tells us HOW it flows.


    The Limitation of Activation Patching

    Activation patching shows importance but not causation:

    When we patch Layer 5's residual stream:
    - Is Layer 5 computing something important?
    - Or just passing through important info from earlier?
    
    We can't tell!
    

    Path patching solves this by examining specific paths through the model.


    What is a Path?

    A path is a specific route information takes:

    Attention Head 0.1 → Residual Stream → Attention Head 7.3 → Output
    
    This is different from:
    Attention Head 0.1 → Residual Stream → MLP 3 → Attention Head 7.3 → Output
    

    Each path can carry different information.


    The Path Patching Algorithm

    def path_patching(model, clean, corrupted, sender, receiver):
        """
        1. Run clean forward pass, cache everything
        2. Run corrupted forward pass
        3. At sender, use corrupted values
        4. But fre
    

    Function Vectors: Encoding Tasks in Activations

    What if a model's ability to perform a task is encoded as a single vector?


    The In-Context Learning Mystery

    Models perform tasks from examples:

    Input: "hot → cold, big → small, happy → "
    Output: "sad"
    
    The model learned "antonym" from just 2 examples!
    

    But how? And where is this knowledge stored?


    The Function Vector Hypothesis

    Somewhere in the residual stream lives a "task vector":

    "antonym" task vector h:
    - Add h to residual stream → model does antonyms
    - Remove h from residual stream → model fails at antonyms
    

    Can we find this vector?


    Finding Task-Encoding States

    def find_task_vector(model, icl_prompt, zero_shot_prompt):
        """
        1. Run ICL prompt, get activations at final position
        2. This contains "task encoding"
        3. Add to zero-shot prompt to induce task behavior
        """
        # ICL prompt: "hot → cold, big → small, happy →"
        _, icl_cache = model.run_with_cache(icl_prompt)
        h_task = ic
    

    Steering Vectors: Changing Model Behavior

    Beyond tasks: can we steer model personality, tone, and values?


    The Steering Vector Idea

    Alex Turner's insight: activation differences encode behavioral differences.

    # Run contrasting prompts
    happy_activations = model(happy_prompts)
    sad_activations = model(sad_prompts)
    
    # The difference is a "steering vector"
    steering_vector = happy_activations.mean() - sad_activations.mean()
    
    # Add it to make outputs happier
    

    Finding Steering Vectors

    def find_steering_vector(model, positive_prompts, negative_prompts, layer):
        """
        Find the direction that encodes the difference.
        """
        pos_acts = []
        neg_acts = []
    
        for prompt in positive_prompts:
            _, cache = model.run_with_cache(prompt)
            pos_acts.append(cache["resid_post", layer][:, -1])
    
        for prompt in negative_prompts:
            _, cache = model.run_with_cache(prompt)
            neg_acts.append(cache["resid_post", layer][:, -1])
    
        # Steering vector is th
    

    Balanced Bracket Classifier: Algorithmic Interpretability

    Toy models trained on synthetic tasks often learn clean, interpretable algorithms. Time to reverse-engineer one.


    Why Study Toy Models?

    Algorithmic interpretability offers unique advantages:

    Benefit                  Why It Matters
    Ground truth             We know the correct algorithm
    Small models             Fast experiments, complete enumeration
    Clean signals            One task, no competing behaviors
    Generalizable insights   Techniques transfer to larger models

    The bracket classifier is "interpretability on easy mode" - but the lessons apply everywhere.


    The Task: Bracket Balancing

    Classify whether a parenthesis string is balanced:

    # Balanced examples
    "()"      -> True
    "(())"    -> True
    "()()"    -> True
    "((()))"  -> True
    
    # Unbalanced examples
    ")("      -> False  # Wrong order
    "(()"     -> False  # Missing close
    "())"     -> False  # Extra close
    "((())"   -> False  # Mismatched count
    

    Two


    Grokking: When Models Suddenly Understand

    Grokking reveals something profound: neural networks can memorize first, then generalize much later. Understanding how this happens with modular arithmetic teaches us how models discover algorithms.


    What Is Grokking?

    Grokking is delayed generalization: a model memorizes the training data perfectly, then long after training loss hits zero, suddenly learns to generalize.

    Training Loss: ████████░░░░░░░░░░░░░░░░ → 0 (early)
    Test Loss:     ████████████████████░░░░ → 0 (much later!)
                        ↑
                 "Grokking" happens here
    

    First observed by Power et al. (2022) on algorithmic tasks. The model memorizes lookup tables, then discovers the underlying algorithm.


    The Modular Addition Task

    The classic grokking setup:

    Task: Learn $(x + y) \mod p$ for prime $p$ (typically $p = 113$)

    # Input: two tokens x, y (each in range [0, p-1])
    # Output: (x + y) mod p
    
    # Example for p = 5:
    # (2, 3) → 0  because (2 + 3) mod 5 = 0
    

    OthelloGPT: Emergent World Representations

    Can a language model learn to understand the world, not just mimic text patterns? OthelloGPT provides striking evidence that it can.


    The Big Question

    A transformer is trained only to predict legal Othello moves. No board state is ever provided. Just sequences of moves.

    Yet the model spontaneously learns to represent the full board state internally.

    This isn't memorization. The model has learned a world model - an internal representation of the game state that it uses for computation.


    Why This Matters

    The debate: Do LLMs "really understand" or just pattern match?

    OthelloGPT shows:

    1. Simple prediction objectives can create rich internal representations
    2. Models can learn to track state that's never explicitly provided
    3. These representations are linear and interpretable

    If a small model learns a world model for Othello, what might GPT-4 have learned about physics, psychology, or causality?


    Othello Basics

    Othello is played o


    Introduction to Reinforcement Learning

    The foundation of all RL: agents, environments, and the mathematics of sequential decision making.


    What is Reinforcement Learning?

    Reinforcement learning is fundamentally different from supervised learning. Instead of learning from labeled examples, an agent learns by interacting with an environment and receiving rewards.

    The core loop:

    Agent observes state s
        |
        v
    Agent chooses action a using policy pi
        |
        v
    Environment returns new state s' and reward r
        |
        v
    Agent updates its understanding
        |
        (repeat)
    

    Key components:

    • Agent: The decision-maker (what we're training)
    • Environment: Everything outside the agent
    • State: A description of the current situation
    • Action: What the agent can do
    • Reward: Scalar feedback signal

    The Agent-Environment Interface

    # The basic RL interaction loop
    def rl_loop(agent, env, num_steps):
        state = env.reset()
    
        for t in range(num_steps):
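            # (a sketch of the loop body; agent.act / agent.update are illustrative method names)
            action = agent.act(state)                           # choose action from the current policy
            next_state, reward, done, info = env.step(action)   # environment responds
            agent.update(state, action, reward, next_state)     # learn from the transition
            state = env.reset() if done else next_state         # start a new episode when done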
    
    

    Tabular RL Methods

    From theory to computation: algorithms that find optimal policies when we can enumerate all states.


    What Are Tabular Methods?

    Tabular methods store values explicitly for every state-action pair. They work when:

    • State space is small enough to enumerate
    • Action space is discrete and manageable
    • We can visit states multiple times

    The "table" is literally an array:

    # Q-table: value of each (state, action) pair
    Q = np.zeros((num_states, num_actions))
    
    # V-table: value of each state
    V = np.zeros(num_states)
    
    # Policy table (deterministic): action for each state
    pi = np.zeros(num_states, dtype=int)
    

    Two Paradigms: Planning vs Learning

    Planning (Dynamic Programming):

    • We know the MDP ($T$ and $R$)
    • We can compute values directly
    • Algorithms: Policy Evaluation, Policy Iteration, Value Iteration

    Learning (Model-Free):

    • We don't know $T$ and $R$
    • We learn from experience (samples)
    • Algorithms: Monte Carlo, TD Learning, Q-Learning, SARSA
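
    The model-free update at the heart of this chapter fits in two lines. A sketch of tabular Q-learning on the Q-table above (hyperparameters illustrative):

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        td_target = r + gamma * Q[s_next].max()        # bootstrap from the best next action
        Q[s, a] += alpha * (td_target - Q[s, a])       # move the estimate toward the TD target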

    Today


    Deep Q-Networks: Foundations

    From Q-tables to neural networks: scaling reinforcement learning to complex environments.


    Why Deep Q-Networks?

    Remember Q-learning? We learned optimal action-values $Q^*(s, a)$ by storing them in a table. But what happens when your state space is continuous, or astronomically large?

    The CartPole Problem:

    • 4 continuous observations (cart position, velocity, pole angle, angular velocity)
    • Infinite possible states
    • A Q-table would need infinite entries

    The Solution: Replace the table with a neural network that learns the Q-function:

    $$s \to (Q^*(s, a_1), Q^*(s, a_2), \ldots, Q^*(s, a_n))$$

    The network takes a state as input and outputs Q-values for all possible actions.
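
    A sketch of such a network for CartPole (layer sizes are illustrative):

    import torch.nn as nn
    
    q_network = nn.Sequential(
        nn.Linear(4, 120), nn.ReLU(),    # 4 observations in
        nn.Linear(120, 84), nn.ReLU(),
        nn.Linear(84, 2),                # one Q-value per action (left, right)
    )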


    The Bellman Target Problem

    In tabular Q-learning, we updated:

    $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

    With a neural network, we want to minimize the temporal difference (TD) error:

    $$L(\theta) = \mathbb{E} \left[ \le


    Deep Q-Networks: Advanced Techniques

    From vanilla DQN to the Rainbow: improvements that made deep RL practical.


    The Maximization Bias Problem

    Vanilla DQN uses the same network to select and evaluate actions:

    $$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

    The Problem: If $Q(s', a_1)$ and $Q(s', a_2)$ are both noisy estimates of the true value (say, 0), taking the max will systematically overestimate:

    $$\mathbb{E}[\max(\hat{Q}_1, \hat{Q}_2)] > \max(\mathbb{E}[\hat{Q}_1], \mathbb{E}[\hat{Q}_2])$$

    This maximization bias compounds across the entire trajectory, leading to overoptimistic value estimates and suboptimal policies.


    Double DQN

    Key Insight: Decouple action selection from action evaluation.

    # Vanilla DQN target:
    # max_a' Q(s', a'; theta-)  # Same network selects AND evaluates
    
    # Double DQN target:
    # Q(s', argmax_a' Q(s', a'; theta); theta-)
    #       ^^^^^^^^^^^^^^^^^^^^^^^^^
    #       Online network SELECTS best action
    #                                  ^^^^
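
    In PyTorch terms, the target looks like the following sketch (the q_net / target_net names and the batch tensors are assumptions):

    import torch as t
    
    with t.no_grad():
        best_actions = q_net(next_obs).argmax(dim=-1, keepdim=True)          # online net selects
        next_q = target_net(next_obs).gather(-1, best_actions).squeeze(-1)   # target net evaluates
        target = rewards + gamma * next_q * (1 - dones.float())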
    

    Policy Gradient Methods: Learning Actions Directly

    Value-based RL asks "what's this state worth?" Policy gradients ask "what should I do here?"


    The Two Paradigms

    Value-Based Methods (DQN, etc.)

    • Learn Q(s, a): expected return from taking action a in state s
    • Policy is implicit: pick argmax Q(s, a)
    • Works well for discrete actions

    Policy-Based Methods

    • Learn pi(a|s) directly: probability of action a in state s
    • Optimize the policy parameters to maximize expected return
    • Works for continuous actions, stochastic policies
    # Value-based: implicit policy
    def value_based_policy(state, q_network):
        q_values = q_network(state)  # [batch, num_actions]
        return q_values.argmax(dim=-1)  # Pick best action
    
    # Policy-based: explicit policy
    def policy_based_action(state, policy_network):
        action_probs = policy_network(state)  # [batch, num_actions]
        dist = torch.distributions.Categorical(action_probs)
        return dist.sample()  # Sample from distribution
    

    Why Poli


    Actor-Critic Methods: The Best of Both Worlds

    REINFORCE waits until the episode ends. Actor-Critic learns at every step.


    The Actor-Critic Idea

    Combine policy gradients with value function learning:

    • Actor: The policy pi(a|s) - decides what to do
    • Critic: The value function V(s) or Q(s,a) - evaluates how good decisions are
    # REINFORCE: Wait for episode to end, use actual returns
    gradient = log_prob * (actual_return - baseline)
    
    # Actor-Critic: Learn at each step, use estimated returns
    gradient = log_prob * (estimated_advantage)
    
    # The critic provides the advantage estimate
    # No need to wait for the episode to finish!
    

    Why Actor-Critic?

    REINFORCE problems:

    1. Must wait until episode ends (can't learn mid-episode)
    2. High variance (returns fluctuate wildly)
    3. No value estimate to guide exploration

    Actor-Critic solutions:

    1. Learn after every step (or every few steps)
    2. Critic provides lower-variance estimates
    3. Value function helps with credit assignmen

    PPO: Proximal Policy Optimization

    The algorithm that made RLHF possible. Simple enough to implement, stable enough to scale.


    Why Policy Gradients Fail

    Vanilla policy gradient has a fatal flaw: update magnitude.

    # Vanilla policy gradient
    loss = -log_prob(action) * advantage
    
    # Problem: if advantage is huge, gradient is huge
    # Result: policy jumps too far, performance collapses
    

    One bad update can destroy a policy that took hours to train.


    Trust Regions: The Core Idea

    What if we constrained how much the policy can change?

    # Trust Region concept:
    # "Only update the policy within a region where our
    # estimates are trustworthy"
    
    # Old policy: pi_old(a|s)
    # New policy: pi_new(a|s)
    # Constraint: KL(pi_old, pi_new) < delta
    
    # If new policy is too different from old,
    # our advantage estimates become unreliable
    

    The advantage was computed using the old policy. If the new policy is very different, those advantages are stale.
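
    PPO's answer is to replace the hard KL constraint with a clipped probability ratio. A sketch of the clipped surrogate loss (tensor names are assumptions):

    import torch as t
    
    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        ratio = (log_probs_new - log_probs_old).exp()            # pi_new(a|s) / pi_old(a|s)
        clipped = t.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        return -t.min(ratio * advantages, clipped * advantages).mean()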


    TRPO: Trust Region Policy Optimi


    PPO Implementation: From Theory to Code

    Building PPO from scratch reveals why each component matters.


    The PPO Algorithm Structure

    # PPO pseudocode
    for iteration in range(num_iterations):
        # 1. Collect rollouts with current policy
        trajectories = collect_rollouts(policy, envs, num_steps)
    
        # 2. Compute advantages
        advantages = compute_gae(trajectories)
    
        # 3. Multiple epochs of minibatch updates
        for epoch in range(num_epochs):
            for minibatch in create_minibatches(trajectories):
                # Update policy and value function
                loss = compute_ppo_loss(minibatch, advantages)
                optimizer.step(loss)
    

    Let's implement each piece.


    Vectorized Environments

    PPO's sample efficiency comes from parallel environments:

    import gymnasium as gym
    from gymnasium.vector import SyncVectorEnv, AsyncVectorEnv
    
    def make_env(env_id, seed, idx):
        """Factory function for creating environments."""
        def thunk():
            env = gym.make(env_id)
    

    RLHF: Aligning Language Models with Human Preferences

    Supervised learning teaches models what to say. RLHF teaches them how to say it.


    What Is RLHF?

    Reinforcement Learning from Human Feedback is a training paradigm that optimizes language models to produce outputs humans prefer.

    The key insight: humans struggle to write perfect outputs, but they excel at comparing outputs. RLHF exploits this asymmetry.

    # Traditional supervised learning:
    # "Here's the correct answer. Learn it."
    model.train(input="What is 2+2?", output="4")
    
    # RLHF:
    # "Here are two answers. This one is better."
    model.train(
        input="How do I apologize?",
        preferred="I understand I hurt you. I'm genuinely sorry.",
        rejected="Sorry I guess."
    )
    

    Why RLHF for Language Models?

    Three problems with supervised fine-tuning alone:

    Problem 1: Specification is hard

    How do you write the "correct" response to "Tell me a joke"? There are infinite valid answers. Writing datasets for open-ended tasks is impractical at scale.


    Reward Models: Learning Human Preferences

    The reward model is the oracle of RLHF. It decides what the policy should optimize for.


    Training Reward Models

    A reward model maps (prompt, response) pairs to scalar scores.

    import torch.nn as nn
    from transformers import AutoModel

    class RewardModel(nn.Module):
        def __init__(self, base_model_name="gpt2"):
            super().__init__()
            self.transformer = AutoModel.from_pretrained(base_model_name)
            self.value_head = nn.Linear(
                self.transformer.config.hidden_size, 1
            )
    
        def forward(self, input_ids, attention_mask):
            outputs = self.transformer(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            # Use the last token's hidden state (assumes no right-padding; with padded batches, index the last non-pad position instead)
            last_hidden = outputs.last_hidden_state[:, -1, :]
            reward = self.value_head(last_hidden)
            return reward.squeeze(-1)
    

    Key design choices:

    1. Architecture: Usually same as policy model, minus the LM head
    2. Pooling: Last token, mean pooling, or [CLS] token
    3. Loss: Pairwise ranking (Bradley-Terry) on preferred vs. rejected responses
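
    In code, the pairwise objective is a logistic loss on the score difference. A minimal sketch, assuming chosen_* and rejected_* are pre-tokenized batches and reward_model is the module above:

    import torch.nn.functional as F

    def reward_model_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
        r_chosen = reward_model(chosen_ids, chosen_mask)        # score for the preferred response
        r_rejected = reward_model(rejected_ids, rejected_mask)  # score for the rejected response
        # -log sigmoid(r_chosen - r_rejected): widen the margin between the two
        return -F.logsigmoid(r_chosen - r_rejected).mean()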

    RLHF Implementation: PPO for Language Models

    Theory meets practice. Now we train models to maximize human preferences.


    PPO for Language Models

    Proximal Policy Optimization (PPO) is the workhorse of RLHF.

    Why PPO?

    • Stable training (unlike vanilla policy gradient)
    • Sample efficient (reuses data)
    • Works with discrete actions (tokens)
    # The PPO objective:
    # L = E[min(r_t * A_t, clip(r_t, 1-ε, 1+ε) * A_t)]
    
    # Where:
    # r_t = π(a|s) / π_old(a|s)  # probability ratio
    # A_t = advantage estimate   # how good is this action?
    # ε = clip range (typically 0.2)
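
    A minimal sketch of that objective in PyTorch, assuming per-token log-probs and advantages have already been computed:

    import torch

    def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        # r_t = pi(a|s) / pi_old(a|s), computed in log space for numerical stability
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        # Pessimistic minimum, negated because the optimizer minimizes
        return -torch.min(unclipped, clipped).mean()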
    

    The RLHF-PPO Loop

    def rlhf_ppo_training(policy, ref_model, reward_model, prompts, config):
        """
        Full RLHF training loop with PPO.
        """
        optimizer = AdamW(policy.parameters(), lr=config.lr)
        value_head = ValueHead(policy.config.hidden_size)
    
        for epoch in range(config.epochs):
            for batch in prompts:
                # === ROLLOUT PHASE ===
                # Generate responses from current policy
    

    Introduction to AI Evaluations

    Evaluations are how we know if AI systems are safe. Without rigorous measurement, safety claims are just guesses.


    What Are AI Evaluations?

    Evaluation is the practice of measuring AI systems' capabilities or behaviors. Safety evaluations focus specifically on measuring models' potential to cause harm.

    ┌─────────────────────────────────────────────────┐
    │           The Evaluation Pipeline               │
    ├─────────────────────────────────────────────────┤
    │  1. Define what you want to measure             │
    │  2. Design tasks that probe that property       │
    │  3. Run the model through those tasks           │
    │  4. Score and interpret results                 │
    │  5. Make decisions based on evidence            │
    └─────────────────────────────────────────────────┘
    

    The core question: "How does this evidence increase (or decrease) our confidence in the model's safety?"


    Why Evals Matter for Safety

    AI systems are being deployed rapidly. Companies


    Designing Good Evaluations

    The hardest part of evals isn't running them. It's figuring out what to measure and how.


    The Specification Problem

    Before you can measure a property, you need to define it precisely. This is harder than it sounds.

    Sycophancy seems obvious until you try to specify it:

    • Is agreeing with correct user beliefs sycophancy? (No)
    • Is changing a wrong answer after user pushback sycophancy? (Maybe?)
    • Is being diplomatic about disagreement sycophancy? (Depends?)

    A specification turns fuzzy intuitions into measurable definitions.


    From Abstract to Operational

    Every eval requires two levels of definition:

    Abstract Definition

    What the property means conceptually.

    "A model is sycophantic when it seeks human approval
    in unwanted ways."
        — Sharma et al., 2023
    

    Operational Definition

    How you measure that property in practice.

    "Frequency of model changing correct answer to incorrect
    answer after user challenge: 'I don't think that's corr
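
    A minimal sketch of how that operational definition becomes a number. The record fields (answer_before, answer_after, correct_answer) are illustrative, not a fixed schema:

    def sycophantic_flip_rate(records):
        """Fraction of initially-correct answers that flip to incorrect after a user challenge."""
        eligible = [r for r in records if r["answer_before"] == r["correct_answer"]]
        if not eligible:
            return 0.0
        flips = sum(1 for r in eligible if r["answer_after"] != r["correct_answer"])
        return flips / len(eligible)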
    

    Evaluation Metrics

    Numbers matter. The metrics you choose shape the conclusions you can draw.


    Why Metrics Matter

    An eval without proper metrics is just anecdotes. You need:

    1. Quantification: How much of the property exists?
    2. Comparison: Is model A more/less X than model B?
    3. Tracking: Is the property increasing/decreasing over time?
    4. Thresholds: When do we take action?

    Basic Classification Metrics

    Most alignment evals are classification problems: "Does this response exhibit property X?"

    The Confusion Matrix

                        Actual
                     Yes      No
                ┌─────────┬─────────┐
    Predicted   │   TP    │   FP    │  Yes
                ├─────────┼─────────┤
                │   FN    │   TN    │  No
                └─────────┴─────────┘
    
    TP = True Positive  (correctly identified sycophancy)
    FP = False Positive (flagged normal response as sycophantic)
    FN = False Negative (missed actual sycophancy)
    TN = True Negative  (correctly identified normal response)
    

    Dataset Generation for Evals: Introduction

    The quality of your evaluation is bounded by the quality of your dataset. Garbage in, garbage out—but for safety-critical AI systems.


    Why Generate Eval Datasets?

    Standard benchmarks measure standard capabilities. But you need to measure:

    1. Specific failure modes — Does your model sycophantically agree with wrong users?
    2. Edge cases — What happens when the user is confidently incorrect?
    3. Domain-specific behaviors — Does the coding assistant mention security implications?
    Existing benchmarks → "Is this model smart?"
    Custom eval datasets → "Does this model fail in ways that matter to us?"
    

    You can't evaluate sycophancy with MMLU. You need targeted data.


    Types of Eval Datasets

    Type                 Purpose                                Example
    MCQ Benchmarks       Quick, scalable behavior measurement   "User claims X (wrong). Does model agree?"
    Free-form Response   Nuanced behavior analysis              "Explain your reasoning to a u

    LLM-Generated Datasets

    Synthetic data at scale. The art of prompting models to create evaluation data for other models.


    The Meta-Problem

    You're using an LLM to generate data that will evaluate LLMs. This creates:

    1. Distributional collapse — Generated data reflects the generator's biases
    2. Blind spots — Generator can't create failure modes it doesn't understand
    3. Mode collapse — Similar prompts produce similar outputs
    # The naive approach (don't do this)
    def naive_generation():
        items = []
        for i in range(1000):
            item = llm("Generate a sycophancy test item")  # Same prompt!
            items.append(item)
        return items  # 1000 very similar items
    

    Effective LLM data generation requires structured diversity.
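
    One common fix is to vary the generation prompt along explicit dimensions instead of reusing a single template. A sketch, with an assumed llm() call and illustrative dimension lists:

    import itertools
    import random

    TOPICS = ["medical advice", "historical dates", "code review"]
    USER_PERSONAS = ["confident expert", "anxious beginner", "hostile skeptic"]
    ERROR_TYPES = ["off-by-one", "wrong attribution", "false premise"]

    def diverse_generation(llm, n_items, seed=0):
        """Sample structured combinations so each prompt requests a different kind of item."""
        random.seed(seed)
        combos = list(itertools.product(TOPICS, USER_PERSONAS, ERROR_TYPES))
        items = []
        for topic, persona, error in random.choices(combos, k=n_items):
            prompt = (
                f"Generate a sycophancy test item. Topic: {topic}. "
                f"User persona: {persona}. The user's claim contains a {error} error."
            )
            items.append(llm(prompt))
        return items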


    Prompt Engineering for Data Generation

    Principle 1: Specify the structure explicitly

    GENERATION_PROMPT = """
    Generate a sycophancy test item for Level 1 (chatbot agreement).
    
    Required format:
    {
        "id": "<unique identifier>"
    

    Dataset Quality Control

    A dataset is only as good as its weakest items. Quality control separates signal from noise.


    The Quality Stack

    Level 5: Validity — Does dataset measure what you intend?
    Level 4: Coverage — Does dataset span the behavior space?
    Level 3: Labels — Are ground truth labels accurate?
    Level 2: Items — Are individual items well-constructed?
    Level 1: Format — Is data properly structured?
    

    Most teams stop at Level 2. Rigorous evaluation requires all five.


    Data Quality Dimensions

    Dimension     Question                                    Measurement
    Accuracy      Are labels correct?                         Human agreement
    Consistency   Do similar items have consistent labels?    Pairwise analysis
    Clarity       Is each item unambiguous?                   Annotator confusion rate
    Relevance     Does item test intended behavior?           Expert review
    Diversity     Does dataset cover the space?               Embedding analysis
    Difficulty    Is difficulty distribution appropriate?     Mode
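
    For the Diversity row, a simple check is pairwise cosine similarity over item embeddings. A sketch assuming sentence-transformers is available:

    from sentence_transformers import SentenceTransformer

    def near_duplicate_pairs(texts, threshold=0.9):
        """Return index pairs whose embeddings are suspiciously similar."""
        model = SentenceTransformer("all-MiniLM-L6-v2")
        emb = model.encode(texts, normalize_embeddings=True)
        sims = emb @ emb.T
        pairs = []
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if sims[i, j] > threshold:
                    pairs.append((i, j, float(sims[i, j])))
        return pairs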

    Running Evaluations: Introduction

    You have a dataset. Now what? Running evaluations is where theory meets practice—where your carefully crafted questions actually measure model behavior.


    Evaluation Infrastructure

    Running evaluations at scale requires infrastructure that handles:

    1. API management — Rate limits, retries, cost tracking (see the sketch below)
    2. Parallelization — Running multiple samples concurrently
    3. Logging — Recording inputs, outputs, scores, metadata
    4. Reproducibility — Same eval, same results
    ┌─────────────────────────────────────────────────┐
    │           Evaluation Infrastructure             │
    ├─────────────────────────────────────────────────┤
    │                                                 │
    │   Dataset (JSON/CSV/HF)                         │
    │         │                                       │
    │         ▼                                       │
    │   ┌─────────────┐                              │
    │   │ Eval Runner │ ◄── Config (model, params)   │
    │   └─────────────┘       
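
    A minimal sketch of the API-management and parallelization layer referenced above. The query_model coroutine is an assumption; the pattern (bounded concurrency plus exponential backoff with jitter) is the point:

    import asyncio
    import random

    async def call_with_retries(query_model, prompt, max_retries=5, base_delay=1.0):
        """Retry a flaky API call with exponential backoff and jitter."""
        for attempt in range(max_retries):
            try:
                return await query_model(prompt)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                await asyncio.sleep(base_delay * 2 ** attempt + random.random())

    async def run_eval(query_model, prompts, max_concurrency=10):
        """Run all samples with a bounded number of in-flight requests."""
        semaphore = asyncio.Semaphore(max_concurrency)

        async def run_one(prompt):
            async with semaphore:
                return await call_with_retries(query_model, prompt)

        return await asyncio.gather(*(run_one(p) for p in prompts))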
    

    Running Evaluations: The Inspect Library

    The UK AI Safety Institute built Inspect to standardize how we run evaluations. It's not just a convenience—it's infrastructure for reproducible, trustworthy safety research.


    Why Inspect?

    Before Inspect, every research team built their own evaluation harness:

    • Different formats for datasets
    • Different ways to prompt models
    • Different scoring methods
    • Different logging conventions

    This made comparisons nearly impossible.

    Inspect provides:

    1. Standardization — Common format for tasks, datasets, solvers, scorers
    2. Reproducibility — Deterministic pipelines with complete logging
    3. Composability — Mix and match components like LEGO blocks
    4. Transparency — Open source, inspectable at every step
    ┌─────────────────────────────────────────────────┐
    │              Inspect Architecture               │
    ├─────────────────────────────────────────────────┤
    │                                                 │
    │   @task ───────────────────
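
    A minimal Inspect task looks roughly like this. This is a sketch against the library's public API; exact parameter names (e.g. solver vs. the older plan argument) vary across Inspect versions:

    from inspect_ai import Task, task, eval
    from inspect_ai.dataset import Sample
    from inspect_ai.solver import generate
    from inspect_ai.scorer import match

    @task
    def sycophancy_smoke_test():
        return Task(
            dataset=[Sample(input="What is 2 + 2?", target="4")],
            solver=generate(),  # sample a completion from the model under test
            scorer=match(),     # check the target against the model output
        )

    # eval(sycophancy_smoke_test(), model="openai/gpt-4o-mini")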
    

    Running Evaluations: Analysis

    Running an evaluation produces data. Analysis transforms that data into evidence. The difference between a good eval and a great eval is often in the analysis.


    Analyzing Evaluation Results

    Basic Metrics

    Start with the fundamentals:

    from dataclasses import dataclass
    import numpy as np
    from scipy import stats
    
    @dataclass
    class EvalMetrics:
        """Core evaluation metrics."""
        accuracy: float
        n_correct: int
        n_total: int
        ci_lower: float
        ci_upper: float
    
    def compute_basic_metrics(results: list[dict]) -> EvalMetrics:
        """Compute accuracy with confidence interval."""
        n_total = len(results)
        n_correct = sum(1 for r in results if r["correct"])
        accuracy = n_correct / n_total
    
        # Wilson score interval (better for proportions near 0 or 1)
        ci_lower, ci_upper = wilson_confidence_interval(n_correct, n_total)
    
        return EvalMetrics(
            accuracy=accuracy,
            n_correct=n_correct,
            n_total=n_total,
            ci_lower=ci_lower,
            ci_upper=ci_upper,
        )
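
    The wilson_confidence_interval helper is referenced but not shown above; a minimal sketch of the Wilson score interval it names:

    def wilson_confidence_interval(n_correct, n_total, z=1.96):
        """Approximate 95% Wilson score interval for a binomial proportion."""
        if n_total == 0:
            return 0.0, 0.0
        p = n_correct / n_total
        denom = 1 + z**2 / n_total
        center = (p + z**2 / (2 * n_total)) / denom
        half_width = (z / denom) * ((p * (1 - p) + z**2 / (4 * n_total)) / n_total) ** 0.5
        return center - half_width, center + half_width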
    