OthelloGPT: Emergent World Representations
Can a language model learn to understand the world, not just mimic text patterns? OthelloGPT provides striking evidence that it can.
The Big Question
A transformer is trained only to predict legal Othello moves. No board state is ever provided. Just sequences of moves.
Yet the model spontaneously learns to represent the full board state internally.
This isn't memorization. The model has learned a world model - an internal representation of the game state that it uses for computation.
Why This Matters
The debate: Do LLMs "really understand" or just pattern match?
OthelloGPT shows:
- Simple prediction objectives can create rich internal representations
- Models can learn to track state that's never explicitly provided
- These representations are linear and interpretable
If a small model learns a world model for Othello, what might GPT-4 have learned about physics, psychology, or causality?
Othello Basics
Othello is played on an 8x8 board. Two players (black and white) take turns placing pieces.
Rules:
- A move is legal only if it captures opponent pieces
- Capturing happens along horizontal, vertical, or diagonal lines
- All captured pieces flip to your color
- Black plays first
The key insight: Predicting legal moves requires tracking the full board state. A move that was legal 5 turns ago might now be illegal.
The OthelloGPT Architecture
from transformer_lens import HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=8,
    d_model=512,
    d_head=64,
    n_heads=8,
    d_mlp=2048,
    d_vocab=61,  # 60 playable squares (64 minus the 4 pre-filled center squares) + a pass token
    n_ctx=59,    # 60 moves - 1: the model sees at most the first 59 moves as input
    act_fn="gelu",
)
Small enough to study. Complex enough to be interesting.
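To make the config concrete, here is a minimal sketch of instantiating a model with this architecture in TransformerLens (the published OthelloGPT weights were trained separately and loaded; this just builds the untrained architecture):

from transformer_lens import HookedTransformer

# Build a GPT-style transformer from the config above (weights are random here).
model = HookedTransformer(cfg)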
Training Objective
The model sees random legal games and predicts the next move:
# Input: moves [0:59]
# Target: moves [1:60]
# Loss = cross entropy against actual next move
Since the training data consists of games of uniformly random legal moves, the optimal model predicts a uniform distribution over all legal next moves.
The model is never told which squares are occupied or by whom.
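As a rough sketch of this objective in code (assuming a batch of tokenized games `games_int` of shape (batch, 60); the name is illustrative, not from the original training script):

import torch.nn.functional as F

def next_move_loss(model, games_int):
    # Predict moves [1:60] from moves [0:59]
    logits = model(games_int[:, :-1])   # (batch, 59, d_vocab)
    targets = games_int[:, 1:]          # (batch, 59)
    return F.cross_entropy(
        logits.flatten(0, 1),           # (batch * 59, d_vocab)
        targets.flatten(),              # (batch * 59,)
    )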
The Original Finding: Nonlinear Probes
The original paper (Li et al., 2022) trained probes to extract the board state and found that only nonlinear probes worked:
# Linear probe: 20.4% error rate (doesn't work)
# Nonlinear probe (2-layer MLP): 1.7% error rate (works!)
This suggested the board representation might be fundamentally nonlinear.
But Neel Nanda discovered something different...
The Key Insight: Mine vs Theirs
Linear probes do work - if you use the right basis.
The model doesn't represent "this square has a black piece."
It represents "this square has MY piece."
# Original probe basis:
# - Empty
# - Black
# - White
# Better probe basis:
# - Empty
# - Mine
# - Theirs
Since black and white alternate turns, "mine" and "theirs" flip each move. This explains why simple linear probes failed.
Linear Probe in the Right Basis
linear_probe = t.stack([
    # "Empty" direction
    full_linear_probe[[black_to_play, white_to_play], ..., [empty, empty]].mean(0),
    # "Theirs" direction (opponent's color)
    full_linear_probe[[black_to_play, white_to_play], ..., [white, black]].mean(0),
    # "Mine" direction (my color)
    full_linear_probe[[black_to_play, white_to_play], ..., [black, white]].mean(0),
], dim=-1)
Result: High accuracy with a purely linear probe!
Visualizing Probe Directions
The "black vs white" direction for odd moves is nearly opposite to the same direction for even moves:
# Cosine similarity between probe directions:
# Same square, odd vs even moves: ~-1.0
# Different squares: ~0.0
This confirms the model thinks in "mine/theirs" not "black/white."
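A sketch of that check, reusing the index names from the stacking code above (the exact tensor layout of `full_linear_probe` is assumed to be (mode, d_model, row, col, option)):

# "Black minus white" direction, from the probes fitted on each parity.
black_to_play_dir = full_linear_probe[black_to_play, ..., black] - full_linear_probe[black_to_play, ..., white]
white_to_play_dir = full_linear_probe[white_to_play, ..., black] - full_linear_probe[white_to_play, ..., white]
# Per-square cosine similarity; values near -1 mean the direction flips sign
# between parities, exactly what a "mine/theirs" representation predicts.
cos_sim = einops.einsum(
    black_to_play_dir / black_to_play_dir.norm(dim=0),
    white_to_play_dir / white_to_play_dir.norm(dim=0),
    "d_model row col, d_model row col -> row col",
)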
Extracting Board State
def get_board_state_from_model(cache, layer, game_idx, move):
    """Extract board state using linear probe."""
    residual = cache["resid_post", layer][game_idx, move]
    # Project onto probe directions
    probe_out = einops.einsum(
        residual, linear_probe,
        "d_model, d_model row col options -> options row col"
    )
    # Take argmax to get prediction for each square
    return probe_out.argmax(dim=0)
By layer 6, the probe achieves near-perfect accuracy.
When Does the Representation Form?
Probe accuracy by layer:
- Layers 0-2: Poor (model still computing)
- Layers 3-4: Good (most squares correct)
- Layers 5-6: Excellent (nearly perfect)
- Layer 7: Excellent (computation complete)
The board state representation emerges through layers, not all at once.
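One way to produce those numbers, as a sketch (assuming a cache of activations `focus_cache` over a batch of games and ground-truth mine/theirs labels `focus_states` of shape (game, move, row, col); both names are illustrative):

def probe_accuracy_by_layer(focus_cache, focus_states, linear_probe, n_layers=8):
    """Fraction of squares the linear probe classifies correctly, per layer."""
    accuracies = []
    for layer in range(n_layers):
        probe_out = einops.einsum(
            focus_cache["resid_post", layer], linear_probe,
            "game move d_model, d_model row col options -> game move row col options",
        )
        preds = probe_out.argmax(dim=-1)
        accuracies.append((preds == focus_states).float().mean().item())
    return accuracies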
Causal Interventions
The ultimate test: Can we change the model's behavior by editing its representation?
def apply_scale(resid, flip_dir, scale, pos):
    """
    Negate (and scale by `scale`) the component of the residual stream at
    position `pos` along `flip_dir`, flipping that square from 'mine' to 'theirs'.
    """
    flip_dir_normed = flip_dir / flip_dir.norm()
    # Get current projection onto flip direction
    alpha = resid[0, pos] @ flip_dir_normed
    # Subtract (scale + 1) * alpha, so the new component is -scale * alpha
    resid[0, pos] -= (scale + 1) * alpha * flip_dir_normed
    return resid
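Wiring this into the model might look like the following sketch, using a TransformerLens hook. `LAYER`, `POS`, `SCALE`, `tokens`, and `flip_dir` are placeholders: `flip_dir` would be the probe's "mine minus theirs" direction for the target square.

from transformer_lens import utils

def flip_hook(resid, hook):
    # Assumes a single game in the batch, matching apply_scale's resid[0, pos] indexing.
    return apply_scale(resid, flip_dir, scale=SCALE, pos=POS)

patched_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", LAYER), flip_hook)],
)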
The Intervention Works
When we flip F4 from "mine" to "theirs":
# Before intervention:
# G4 legal, D2 illegal
# After intervention:
# G4 illegal, D2 legal
The model now predicts legal moves for a board state that never occurred in training and may be impossible to reach through legal play!
This is strong evidence that:
- The linear probe found a real representation
- The model uses this representation causally
- We can precisely control model behavior
Attention vs MLP Contributions
Where does the board state come from?
def get_contributions(cache, probe, layer, game_idx, move):
    """Project this layer's attention and MLP outputs onto a probe direction
    (probe has one direction per square, shape (d_model, row, col))."""
    attn_contrib = einops.einsum(
        cache["attn_out", layer][game_idx, move],
        probe,
        "d_model, d_model row col -> row col"
    )
    mlp_contrib = einops.einsum(
        cache["mlp_out", layer][game_idx, move],
        probe,
        "d_model, d_model row col -> row col"
    )
    return attn_contrib, mlp_contrib
Finding: Attention layers handle most squares. MLP layers are crucial for recently-captured pieces.
Neuron Interpretability
Individual neurons have interpretable roles:
def analyze_neuron(model, layer, neuron, probe):
    """What does this neuron respond to?"""
    # Input weights: what residual stream directions activate it?
    w_in = model.W_in[layer, :, neuron]
    input_pattern = einops.einsum(
        w_in / w_in.norm(),
        probe,
        "d_model, d_model row col -> row col"
    )
    # Output weights: what does it write to the residual stream?
    w_out = model.W_out[layer, neuron, :]
    output_pattern = einops.einsum(
        w_out / w_out.norm(),
        probe,
        "d_model, d_model row col -> row col"
    )
    return input_pattern, output_pattern
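A hypothetical call for the neuron discussed next, using the "mine" slice of the probe built earlier (index 2 is the "mine" option in that stacking order):

mine_probe = linear_probe[..., 2]  # the "mine" direction for every square
in_pattern, out_pattern = analyze_neuron(model, layer=5, neuron=1393, probe=mine_probe)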
Example: Neuron L5N1393
Analysis reveals this neuron:
- Reads from: specific squares in rows C-E
- Writes to: "mine" direction for square D4
Hypothesis: This neuron helps track captures in the central region.
Such fine-grained interpretability is possible because:
- The probe gives us meaningful directions
- Neurons have linear input/output relationships
- The model is small enough to inspect thoroughly
Max Activating Dataset Analysis
Find inputs that maximally activate a neuron:
def get_max_activating_examples(cache, layer, neuron, k=20):
    """Find the k board states that most activate this neuron."""
    activations = cache["post", layer][:, :, neuron]
    # Flatten and get top k
    flat = activations.flatten()
    top_indices = flat.topk(k).indices
    # Convert back to (game, move) pairs
    games = top_indices // activations.shape[1]
    moves = top_indices % activations.shape[1]
    return list(zip(games.tolist(), moves.tolist()))
Common pattern: Neurons often activate for specific board configurations (edges, corners, particular lines).
Spectrum Plots
Visualize the distribution of activations:
def plot_activation_spectrum(cache, layer, neuron):
    """How does this neuron's activation vary across all games/moves?"""
    acts = cache["post", layer][:, :, neuron].flatten()
    fig = px.histogram(acts.cpu().numpy(), title=f"L{layer}N{neuron} Activation Distribution")
    fig.show()
Observations:
- Many neurons are sparse (mostly zero)
- Some neurons have bimodal distributions (binary features?)
- Distribution shape hints at what the neuron computes
The "Blank" vs "Occupied" Computation
Simpler than mine/theirs. The model just needs to track: "Has this square been played?"
# Blank probe direction
blank_probe = linear_probe[..., 0] - 0.5 * (linear_probe[..., 1] + linear_probe[..., 2])
# This is accurate very early (layer 0-1)
# Because it's a simpler computation
Attention alone can do this: just attend to all previous moves and check if any match this square.
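A quick sanity check of the blank direction, as a sketch (again assuming a cache `focus_cache` and a boolean ground-truth mask `focus_blank` of shape (game, move, row, col); the zero threshold is an assumption):

# Project an early layer's residual stream onto the blank direction.
blank_score = einops.einsum(
    focus_cache["resid_post", 1], blank_probe,
    "game move d_model, d_model row col -> game move row col",
)
# Sign of the projection as a crude blank/occupied classifier.
accuracy = ((blank_score > 0) == focus_blank).float().mean()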
Corner Squares Are Special
The probe has a higher error rate on corner squares.
Why? In Othello, corner pieces can never be captured. Once placed, they never flip.
Hypothesis: The model has a different, simpler circuit for corners that doesn't fully match the general "mine/theirs" probe direction.
Training Your Own Probe
import torch
import torch.nn as nn
import einops

class LinearProbe(nn.Module):
    def __init__(self, d_model, n_squares=64, n_classes=3):
        super().__init__()
        self.probe = nn.Parameter(
            torch.randn(d_model, n_squares, n_classes) * 0.01
        )

    def forward(self, residual_stream):
        # residual_stream: (batch, seq, d_model)
        # output: (batch, seq, 64, 3)
        return einops.einsum(
            residual_stream, self.probe,
            "batch seq d_model, d_model squares classes -> batch seq squares classes"
        )
Training: Cross-entropy loss against ground-truth board states.
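A minimal training-loop sketch under those assumptions, where `resid` is a tensor of cached residual-stream activations and `labels` the matching board-state classes in {0, 1, 2} (both names are illustrative):

import torch
import torch.nn.functional as F

def train_probe(probe_module, resid, labels, n_epochs=5, lr=1e-3):
    # resid: (batch, seq, d_model); labels: (batch, seq, 64) with values in {0, 1, 2}
    opt = torch.optim.AdamW(probe_module.parameters(), lr=lr)
    for _ in range(n_epochs):
        logits = probe_module(resid)       # (batch, seq, 64, 3)
        loss = F.cross_entropy(
            logits.flatten(0, 2),          # (batch * seq * 64, 3)
            labels.flatten(),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe_module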
Probe Training Tips
- Train on middle game (moves 5-55) - early/late game have edge cases
- Use both parities - train on odd and even moves
- Track per-square accuracy - some squares are harder
- Compare layers - find where representation is strongest
Capstone Connection
OthelloGPT techniques directly apply to sycophancy detection:
| OthelloGPT | Sycophancy |
|---|---|
| Board state probe | User-stance probe |
| "Mine vs theirs" | "Agree vs disagree" |
| Causal intervention on F4 | Causal intervention on agreement direction |
| Neuron for corner squares | Neuron for detecting criticism |
Key questions for your capstone:
- Can you find a linear direction that encodes "user believes X"?
- Can you find a separate direction for "X is actually true"?
- What happens when you intervene on the "agree with user" direction?
The Broader Lesson
OthelloGPT demonstrates:
- Emergent structure: Models learn more than they're explicitly taught
- Linear representations: Complex concepts are often linear directions
- Causal validity: We can verify our interpretations with interventions
- Interpretable components: Individual neurons have meaningful roles
These principles likely apply to much larger language models.
Open Questions
- Generalization: Does the probe generalize to board states never seen in training?
- Impossible states: How does the model represent impossible board configurations?
- Alternative circuits: Are there backup circuits if we ablate the main one?
- Scaling: Would a larger OthelloGPT have cleaner or messier representations?
Each of these could be a research project.
🎓 Tyla's Exercise
The model represents "mine/theirs" rather than "black/white." Why is this more natural from the model's computational perspective? What would be different if it used black/white?
Linear probes extract features, but can they also create features? If you train a probe that achieves good accuracy, how do you know the model computed that feature vs the probe computing it?
The causal intervention changes model predictions. But does it prove the model "uses" the representation, or just that the representation is correlated with something the model uses?
💻 Aaliyah's Exercise
Implement the core OthelloGPT analysis pipeline:
def train_linear_probe(model, games, board_states, layer):
    """
    1. Run games through model, cache activations at `layer`
    2. Initialize probe with shape (d_model, 8, 8, 3)
    3. Train probe to predict board_states from activations
    4. Return trained probe and accuracy metrics
    """
    pass

def causal_intervention(model, game, move, square, flip_direction):
    """
    1. Get the probe direction for flipping `square`
    2. Define hook that flips the square's representation
    3. Run model with hook
    4. Compare predictions before/after
    5. Return dict with newly_legal and newly_illegal moves
    """
    pass

def find_interpretable_neurons(model, probe, layer):
    """
    1. For each neuron, compute input/output weight projections onto probe
    2. Identify neurons with strong, localized patterns
    3. Run max-activating dataset analysis
    4. Return list of (neuron_idx, interpretation) tuples
    """
    pass
📚 Maneesha's Reflection
The original paper used nonlinear probes because linear ones "didn't work." Neel Nanda discovered they work with the right basis. What does this teach us about the importance of choosing the right conceptual framework for interpretability?
OthelloGPT is a "toy model" - small, synthetic, fully understood. How should findings from toy models inform our beliefs about GPT-4? What transfers and what might not?
If we can causally intervene to make the model play illegal moves, what are the implications for AI safety? Is interpretability a path to controllability?