OthelloGPT: Emergent World Representations
Can a language model learn to understand the world, not just mimic text patterns? OthelloGPT provides striking evidence that it can.
The Big Question
A transformer is trained only to predict legal Othello moves. No board state is ever provided. Just sequences of moves.
Yet the model spontaneously learns to represent the full board state internally.
This isn't memorization. The model has learned a world model - an internal representation of the game state that it uses for computation.
Why This Matters
The debate: Do LLMs "really understand" or just pattern match?
OthelloGPT shows:
- Simple prediction objectives can create rich internal representations
- Models can learn to track state that's never explicitly provided
- These representations are linear and interpretable
If a small model learns a world model for Othello, what might GPT-4 have learned about physics, psychology, or causality?
Othello Basics
Othello is played on an 8x8 board. Two players (black and white) take turns placing pieces.
Rules:
- A move is legal only if it captures opponent pieces
- Capturing happens along horizontal, vertical, or diagonal lines
- All captured pieces flip to your color
- Black plays first
The key insight: Predicting legal moves requires tracking the full board state. A move that was legal 5 turns ago might now be illegal.
The OthelloGPT Architecture
from transformer_lens import HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=8,
    d_model=512,
    d_head=64,
    n_heads=8,
    d_mlp=2048,
    d_vocab=61,  # 60 playable squares (64 minus the 4 pre-filled center squares) + a pass token
    n_ctx=59,    # 60 moves - 1: the model sees at most the first 59 moves as input
    act_fn="gelu",
)
Small enough to study. Complex enough to be interesting.
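To make the config concrete, here is a minimal sketch of instantiating a model with this architecture in TransformerLens (the published OthelloGPT weights were trained separately and loaded; this just builds the untrained architecture):

from transformer_lens import HookedTransformer

# Build a GPT-style transformer from the config above (weights are random here).
model = HookedTransformer(cfg)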
Training Objective
The model sees random legal games and predicts the next move:
# Input: moves [0:59]
# Target: moves [1:60]
# Loss = cross entropy against actual next move
Since the training data consists of games of uniformly random legal moves, the optimal model predicts a uniform distribution over all legal next moves.
The model is never told which squares are occupied or by whom.
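As a rough sketch of this objective in code (assuming a batch of tokenized games `games_int` of shape (batch, 60); the name is illustrative, not from the original training script):

import torch.nn.functional as F

def next_move_loss(model, games_int):
    # Predict moves [1:60] from moves [0:59]
    logits = model(games_int[:, :-1])   # (batch, 59, d_vocab)
    targets = games_int[:, 1:]          # (batch, 59)
    return F.cross_entropy(
        logits.flatten(0, 1),           # (batch * 59, d_vocab)
        targets.flatten(),              # (batch * 59,)
    )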
The Original Finding: Nonlinear Probes
The original paper (Li et al., 2022) trained probes to extract the board state and found that only nonlinear probes worked:
# Linear probe: 20.4% error rate (doesn't work)
# Nonlinear probe (2-layer MLP): 1.7% error rate (works!)
This suggested the board representation might be fundamentally nonlinear.
But Neel Nanda discovered something different...
The Key Insight: Mine vs Theirs
Linear probes do work - if you use the right basis.
The model doesn't represent "this square has a black piece."
It represents "this square has MY piece."
# Original probe basis:
# - Empty
# - Black
# - White
# Better probe basis:
# - Empty
# - Mine
# - Theirs
Since black and white alternate turns, "mine" and "theirs" flip each move. This explains why simple linear probes failed.
Linear Probe in the Right Basis
linear_probe = t.stack([
    # "Empty" direction
    full_linear_probe[[black_to_play, white_to_play], ..., [empty, empty]].mean(0),
    # "Theirs" direction (opponent's color)
    full_linear_probe[[black_to_play, white_to_play], ..., [white, black]].mean(0),
    # "Mine" direction (my color)
    full_linear_probe[[black_to_play, white_to_play], ..., [black, white]].mean(0),
], dim=-1)
Result: High accuracy with a purely linear probe!
Visualizing Probe Directions
The "black vs white" direction for odd moves is nearly opposite to the same direction for even moves:
# Cosine similarity between probe directions:
# Same square, odd vs even moves: ~-1.0
# Different squares: ~0.0
This confirms the model thinks in "mine/theirs" not "black/white."
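A sketch of that check, reusing the index names from the stacking code above (the exact tensor layout of `full_linear_probe` is assumed to be (mode, d_model, row, col, option)):

# "Black minus white" direction, from the probes fitted on each parity.
black_to_play_dir = full_linear_probe[black_to_play, ..., black] - full_linear_probe[black_to_play, ..., white]
white_to_play_dir = full_linear_probe[white_to_play, ..., black] - full_linear_probe[white_to_play, ..., white]
# Per-square cosine similarity; values near -1 mean the direction flips sign
# between parities, exactly what a "mine/theirs" representation predicts.
cos_sim = einops.einsum(
    black_to_play_dir / black_to_play_dir.norm(dim=0),
    white_to_play_dir / white_to_play_dir.norm(dim=0),
    "d_model row col, d_model row col -> row col",
)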
Extracting Board State
def get_board_state_from_model(cache, layer, game_idx, move):
    """Extract board state using linear probe."""
    residual = cache["resid_post", layer][game_idx, move]
    # Project onto probe directions
    probe_out = einops.einsum(
        residual, linear_probe,
        "d_model, d_model row col options -> options row col"
    )
    # Take argmax to get prediction for each square
    return probe_out.argmax(dim=0)
By layer 6, the probe achieves near-perfect accuracy.
When Does the Representation Form?
Probe accuracy by layer:
- Layers 0-2: Poor (model still computing)
- Layers 3-4: Good (most squares correct)
- Layers 5-6: Excellent (nearly perfect)
- Layer 7: Excellent (computation complete)
The board state representation emerges through layers, not all at once.
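One way to produce those numbers, as a sketch (assuming a cache of activations `focus_cache` over a batch of games and ground-truth mine/theirs labels `focus_states` of shape (game, move, row, col); both names are illustrative):

def probe_accuracy_by_layer(focus_cache, focus_states, linear_probe, n_layers=8):
    """Fraction of squares the linear probe classifies correctly, per layer."""
    accuracies = []
    for layer in range(n_layers):
        probe_out = einops.einsum(
            focus_cache["resid_post", layer], linear_probe,
            "game move d_model, d_model row col options -> game move row col options",
        )
        preds = probe_out.argmax(dim=-1)
        accuracies.append((preds == focus_states).float().mean().item())
    return accuracies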
Causal Interventions
The ultimate test: Can we change the model's behavior by editing its representation?
def apply_scale(resid, flip_dir, scale, pos):
    """
    Negate (and scale by `scale`) the component of the residual stream at
    position `pos` along `flip_dir`, flipping that square from 'mine' to 'theirs'.
    """
    flip_dir_normed = flip_dir / flip_dir.norm()
    # Get current projection onto flip direction
    alpha = resid[0, pos] @ flip_dir_normed
    # Subtract (scale + 1) * alpha, so the new component is -scale * alpha
    resid[0, pos] -= (scale + 1) * alpha * flip_dir_normed
    return resid
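Wiring this into the model might look like the following sketch, using a TransformerLens hook. `LAYER`, `POS`, `SCALE`, `tokens`, and `flip_dir` are placeholders: `flip_dir` would be the probe's "mine minus theirs" direction for the target square.

from transformer_lens import utils

def flip_hook(resid, hook):
    # Assumes a single game in the batch, matching apply_scale's resid[0, pos] indexing.
    return apply_scale(resid, flip_dir, scale=SCALE, pos=POS)

patched_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", LAYER), flip_hook)],
)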
The Intervention Works
When we flip F4 from "mine" to "theirs":
# Before intervention:
# G4 legal, D2 illegal
# After intervention:
# G4 illegal, D2 legal
The model now predicts legal moves for a board state that never occurred in training and may be impossible to reach through legal play!
This is strong evidence that:
- The linear probe found a real representation
- The model uses this representation causally
- We can precisely control model behavior
Attention vs MLP Contributions
Where does the board state come from?
def get_contributions(cache, probe, layer, game_idx, move):
    """Project this layer's attention and MLP outputs onto a probe direction
    (probe has one direction per square, shape (d_model, row, col))."""
    attn_contrib = einops.einsum(
        cache["attn_out", layer][game_idx, move],
        probe,
        "d_model, d_model row col -> row col"
    )
    mlp_contrib = einops.einsum(
        cache["mlp_out", layer][game_idx, move],
        probe,
        "d_model, d_model row col -> row col"
    )
    return attn_contrib, mlp_contrib
Finding: Attention layers handle most squares. MLP layers are crucial for recently-captured pieces.
Neuron Interpretability
Individual neurons have interpretable roles:
def analyze_neuron(model, layer, neuron, probe):
    """What does this neuron respond to?"""
    # Input weights: what residual stream directions activate it?
    w_in = model.W_in[layer, :, neuron]
    input_pattern = einops.einsum(
        w_in / w_in.norm(),
        probe,
        "d_model, d_model row col -> row col"
    )
    # Output weights: what does it write to the residual stream?
    w_out = model.W_out[layer, neuron, :]
    output_pattern = einops.einsum(
        w_out / w_out.norm(),
        probe,
        "d_model, d_model row col -> row col"
    )
    return input_pattern, output_pattern
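A hypothetical call for the neuron discussed next, using the "mine" slice of the probe built earlier (index 2 is the "mine" option in that stacking order):

mine_probe = linear_probe[..., 2]  # the "mine" direction for every square
in_pattern, out_pattern = analyze_neuron(model, layer=5, neuron=1393, probe=mine_probe)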
Example: Neuron L5N1393
Analysis reveals this neuron:
- Reads from: specific squares in rows C-E
- Writes to: "mine" direction for square D4
Hypothesis: This neuron helps track captures in the central region.
Such fine-grained interpretability is possible because:
- The probe gives us meaningful directions
- Neurons have linear input/output relationships
- The model is small enough to inspect thoroughly
Max Activating Dataset Analysis
Find inputs that maximally activate a neuron:
def get_max_activating_examples(cache, layer, neuron, k=20):
    """Find the k board states that most activate this neuron."""
    activations = cache["post", layer][:, :, neuron]
    # Flatten and get top k
    flat = activations.flatten()
    top_indices = flat.topk(k).indices
    # Convert back to (game, move) pairs
    games = top_indices // activations.shape[1]
    moves = top_indices % activations.shape[1]
    return list(zip(games.tolist(), moves.tolist()))
Common pattern: Neurons often activate for specific board configurations (edges, corners, particular lines).
Spectrum Plots
Visualize the distribution of activations:
def plot_activation_spectrum(cache, layer, neuron):
    """How does this neuron's activation vary across all games/moves?"""
    acts = cache["post", layer][:, :, neuron].flatten()
    fig = px.histogram(acts.cpu().numpy(), title=f"L{layer}N{neuron} Activation Distribution")
    fig.show()
Observations:
- Many neurons are sparse (mostly zero)
- Some neurons have bimodal distributions (binary features?)
- Distribution shape hints at what the neuron computes
The "Blank" vs "Occupied" Computation
Simpler than mine/theirs. The model just needs to track: "Has this square been played?"
# Blank probe direction
blank_probe = linear_probe[..., 0] - 0.5 * (linear_probe[..., 1] + linear_probe[..., 2])
# This is accurate very early (layer 0-1)
# Because it's a simpler computation
Attention alone can do this: just attend to all previous moves and check if any match this square.
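A quick sanity check of the blank direction, as a sketch (again assuming a cache `focus_cache` and a boolean ground-truth mask `focus_blank` of shape (game, move, row, col); the zero threshold is an assumption):

# Project an early layer's residual stream onto the blank direction.
blank_score = einops.einsum(
    focus_cache["resid_post", 1], blank_probe,
    "game move d_model, d_model row col -> game move row col",
)
# Sign of the projection as a crude blank/occupied classifier.
accuracy = ((blank_score > 0) == focus_blank).float().mean()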
Corner Squares Are Special
The probe has a higher error rate on corner squares.
Why? In Othello, corner pieces can never be captured. Once placed, they never flip.
Hypothesis: The model has a different, simpler circuit for corners that doesn't fully match the general "mine/theirs" probe direction.
Training Your Own Probe
import torch
import torch.nn as nn
import einops

class LinearProbe(nn.Module):
    def __init__(self, d_model, n_squares=64, n_classes=3):
        super().__init__()
        self.probe = nn.Parameter(
            torch.randn(d_model, n_squares, n_classes) * 0.01
        )

    def forward(self, residual_stream):
        # residual_stream: (batch, seq, d_model)
        # output: (batch, seq, 64, 3)
        return einops.einsum(
            residual_stream, self.probe,
            "batch seq d_model, d_model squares classes -> batch seq squares classes"
        )
Training: Cross-entropy loss against ground-truth board states.
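A minimal training-loop sketch under those assumptions, where `resid` is a tensor of cached residual-stream activations and `labels` the matching board-state classes in {0, 1, 2} (both names are illustrative):

import torch
import torch.nn.functional as F

def train_probe(probe_module, resid, labels, n_epochs=5, lr=1e-3):
    # resid: (batch, seq, d_model); labels: (batch, seq, 64) with values in {0, 1, 2}
    opt = torch.optim.AdamW(probe_module.parameters(), lr=lr)
    for _ in range(n_epochs):
        logits = probe_module(resid)       # (batch, seq, 64, 3)
        loss = F.cross_entropy(
            logits.flatten(0, 2),          # (batch * seq * 64, 3)
            labels.flatten(),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe_module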
Probe Training Tips
- Train on middle game (moves 5-55) - early/late game have edge cases
- Use both parities - train on odd and even moves
- Track per-square accuracy - some squares are harder
- Compare layers - find where representation is strongest
Capstone Connection
OthelloGPT techniques directly apply to sycophancy detection:
| OthelloGPT | Sycophancy |
|---|---|
| Board state probe | User-stance probe |
| "Mine vs theirs" | "Agree vs disagree" |
| Causal intervention on F4 | Causal intervention on agreement direction |
| Neuron for corner squares | Neuron for detecting criticism |
Key questions for your capstone:
- Can you find a linear direction that encodes "user believes X"?
- Can you find a separate direction for "X is actually true"?
- What happens when you intervene on the "agree with user" direction?
The Broader Lesson
OthelloGPT demonstrates:
- Emergent structure: Models learn more than they're explicitly taught
- Linear representations: Complex concepts are often linear directions
- Causal validity: We can verify our interpretations with interventions
- Interpretable components: Individual neurons have meaningful roles
These principles likely apply to much larger language models.
Open Questions
- Generalization: Does the probe generalize to board states never seen in training?
- Impossible states: How does the model represent impossible board configurations?
- Alternative circuits: Are there backup circuits if we ablate the main one?
- Scaling: Would a larger OthelloGPT have cleaner or messier representations?
Each of these could be a research project.
🎓 Tyla's Exercise
The model represents "mine/theirs" rather than "black/white." Why is this more natural from the model's computational perspective? What would be different if it used black/white?
Linear probes extract features, but can they also create features? If you train a probe that achieves good accuracy, how do you know the model computed that feature vs the probe computing it?
The causal intervention changes model predictions. But does it prove the model "uses" the representation, or just that the representation is correlated with something the model uses?
💻 Aaliyah's Exercise
Implement the core OthelloGPT analysis pipeline:
def train_linear_probe(model, games, board_states, layer):
    """
    1. Run games through model, cache activations at `layer`
    2. Initialize probe with shape (d_model, 8, 8, 3)
    3. Train probe to predict board_states from activations
    4. Return trained probe and accuracy metrics
    """
    pass

def causal_intervention(model, game, move, square, flip_direction):
    """
    1. Get the probe direction for flipping `square`
    2. Define hook that flips the square's representation
    3. Run model with hook
    4. Compare predictions before/after
    5. Return dict with newly_legal and newly_illegal moves
    """
    pass

def find_interpretable_neurons(model, probe, layer):
    """
    1. For each neuron, compute input/output weight projections onto probe
    2. Identify neurons with strong, localized patterns
    3. Run max-activating dataset analysis
    4. Return list of (neuron_idx, interpretation) tuples
    """
    pass
📚 Maneesha's Reflection
The original paper used nonlinear probes because linear ones "didn't work." Neel Nanda discovered they work with the right basis. What does this teach us about the importance of choosing the right conceptual framework for interpretability?
OthelloGPT is a "toy model" - small, synthetic, fully understood. How should findings from toy models inform our beliefs about GPT-4? What transfers and what might not?
If we can causally intervene to make the model play illegal moves, what are the implications for AI safety? Is interpretability a path to controllability?