Superposition: The Core Problem

Why can't we just read features from neurons? Because models cram more features than they have dimensions.


The Superposition Problem

Superposition is when a model represents more than $n$ features in an $n$-dimensional space.

Imagine representing 100 features with only 10 neurons.
Each neuron must encode multiple features.
Features share dimensions.
This creates interference.

This breaks our interpretability dreams: you can no longer read a feature off a single neuron.
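A minimal illustration of the ambiguity, with made-up weights: a single neuron that carries two features produces nearly the same activation whichever feature fired.

# Hypothetical: one neuron reads two features with similar weights
w_is_code, w_is_french = 0.9, 0.8

activation_when_code   = w_is_code * 1.0 + w_is_french * 0.0   # 0.9
activation_when_french = w_is_code * 0.0 + w_is_french * 1.0   # 0.8

# The two activations are nearly identical, so the neuron's value alone
# cannot tell you which feature was present
print(activation_when_code, activation_when_french)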


Why Superposition Happens

The world has more features than models have neurons:

| Concept | Typical count |
|---|---|
| English words | ~170,000 |
| Named entities | Millions |
| Concepts/relations | Unbounded |
| GPT-2 Small neurons | 49,152 |

The model must compress. Superposition is the compression strategy.


The Key Insight: Sparsity

Superposition works because features are sparse:

# Most features are zero most of the time
feature_activations = {
    "is_code": [0, 0, 0, 1, 0, 0, 0, 0],  # Rarely active
    "is_question": [0, 1, 0, 0, 0, 0, 0, 0],  # Rarely active
    "is_noun": [1, 1, 0, 1, 0, 1, 1, 0],  # More common
}

# If features rarely co-occur, interference is rare
# Model can "reuse" the same dimensions

When feature A is active, feature B is usually zero, so sharing a neuron works!
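A quick simulation of that intuition (the 5% activation rate is made up for illustration): if two features each fire rarely and independently, they almost never fire together, so sharing a dimension rarely hurts.

import numpy as np

rng = np.random.default_rng(0)
p_active = 0.05                  # assumed probability that each feature is active
n_samples = 100_000

a = rng.random(n_samples) < p_active
b = rng.random(n_samples) < p_active

collision_rate = (a & b).mean()  # both active at once → actual interference
print(f"co-activation rate: {collision_rate:.4f} (expected ~{p_active**2:.4f})")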


Superposition vs Polysemanticity

Polysemanticity: One neuron responds to multiple features

neuron_42_activates_on = ["cats", "dogs", "fuzzy"]

Superposition: The model represents more features than it has dimensions

n_features = 1000
n_neurons = 100
# Features must share dimensions

Polysemanticity is a symptom. Superposition is the cause.


The Anthropic Toy Model

Anthropic's "Toy Models of Superposition" paper studied superposition with a simple setup:

import torch

# 5 features → 2 dimensions → 5 outputs
x_input = torch.tensor([0.0, 0.7, 0.0, 0.0, 0.3])  # a sparse 5-feature input

W = torch.randn(2, 5)            # W is (2, 5) - must represent 5 features in 2D
b = torch.zeros(5)

h = W @ x_input                  # compress to 2D
x_out = torch.relu(W.T @ h + b)  # reconstruct to 5D

The model learns to pack features in 2D space like this:

            f2
             │
     f3 ╲    │    ╱ f1
          ╲  │  ╱
            ╲│╱
             ┼
            ╱ ╲
          ╱     ╲
        f4       f5

Features point in different, nearly orthogonal directions, forming an overcomplete "feature basis."
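A minimal training sketch of this setup, assuming PyTorch; the sparsity level, optimizer, and step count are illustrative choices, not the paper's exact ones. With sparse enough inputs, the columns of W typically spread out into the pentagon-like arrangement above.

import torch

torch.manual_seed(0)
n_features, n_hidden = 5, 2
sparsity = 0.9   # assumed probability that a feature is zero on a given input

W = torch.randn(n_hidden, n_features, requires_grad=True)
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    x = torch.rand(1024, n_features)                           # feature values in [0, 1)
    x = x * (torch.rand(1024, n_features) > sparsity).float()  # most entries are zero
    h = x @ W.T                                                # compress to 2D
    x_hat = torch.relu(h @ W + b)                              # reconstruct to 5D
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of W is the 2D direction assigned to one feature
dirs = W.detach() / W.detach().norm(dim=0)
angles = torch.rad2deg(torch.atan2(dirs[1], dirs[0]))
print(angles.sort().values)   # with enough sparsity, directions end up roughly 72° apart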


Feature Importance vs Sparsity

Two key properties determine how features get represented:

| Property | Definition | Effect |
|---|---|---|
| Importance | How much does this feature matter for the loss? | High importance → dedicated dimension |
| Sparsity | How rarely is this feature active? | High sparsity → shared dimensions |
In short: high importance with low sparsity pushes a feature toward its own orthogonal direction, while low importance with high sparsity pushes it into superposition.
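One common way to make importance concrete in this toy setting is a per-feature weight on the reconstruction loss, roughly $L = \sum_i I_i (x_i - \hat{x}_i)^2$; a minimal sketch, with made-up importance values:

import torch

importance = torch.tensor([1.0, 0.7, 0.5, 0.3, 0.1])  # hypothetical per-feature importance I_i

def weighted_reconstruction_loss(x, x_hat):
    # Errors on high-importance features cost more, so the model assigns
    # those features cleaner, closer-to-orthogonal directions first
    return (importance * (x - x_hat) ** 2).sum(dim=-1).mean()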

The Geometry of Superposition

In 2D, you can have at most 2 mutually orthogonal directions.

But you can have many nearly orthogonal directions:

import torch

# 5 features in 2D: they can't all be orthogonal,
# but interference can be kept small

W = torch.randn(2, 5)
W_normed = W / W.norm(dim=0)

# Check cosine similarities between feature directions
similarities = W_normed.T @ W_normed
# Diagonal = 1 (self-similarity)
# Off-diagonal entries measure pairwise interference; random directions are
# just a baseline, and training arranges W so this interference costs little

Visualizing Feature Geometry

import matplotlib.pyplot as plt
import numpy as np

# 5 features in 2D
angles = [0, 72, 144, 216, 288]  # evenly spaced around a pentagon
features = [(np.cos(a * np.pi/180), np.sin(a * np.pi/180)) for a in angles]

plt.figure(figsize=(6, 6))
for i, (x, y) in enumerate(features):
    plt.arrow(0, 0, x, y, head_width=0.05)
    plt.text(x*1.1, y*1.1, f'f{i+1}')
plt.xlim(-1.5, 1.5)
plt.ylim(-1.5, 1.5)
plt.gca().set_aspect("equal")
plt.title("Pentagon: 5 features in 2D")
plt.show()
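To quantify "nearly orthogonal," a small check on those pentagon directions: adjacent features overlap by about 0.31, and the larger overlaps are negative, which a ReLU readout can clip away.

import numpy as np

angles = np.deg2rad([0, 72, 144, 216, 288])
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (5, 2) unit vectors

similarities = dirs @ dirs.T
print(np.round(similarities, 2))
# Diagonal = 1; adjacent features overlap by cos(72°) ≈ 0.31,
# next-nearest ones by cos(144°) ≈ -0.81 (negative overlap a ReLU readout can clip)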

Capstone Connection

Superposition and sycophancy: the features most relevant to sycophancy might themselves be stored in superposition:

# Hypothetical features
features = [
    "user_sentiment_positive",
    "factual_correctness",
    "agreement_with_user",
    "confidence",
]

# These might share dimensions
# Making it hard to ablate "sycophancy" without affecting "confidence"
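A toy sketch of that last point, with hypothetical directions (`sycophancy_dir` and `confidence_dir` are made-up names, not real model features): if two feature directions are not orthogonal, steering along one also shifts the readout of the other in proportion to their cosine similarity.

import torch

torch.manual_seed(0)
d_model = 16                                   # toy residual-stream width
sycophancy_dir = torch.randn(d_model)
sycophancy_dir = sycophancy_dir / sycophancy_dir.norm()

# A "confidence" direction that partially overlaps with the sycophancy one
confidence_dir = 0.6 * sycophancy_dir + 0.8 * torch.randn(d_model)
confidence_dir = confidence_dir / confidence_dir.norm()

resid = torch.randn(d_model)                   # some residual-stream activation
steered = resid - 2.0 * sycophancy_dir         # try to suppress "sycophancy"

# The "confidence" readout moves too, by exactly -2.0 * cos(sycophancy, confidence)
overlap = sycophancy_dir @ confidence_dir
print((steered - resid) @ confidence_dir, -2.0 * overlap)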

Understanding superposition helps us:

  1. Find where sycophancy is encoded
  2. Understand why steering is hard
  3. Design better interventions

🎓 Tyla's Exercise

  1. Prove mathematically: In an $n$-dimensional space, you can have at most $n$ mutually orthogonal vectors. What's the maximum number of vectors with pairwise cosine similarity of magnitude $\leq 0.1$?

  2. If sparsity is $S$ (probability a feature is zero), what's the expected interference between two features that share a dimension?

  3. Explain why ReLU creates a "privileged basis" but the residual stream doesn't have one.


💻 Aaliyah's Exercise

Explore the geometry of superposition:

def analyze_superposition(n_features, n_hidden, sparsity):
    """
    1. Create random features and a bottleneck model
    2. Train to reconstruct sparse features
    3. Visualize the learned feature geometry
    4. Measure average cosine similarity between features
    5. How does sparsity affect the learned geometry?
    """
    pass

def measure_interference(W, features):
    """
    Given a weight matrix W and active features,
    compute the interference (error) in reconstruction.
    """
    pass

📚 Maneesha's Reflection

  1. Superposition is a compression strategy. What are the trade-offs between compression and interpretability?

  2. If the brain also uses superposition, what implications does this have for cognitive science?

  3. The linear representation hypothesis says features are directions. Under what conditions might this break down?