Transformers: Building GPT-2
Time to assemble a complete transformer. By the end, you'll have a working GPT-2 that can generate text.
GPT-2 Architecture Overview
```
Input Tokens
        ↓
Token Embedding + Position Embedding
        ↓
┌───────────────────────────────┐
│ TransformerBlock × 12         │
│  ├─ LayerNorm                 │
│  ├─ Multi-Head Attention      │
│  ├─ + Residual                │
│  ├─ LayerNorm                 │
│  ├─ MLP                       │
│  └─ + Residual                │
└───────────────────────────────┘
        ↓
Final LayerNorm
        ↓
Unembed → Logits
```
Layer Normalization
Normalize across the feature dimension:
```python
import torch as t
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(t.ones(d_model))
        self.beta = nn.Parameter(t.zeros(d_model))

    def forward(self, x: t.Tensor) -> t.Tensor:
        # Normalize across the last (feature) dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / t.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta
```
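A quick way to sanity-check this implementation is to compare it against PyTorch's built-in `nn.LayerNorm`, which also initializes its scale to ones and its shift to zeros. A minimal sketch, assuming the class above:

```python
# Sketch: fresh instances of our LayerNorm and nn.LayerNorm should agree.
x = t.randn(2, 5, 768)          # (batch, seq, d_model)

ours = LayerNorm(768)
ref = nn.LayerNorm(768, eps=1e-5)

print(t.allclose(ours(x), ref(x), atol=1e-5))  # True
```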
Why LayerNorm instead of BatchNorm?
- Works with variable sequence lengths
- No batch statistics needed at inference
- Each position normalized independently
The MLP Block
Two linear layers with GELU activation:
```python
class MLP(nn.Module):
    def __init__(self, d_model: int, d_mlp: int | None = None):
        super().__init__()
        d_mlp = d_mlp or 4 * d_model  # Typically 4× the hidden dim
        self.W_in = nn.Linear(d_model, d_mlp)
        self.W_out = nn.Linear(d_mlp, d_model)

    def forward(self, x: t.Tensor) -> t.Tensor:
        return self.W_out(F.gelu(self.W_in(x)))
```
GELU (Gaussian Error Linear Unit): $$\text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x)$$ where $\Phi$ is the standard normal CDF and $\sigma$ is the sigmoid. It's smoother than ReLU and often works better in transformers.
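A small sketch to make the approximation concrete, comparing PyTorch's exact GELU with the sigmoid form above:

```python
# Sketch: exact GELU vs. the sigmoid approximation x * sigmoid(1.702 * x).
x = t.linspace(-3, 3, 7)

exact = F.gelu(x)                   # x * Phi(x), computed via erf
approx = x * t.sigmoid(1.702 * x)   # the approximation above

print(t.max(t.abs(exact - approx))) # small, roughly 0.02 at worst
```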
The Transformer Block
```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = LayerNorm(d_model)
        self.mlp = MLP(d_model)

    def forward(self, x: t.Tensor) -> t.Tensor:
        # Pre-norm architecture (GPT-2 style)
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
```
Note the residual connections: `x = x + f(x)`. These are crucial for:
- Gradient flow during training
- Creating the "residual stream" for interpretability
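As a quick sanity check (a sketch assuming the MultiHeadAttention class from the previous section), a block maps the residual stream to a tensor of the same shape, which is exactly what makes stacking and adding possible:

```python
# Sketch: a block preserves the residual stream's shape.
block = TransformerBlock(d_model=768, n_heads=12)
x = t.randn(2, 10, 768)    # (batch, seq, d_model)

out = block(x)
print(out.shape)           # torch.Size([2, 10, 768])
```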
The Full Model
```python
class GPT2(nn.Module):
    def __init__(
        self,
        vocab_size: int = 50257,
        d_model: int = 768,
        n_heads: int = 12,
        n_layers: int = 12,
        max_seq_len: int = 1024,
    ):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_seq_len, d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads)
            for _ in range(n_layers)
        ])
        self.ln_final = LayerNorm(d_model)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens: t.Tensor) -> t.Tensor:
        batch, seq_len = tokens.shape

        # Embeddings
        x = self.token_embed(tokens)
        x = x + self.pos_embed(t.arange(seq_len, device=tokens.device))

        # Transformer blocks
        for block in self.blocks:
            x = block(x)

        # Output
        x = self.ln_final(x)
        logits = self.unembed(x)
        return logits
```
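A minimal smoke test of the wiring, assuming the classes above. The weights are random at this point, so the logits are meaningless, but the shapes and parameter count should check out:

```python
# Sketch: forward-pass an untrained model with random token IDs.
model = GPT2()                            # gpt2-small defaults
tokens = t.randint(0, 50257, (2, 10))     # (batch, seq)

logits = model(tokens)
print(logits.shape)                       # torch.Size([2, 10, 50257])

# ~163M here, since unembed is a separate matrix; GPT-2's quoted 124M
# ties the unembedding to the token embedding.
print(sum(p.numel() for p in model.parameters()))
```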
Loading Pretrained Weights
```python
from transformers import GPT2LMHeadModel


def load_gpt2_weights(model: GPT2):
    """Load weights from HuggingFace GPT-2."""
    hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
    hf_state = hf_model.state_dict()

    # Map HuggingFace names to our names
    # (This is tedious but necessary)
    model.token_embed.weight.data = hf_state["transformer.wte.weight"]
    model.pos_embed.weight.data = hf_state["transformer.wpe.weight"]
    # ... etc for all layers
```
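One gotcha when you fill in the rest: HuggingFace's GPT-2 implements its projections as `Conv1D` modules whose weights are stored as (in_features, out_features), so they must be transposed before being copied into `nn.Linear`. Here's a hedged sketch of the per-block loop you might add inside `load_gpt2_weights`, covering the LayerNorms and MLP (the attention mapping depends on how your MultiHeadAttention stores its parameters):

```python
    # Sketch: per-block mapping for LayerNorms and MLP (attention omitted;
    # it depends on your MultiHeadAttention's parameter layout).
    for i, block in enumerate(model.blocks):
        prefix = f"transformer.h.{i}."

        block.ln1.gamma.data = hf_state[prefix + "ln_1.weight"]
        block.ln1.beta.data = hf_state[prefix + "ln_1.bias"]
        block.ln2.gamma.data = hf_state[prefix + "ln_2.weight"]
        block.ln2.beta.data = hf_state[prefix + "ln_2.bias"]

        # HF Conv1D weights are (in, out); nn.Linear expects (out, in), hence .T
        block.mlp.W_in.weight.data = hf_state[prefix + "mlp.c_fc.weight"].T
        block.mlp.W_in.bias.data = hf_state[prefix + "mlp.c_fc.bias"]
        block.mlp.W_out.weight.data = hf_state[prefix + "mlp.c_proj.weight"].T
        block.mlp.W_out.bias.data = hf_state[prefix + "mlp.c_proj.bias"]

    # Final norm and unembedding (GPT-2 ties the unembedding to the token embedding)
    model.ln_final.gamma.data = hf_state["transformer.ln_f.weight"]
    model.ln_final.beta.data = hf_state["transformer.ln_f.bias"]
    model.unembed.weight.data = hf_state["transformer.wte.weight"]
```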
GPT-2 Configurations
| Model | d_model | n_heads | n_layers | Parameters |
|---|---|---|---|---|
| gpt2 (small) | 768 | 12 | 12 | 124M |
| gpt2-medium | 1024 | 16 | 24 | 355M |
| gpt2-large | 1280 | 20 | 36 | 774M |
| gpt2-xl | 1600 | 25 | 48 | 1.5B |
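The GPT2 class above defaults to the small configuration. The larger variants keep the same vocabulary size (50257) and context length (1024) and only change the three architecture hyperparameters, so instantiating (untrained) versions of them is just a matter of passing the table's values:

```python
# Sketch: instantiate the larger variants from the table's hyperparameters.
gpt2_small = GPT2()                                           # 768 / 12 / 12
gpt2_medium = GPT2(d_model=1024, n_heads=16, n_layers=24)
gpt2_large = GPT2(d_model=1280, n_heads=20, n_layers=36)
gpt2_xl = GPT2(d_model=1600, n_heads=25, n_layers=48)
```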
Generating Text
Autoregressive generation:
```python
def generate(model, prompt_tokens, max_new_tokens=50, temperature=1.0):
    tokens = prompt_tokens.clone()
    for _ in range(max_new_tokens):
        # Get logits for the last position
        logits = model(tokens)[:, -1, :]  # (batch, vocab)
        # Apply temperature
        logits = logits / temperature
        # Sample from the distribution
        probs = F.softmax(logits, dim=-1)
        next_token = t.multinomial(probs, num_samples=1)
        # Append and continue
        tokens = t.cat([tokens, next_token], dim=1)
    return tokens
```
- Lower temperature → more deterministic
- Higher temperature → more random
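Putting it together, a usage sketch assuming you've loaded pretrained weights into `model` (the tokenizer comes from HuggingFace; with random weights the continuation will be gibberish):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prompt = "The quick brown fox"
prompt_tokens = tokenizer(prompt, return_tensors="pt")["input_ids"]  # (1, seq)

with t.no_grad():
    out = generate(model, prompt_tokens, max_new_tokens=20, temperature=0.7)

print(tokenizer.decode(out[0]))
```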
Capstone Connection
The residual stream:
Information flows through the model via the residual stream:
```python
x = embed(tokens) + pos_embed
for block in blocks:
    x = x + attn(ln(x))  # Attention writes to the stream
    x = x + mlp(ln(x))   # MLP writes to the stream
```
Every component READS from and WRITES to this stream. When analyzing sycophancy, you'll be asking questions like these (a sketch for capturing the stream with hooks follows the list):
- What does each attention head write?
- Which MLP neurons activate for sycophantic vs honest responses?
- Where in the stream does "sycophancy" live?
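To actually get your hands on the stream, you can cache it with PyTorch forward hooks. A minimal sketch, reusing `model` and `tokens` from the earlier sketches:

```python
# Sketch: cache the residual stream after each block with forward hooks.
resid_cache = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        resid_cache[layer_idx] = output.detach()
    return hook

handles = [
    block.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.blocks)
]

logits = model(tokens)
print(resid_cache[0].shape)   # (batch, seq, d_model) after block 0

for h in handles:
    h.remove()
```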
🎓 Tyla's Exercise
Calculate the total parameter count for GPT-2 small. Break it down by component.
Why pre-norm (LayerNorm before attention/MLP) instead of post-norm (after)? What training benefits does this provide?
The residual stream has dimension d_model throughout. Why not increase it in deeper layers?
💻 Aaliyah's Exercise
Build and verify GPT-2:
```python
def build_gpt2():
    """
    1. Implement the GPT2 class
    2. Load pretrained weights from HuggingFace
    3. Verify outputs match the HuggingFace model
    4. Generate a text continuation for "The quick brown fox"
    """
    pass


def compare_activations(our_model, hf_model, text):
    """
    Run both models on the same input.
    Compare intermediate activations (embeddings, attention patterns, etc.).
    They should match within floating point precision.
    """
    pass
```
📚 Maneesha's Reflection
GPT-2 was trained on ~40GB of web text. How might the training data influence what "sycophancy" means to the model?
The transformer architecture has remained largely unchanged since 2017. What does this stability tell us about the design?
If you were explaining the residual stream to someone who understands rivers but not neural networks, what metaphor would you use?