Transformers: Building GPT-2

Time to assemble a complete transformer. By the end, you'll have a working GPT-2 that can generate text.
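
The code below assumes the standard PyTorch imports, matching the t, nn, and F aliases used throughout (an assumption about the surrounding notebook):

import torch as t
import torch.nn as nn
import torch.nn.functional as F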


GPT-2 Architecture Overview

Input Tokens
     ↓
Token Embedding + Position Embedding
     ↓
┌─────────────────────────────────┐
│ TransformerBlock × 12           │
│  ├─ LayerNorm                   │
│  ├─ Multi-Head Attention        │
│  ├─ + Residual                  │
│  ├─ LayerNorm                   │
│  ├─ MLP                         │
│  └─ + Residual                  │
└─────────────────────────────────┘
     ↓
Final LayerNorm
     ↓
Unembed → Logits

Layer Normalization

Normalize across the feature dimension:

class LayerNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(t.ones(d_model))
        self.beta = nn.Parameter(t.zeros(d_model))

    def forward(self, x: t.Tensor) -> t.Tensor:
        # Normalize across last dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / t.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta
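
A quick sanity check of the class above (a sketch with random inputs): since gamma starts at 1 and beta at 0, every token's features should come out with mean ≈ 0 and biased variance ≈ 1.

ln = LayerNorm(d_model=16)
x = t.randn(2, 5, 16)
out = ln(x)
print(out.mean(dim=-1).abs().max())            # ≈ 0
print(out.var(dim=-1, unbiased=False).mean())  # ≈ 1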

Why LayerNorm instead of BatchNorm? BatchNorm's statistics depend on the other examples in the batch, which breaks down with variable-length sequences, small batches, and batch-size-1 inference. LayerNorm normalizes each token's features independently, so it behaves identically at training and inference time.


The MLP Block

Two linear layers with GELU activation:

class MLP(nn.Module):
    def __init__(self, d_model: int, d_mlp: int | None = None):
        super().__init__()
        d_mlp = d_mlp or 4 * d_model  # Typically 4× hidden dim

        self.W_in = nn.Linear(d_model, d_mlp)
        self.W_out = nn.Linear(d_mlp, d_model)

    def forward(self, x: t.Tensor) -> t.Tensor:
        return self.W_out(F.gelu(self.W_in(x)))

GELU (Gaussian Error Linear Unit): $$\text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x)$$

Smoother than ReLU, often works better for transformers.
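
The sigmoid approximation is close to the exact (erf-based) GELU over the usual activation range; a small sketch comparing it against PyTorch's built-in F.gelu:

x = t.linspace(-3, 3, 101)
exact = F.gelu(x)                    # exact, erf-based GELU
approx = x * t.sigmoid(1.702 * x)    # sigmoid approximation
print((exact - approx).abs().max())  # maximum gap is on the order of 1e-2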


The Transformer Block

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = LayerNorm(d_model)
        self.mlp = MLP(d_model)

    def forward(self, x: t.Tensor) -> t.Tensor:
        # Pre-norm architecture (GPT-2 style)
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

Note the residual connections: x = x + f(x). These are crucial: gradients flow straight through the addition, so deep stacks of blocks stay trainable, and each block only has to learn a small update to what is already in the stream (the "residual stream" we return to below).


The Full Model

class GPT2(nn.Module):
    def __init__(
        self,
        vocab_size: int = 50257,
        d_model: int = 768,
        n_heads: int = 12,
        n_layers: int = 12,
        max_seq_len: int = 1024,
    ):
        super().__init__()

        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_seq_len, d_model)

        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads)
            for _ in range(n_layers)
        ])

        self.ln_final = LayerNorm(d_model)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens: t.Tensor) -> t.Tensor:
        batch, seq_len = tokens.shape

        # Embeddings
        x = self.token_embed(tokens)
        x = x + self.pos_embed(t.arange(seq_len, device=tokens.device))

        # Transformer blocks
        for block in self.blocks:
            x = block(x)

        # Output
        x = self.ln_final(x)
        logits = self.unembed(x)

        return logits
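
A quick shape check on a freshly initialized model (a sketch; MultiHeadAttention is assumed from the previous section, and the random weights make the outputs meaningless until we load pretrained ones):

model = GPT2()
tokens = t.randint(0, 50257, (2, 10))  # batch of 2, sequence length 10
logits = model(tokens)
print(logits.shape)                    # torch.Size([2, 10, 50257])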

Loading Pretrained Weights

from transformers import GPT2LMHeadModel

def load_gpt2_weights(model: GPT2):
    """Load weights from HuggingFace GPT-2."""
    hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
    hf_state = hf_model.state_dict()

    # Map HuggingFace names to our names
    # (This is tedious but necessary)
    model.token_embed.weight.data = hf_state["transformer.wte.weight"]
    model.pos_embed.weight.data = hf_state["transformer.wpe.weight"]
    # ... etc for all layers
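
One gotcha worth flagging: HuggingFace's GPT-2 stores its attention and MLP projections as Conv1D modules, whose weights are laid out as (in_features, out_features), while nn.Linear expects (out_features, in_features), so those weights need a transpose on the way in. A sketch of how the mapping continues for block 0's MLP (assuming the model and hf_state variables from the function above):

# Inside load_gpt2_weights, for block 0's MLP (Conv1D weights are transposed for nn.Linear):
model.blocks[0].mlp.W_in.weight.data = hf_state["transformer.h.0.mlp.c_fc.weight"].T
model.blocks[0].mlp.W_in.bias.data = hf_state["transformer.h.0.mlp.c_fc.bias"]
model.blocks[0].mlp.W_out.weight.data = hf_state["transformer.h.0.mlp.c_proj.weight"].T
model.blocks[0].mlp.W_out.bias.data = hf_state["transformer.h.0.mlp.c_proj.bias"]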

GPT-2 Configurations

Model          d_model   n_heads   n_layers   Parameters
gpt2 (small)       768        12         12         124M
gpt2-medium       1024        16         24         355M
gpt2-large        1280        20         36         774M
gpt2-xl            1600        25         48         1.5B
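
To connect the table to the code, here is a quick parameter count of the model defined above (a sketch). Note that our GPT2 class uses a separate, untied unembed matrix, so it counts the 50257 × 768 embedding twice; the published 124M figure for gpt2 (small) ties the unembedding to the token embedding:

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

print(f"{count_params(GPT2()):,}")  # ≈163M here; ≈124M if the unembedding were tied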

Generating Text

Autoregressive generation:

def generate(model, prompt_tokens, max_new_tokens=50, temperature=1.0):
    tokens = prompt_tokens.clone()

    for _ in range(max_new_tokens):
        # Get logits for last position
        logits = model(tokens)[:, -1, :]  # (batch, vocab)

        # Apply temperature
        logits = logits / temperature

        # Sample from distribution
        probs = F.softmax(logits, dim=-1)
        next_token = t.multinomial(probs, num_samples=1)

        # Append and continue
        tokens = t.cat([tokens, next_token], dim=1)

    return tokens
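
Trying it end to end (a sketch; it assumes the pretrained weights have been loaded and uses HuggingFace's GPT-2 tokenizer, since our model consumes token IDs rather than strings):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
prompt = tokenizer("The quick brown fox", return_tensors="pt")["input_ids"]

with t.no_grad():  # no gradients needed during generation
    out = generate(model, prompt, max_new_tokens=20, temperature=0.8)

print(tokenizer.decode(out[0]))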

Lower temperature → more deterministic sampling.
Higher temperature → more random sampling.
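
You can see the effect directly by sweeping the temperature over a fixed set of logits (a small sketch):

logits = t.tensor([2.0, 1.0, 0.0])
for temp in (0.5, 1.0, 2.0):
    print(temp, F.softmax(logits / temp, dim=-1))
# At temp=0.5 the distribution is sharply peaked on the top logit;
# at temp=2.0 it is much closer to uniform.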


Capstone Connection

The residual stream:

Information flows through the model via the residual stream:

x = embed(tokens) + pos_embed
for block in blocks:
    x = x + attn(ln(x))   # Attention writes to stream
    x = x + mlp(ln(x))    # MLP writes to stream

Every component READS from and WRITES to this stream. When analyzing sycophancy in the capstone, this is the object we'll inspect.
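
A minimal sketch of recording the stream with standard PyTorch forward hooks (reusing the model and tokens from the shape check earlier; the resid_post naming is just a convention chosen here):

# Record the residual stream after each block via forward hooks.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for i, block in enumerate(model.blocks):
    block.register_forward_hook(make_hook(f"resid_post_{i}"))

_ = model(tokens)
print(activations["resid_post_0"].shape)  # (batch, seq_len, d_model)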


🎓 Tyla's Exercise

  1. Calculate the total parameter count for GPT-2 small. Break it down by component.

  2. Why pre-norm (LayerNorm before attention/MLP) instead of post-norm (after)? What training benefits does this provide?

  3. The residual stream has dimension d_model throughout. Why not increase it in deeper layers?


💻 Aaliyah's Exercise

Build and verify GPT-2:

def build_gpt2():
    """
    1. Implement GPT2 class
    2. Load pretrained weights from HuggingFace
    3. Verify outputs match HuggingFace model
    4. Generate text continuation for "The quick brown fox"
    """
    pass

def compare_activations(our_model, hf_model, text):
    """
    Run both models on the same input.
    Compare intermediate activations (embeddings, attention patterns, etc.)
    They should match within floating point precision.
    """
    pass

📚 Maneesha's Reflection

  1. GPT-2 was trained on ~40GB of web text. How might the training data influence what "sycophancy" means to the model?

  2. The transformer architecture has remained largely unchanged since 2017. What does this stability tell us about the design?

  3. If you were explaining the residual stream to someone who understands rivers but not neural networks, what metaphor would you use?