Transformers: Building GPT-2

Time to assemble a complete transformer. By the end, you'll have a working GPT-2 that can generate text.
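
The code below assumes the standard PyTorch imports, matching the t, nn, and F aliases used throughout (an assumption about the surrounding notebook):

import torch as t
import torch.nn as nn
import torch.nn.functional as F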


GPT-2 Architecture Overview

Input Tokens
     ↓
Token Embedding + Position Embedding
     ↓
┌─────────────────────────────────┐
│ TransformerBlock × 12           │
│  ├─ LayerNorm                   │
│  ├─ Multi-Head Attention        │
│  ├─ + Residual                  │
│  ├─ LayerNorm                   │
│  ├─ MLP                         │
│  └─ + Residual                  │
└─────────────────────────────────┘
     ↓
Final LayerNorm
     ↓
Unembed → Logits

Layer Normalization

Normalize across the feature dimension:

class LayerNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(t.ones(d_model))
        self.beta = nn.Parameter(t.zeros(d_model))

    def forward(self, x: t.Tensor) -> t.Tensor:
        # Normalize across last dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / t.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta
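
A quick sanity check of the class above (a sketch with random inputs): since gamma starts at 1 and beta at 0, every token's features should come out with mean ≈ 0 and biased variance ≈ 1.

ln = LayerNorm(d_model=16)
x = t.randn(2, 5, 16)
out = ln(x)
print(out.mean(dim=-1).abs().max())            # ≈ 0
print(out.var(dim=-1, unbiased=False).mean())  # ≈ 1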

Why LayerNorm instead of BatchNorm? BatchNorm's statistics depend on the other examples in the batch, which breaks down with variable-length sequences, small batches, and batch-size-1 inference. LayerNorm normalizes each token's features independently, so it behaves identically at training and inference time.


The MLP Block

Two linear layers with GELU activation:

class MLP(nn.Module):
    def __init__(self, d_model: int, d_mlp: int | None = None):
        super().__init__()
        d_mlp = d_mlp or 4 * d_model  # Typically 4× hidden dim

        self.W_in = nn.Linear(d_model, d_mlp)
        self.W_out = nn.Linear(d_mlp, d_model)

    def forward(self, x: t.Tensor) -> t.Tensor:
        return self.W_out(F.gelu(self.W_in(x)))

GELU (Gaussian Error Linear Unit): $$\text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x)$$

Smoother than ReLU, often works better for transformers.
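
The sigmoid approximation is close to the exact (erf-based) GELU over the usual activation range; a small sketch comparing it against PyTorch's built-in F.gelu:

x = t.linspace(-3, 3, 101)
exact = F.gelu(x)                    # exact, erf-based GELU
approx = x * t.sigmoid(1.702 * x)    # sigmoid approximation
print((exact - approx).abs().max())  # maximum gap is on the order of 1e-2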


The Transformer Block

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = LayerNorm(d_model)
        self.mlp = MLP(d_model)

    def forward(self, x: t.Tensor) -> t.Tensor:
        # Pre-norm architecture (GPT-2 style)
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

Note the residual connections: x = x + f(x). These are crucial: gradients flow straight through the addition, so deep stacks of blocks stay trainable, and each block only has to learn a small update to what is already in the stream (the "residual stream" we return to below).


The Full Model

class GPT2(nn.Module):
    def __init__(
        self,
        vocab_size: int = 50257,
        d_model: int = 768,
        n_heads: int = 12,
        n_layers: int = 12,
        max_seq_len: int = 1024,
    ):
        super().__init__()

        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_seq_len, d_model)

        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads)
            for _ in range(n_layers)
        ])

        self.ln_final = LayerNorm(d_model)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens: t.Tensor) -> t.Tensor:
        batch, seq_len = tokens.shape

        # Embeddings
        x = self.token_embed(tokens)
        x = x + self.pos_embed(t.arange(seq_len, device=tokens.device))

        # Transformer blocks
        for block in self.blocks:
            x = block(x)

        # Output
        x = self.ln_final(x)
        logits = self.unembed(x)

        return logits
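
A quick shape check on a freshly initialized model (a sketch; MultiHeadAttention is assumed from the previous section, and the random weights make the outputs meaningless until we load pretrained ones):

model = GPT2()
tokens = t.randint(0, 50257, (2, 10))  # batch of 2, sequence length 10
logits = model(tokens)
print(logits.shape)                    # torch.Size([2, 10, 50257])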

Loading Pretrained Weights

from transformers import GPT2LMHeadModel

def load_gpt2_weights(model: GPT2):
    """Load weights from HuggingFace GPT-2."""
    hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
    hf_state = hf_model.state_dict()

    # Map HuggingFace names to our names
    # (This is tedious but necessary)
    model.token_embed.weight.data = hf_state["transformer.wte.weight"]
    model.pos_embed.weight.data = hf_state["transformer.wpe.weight"]
    # ... etc for all layers
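
One gotcha worth flagging: HuggingFace's GPT-2 stores its attention and MLP projections as Conv1D modules, whose weights are laid out as (in_features, out_features), while nn.Linear expects (out_features, in_features), so those weights need a transpose on the way in. A sketch of how the mapping continues for block 0's MLP (assuming the model and hf_state variables from the function above):

# Inside load_gpt2_weights, for block 0's MLP (Conv1D weights are transposed for nn.Linear):
model.blocks[0].mlp.W_in.weight.data = hf_state["transformer.h.0.mlp.c_fc.weight"].T
model.blocks[0].mlp.W_in.bias.data = hf_state["transformer.h.0.mlp.c_fc.bias"]
model.blocks[0].mlp.W_out.weight.data = hf_state["transformer.h.0.mlp.c_proj.weight"].T
model.blocks[0].mlp.W_out.bias.data = hf_state["transformer.h.0.mlp.c_proj.bias"]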

GPT-2 Configurations

Model          d_model   n_heads   n_layers   Parameters
gpt2 (small)       768        12         12         124M
gpt2-medium       1024        16         24         355M
gpt2-large        1280        20         36         774M
gpt2-xl            1600        25         48         1.5B
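
To connect the table to the code, here is a quick parameter count of the model defined above (a sketch). Note that our GPT2 class uses a separate, untied unembed matrix, so it counts the 50257 × 768 embedding twice; the published 124M figure for gpt2 (small) ties the unembedding to the token embedding:

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

print(f"{count_params(GPT2()):,}")  # ≈163M here; ≈124M if the unembedding were tied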

Generating Text

Autoregressive generation:

def generate(model, prompt_tokens, max_new_tokens=50, temperature=1.0):
    tokens = prompt_tokens.clone()

    for _ in range(max_new_tokens):
        # Get logits for last position
        logits = model(tokens)[:, -1, :]  # (batch, vocab)

        # Apply temperature
        logits = logits / temperature

        # Sample from distribution
        probs = F.softmax(logits, dim=-1)
        next_token = t.multinomial(probs, num_samples=1)

        # Append and continue
        tokens = t.cat([tokens, next_token], dim=1)

    return tokens
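
Trying it end to end (a sketch; it assumes the pretrained weights have been loaded and uses HuggingFace's GPT-2 tokenizer, since our model consumes token IDs rather than strings):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
prompt = tokenizer("The quick brown fox", return_tensors="pt")["input_ids"]

with t.no_grad():  # no gradients needed during generation
    out = generate(model, prompt, max_new_tokens=20, temperature=0.8)

print(tokenizer.decode(out[0]))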

Lower temperature → more deterministic sampling.
Higher temperature → more random sampling.
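
You can see the effect directly by sweeping the temperature over a fixed set of logits (a small sketch):

logits = t.tensor([2.0, 1.0, 0.0])
for temp in (0.5, 1.0, 2.0):
    print(temp, F.softmax(logits / temp, dim=-1))
# At temp=0.5 the distribution is sharply peaked on the top logit;
# at temp=2.0 it is much closer to uniform.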


Capstone Connection

The residual stream:

Information flows through the model via the residual stream:

x = embed(tokens) + pos_embed
for block in blocks:
    x = x + attn(ln(x))   # Attention writes to stream
    x = x + mlp(ln(x))    # MLP writes to stream

Every component READS from and WRITES to this stream. When analyzing sycophancy in the capstone, this is the object we'll inspect.
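
A minimal sketch of recording the stream with standard PyTorch forward hooks (reusing the model and tokens from the shape check earlier; the resid_post naming is just a convention chosen here):

# Record the residual stream after each block via forward hooks.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for i, block in enumerate(model.blocks):
    block.register_forward_hook(make_hook(f"resid_post_{i}"))

_ = model(tokens)
print(activations["resid_post_0"].shape)  # (batch, seq_len, d_model)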


🎓 Tyla's Exercise

  1. Calculate the total parameter count for GPT-2 small. Break it down by component.

  2. Why pre-norm (LayerNorm before attention/MLP) instead of post-norm (after)? What training benefits does this provide?

  3. The residual stream has dimension d_model throughout. Why not increase it in deeper layers?


💻 Aaliyah's Exercise

Build and verify GPT-2:

def build_gpt2():
    """
    1. Implement GPT2 class
    2. Load pretrained weights from HuggingFace
    3. Verify outputs match HuggingFace model
    4. Generate text continuation for "The quick brown fox"
    """
    pass

def compare_activations(our_model, hf_model, text):
    """
    Run both models on the same input.
    Compare intermediate activations (embeddings, attention patterns, etc.)
    They should match within floating point precision.
    """
    pass

📚 Maneesha's Reflection

  1. GPT-2 was trained on ~40GB of web text. How might the training data influence what "sycophancy" means to the model?

  2. The transformer architecture has remained largely unchanged since 2017. What does this stability tell us about the design?

  3. If you were explaining the residual stream to someone who understands rivers but not neural networks, what metaphor would you use?