Chapter 0: Linear Layers and Training

Two building blocks for everything that follows: the Linear layer and the training loop.


Part 1: The Linear Layer

A linear layer is just: output = input @ weight.T + bias

But PyTorch wraps it in a class so the weight and bias are stored as nn.Parameter objects, which means the optimizer can find and update them automatically.

The Implementation

import torch as t
import torch.nn as nn
import numpy as np
import einops

class Linear(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()

        self.in_features = in_features
        self.out_features = out_features

        # Weight initialization matters!
        # Too large → gradients explode
        # Too small → gradients vanish
        # Scale by 1/sqrt(in_features) keeps variance stable
        scale = 1 / np.sqrt(in_features)

        # Shape: (out_features, in_features) - intentional!
        weight = scale * (2 * t.rand(out_features, in_features) - 1)
        self.weight = nn.Parameter(weight)

        if bias:
            b = scale * (2 * t.rand(out_features) - 1)
            self.bias = nn.Parameter(b)
        else:
            self.bias = None

    def forward(self, x: t.Tensor) -> t.Tensor:
        # output = x @ weight.T + bias
        out = einops.einsum(
            x, self.weight,
            "... in_f, out_f in_f -> ... out_f"
        )
        if self.bias is not None:
            out = out + self.bias
        return out

Why (out_features, in_features)?

PyTorch stores weights as (out, in) because the forward pass is x @ W.T.

With input x of shape (batch, in), the product x @ W.T has shape (batch, out): each row of W holds the incoming weights for one output feature, so the transpose lines the "in" dimension up for the matrix multiply.
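
A quick shape check with throwaway tensors (the sizes 32, 8, and 16 are arbitrary, chosen just for illustration):

import torch as t

x = t.randn(32, 8)     # (batch, in_features)
W = t.randn(16, 8)     # (out_features, in_features) -- PyTorch's storage convention
out = x @ W.T          # (32, 8) @ (8, 16) -> (batch, out_features)
print(out.shape)       # torch.Size([32, 16])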

Capstone Connection

In transformers, the Q, K, V projections are Linear layers:

Q = self.W_Q(residual)  # (batch, seq, hidden) → (batch, seq, head_dim * n_heads)
K = self.W_K(residual)
V = self.W_V(residual)
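
A minimal sketch of how those projections might be declared, reusing the Linear layer built above (the names hidden_dim, n_heads, and head_dim are illustrative, not taken from any specific codebase):

class AttentionProjections(nn.Module):
    def __init__(self, hidden_dim: int, n_heads: int, head_dim: int):
        super().__init__()
        # Each projection maps the residual stream to the concatenated
        # per-head query/key/value vectors.
        self.W_Q = Linear(hidden_dim, n_heads * head_dim)
        self.W_K = Linear(hidden_dim, n_heads * head_dim)
        self.W_V = Linear(hidden_dim, n_heads * head_dim)

    def forward(self, residual: t.Tensor) -> tuple[t.Tensor, t.Tensor, t.Tensor]:
        # residual: (batch, seq, hidden_dim)
        Q = self.W_Q(residual)  # (batch, seq, n_heads * head_dim)
        K = self.W_K(residual)
        V = self.W_V(residual)
        return Q, K, V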

When we analyze sycophancy, we'll look at what these projections do:

If a model is sycophantic, specific patterns in these weights might cause it to "look for" user preferences rather than truth.


Part 2: The Training Loop

Training is repeated application of:

  1. Forward pass: compute predictions
  2. Loss: how wrong are we?
  3. Backward pass: compute gradients
  4. Update: adjust weights to be less wrong

The Canonical Loop (Memorize This)

def train(model, loader, epochs=10, lr=0.01):
    model.train()  # Enable training mode
    optimizer = t.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()

    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()  # 1. Clear old gradients
            pred = model(x)        # 2. Forward pass
            loss = criterion(pred, y)  # 3. Compute loss
            loss.backward()        # 4. Compute gradients
            optimizer.step()       # 5. Update weights

        print(f"Epoch {epoch+1}: {loss.item():.4f}")

Why optimizer.zero_grad()?

PyTorch accumulates gradients by default. If you don't zero them, gradients from multiple batches add up, which is usually not what you want.
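
A tiny demonstration of that accumulation, using a single throwaway parameter (not part of the chapter's models):

import torch as t
import torch.nn as nn

w = nn.Parameter(t.tensor(1.0))

(w * 3).backward()
print(w.grad)    # tensor(3.)

(w * 3).backward()
print(w.grad)    # tensor(6.) -- the second gradient was added to the first

w.grad = None    # roughly what optimizer.zero_grad() does for every parameter
(w * 3).backward()
print(w.grad)    # tensor(3.) again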

Why loss.backward() before optimizer.step()?

You can't step before you know which direction to go.

The Minimal Loop (Nine Lines)

def train_minimal(model, loader, epochs=10, lr=0.01):
    optimizer = t.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for x, y in loader:
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        print(f"Epoch {epoch}: {loss:.4f}")

Note: zero_grad() can come before or after step() as long as it happens before the next backward().


Part 3: How Training Relates to Sycophancy

Here's the key insight for your capstone:

RLHF uses this same training loop, but with a different loss:

loss = -reward_model(response)

The reward model predicts what humans prefer. If humans prefer agreeable responses, the model learns to agree.

The gradient flows backward through the entire transformer. Every attention head, every MLP, every embedding gets nudged toward producing more "preferred" outputs.
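
A toy sketch of that objective, with a stand-in policy and a frozen stand-in reward model. Real RLHF adds sampling, a KL penalty, and an algorithm like PPO, so treat this purely as an illustration of the gradient direction; all names and shapes here are made up for the example:

import torch as t
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 8))
reward_model = nn.Sequential(nn.Linear(8, 1))

for p in reward_model.parameters():
    p.requires_grad_(False)   # the reward model is frozen; only the policy trains

optimizer = t.optim.Adam(policy.parameters(), lr=1e-3)

prompt = t.randn(4, 16)                 # a batch of "prompts"
response = policy(prompt)               # the policy's "responses"
loss = -reward_model(response).mean()   # maximize predicted reward = minimize its negative

optimizer.zero_grad()
loss.backward()    # gradients flow through every layer of the policy
optimizer.step()   # each weight is nudged toward more "preferred" outputs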

This is how sycophancy emerges from training.

Not from a bug. Not from adversarial inputs. From a training objective that rewards agreement more than accuracy.

Your Milestone 1 will explore this: train a simple model where the "preference" data favors agreeable responses, and observe what happens.


🎓 Tyla's Exercise

  1. Implement the Linear layer without looking at the solution.
  2. Verify it produces the same output as nn.Linear (up to numerical precision).
  3. Reflection: Why does initialization scale matter? What would happen if we initialized weights as all 1s?

💻 Aaliyah's Exercise

  1. Write the training loop from memory (no peeking).
  2. Train a simple model on MNIST.
  3. Add validation accuracy printing.

# Starter code
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Your model and training loop here

📚 Maneesha's Reflection

  1. The training loop is about shaping behavior through feedback. How is this similar to and different from how humans learn?

  2. RLHF uses human preferences as the reward signal. What happens when human preferences are biased? What happens when they're inconsistent?

  3. If you were designing an AI training process that minimized sycophancy, what would the feedback loop look like?