Chapter 0: Linear Layers and Training
Two building blocks for everything that follows: the Linear layer and the training loop.
Part 1: The Linear Layer
A linear layer is just: output = input @ weight.T + bias
But PyTorch wraps it in a class so that:
- Weights are trainable parameters
- Layers are composable (can stack in Sequential)
- Initialization is handled properly
The Implementation
```python
import torch as t
import torch.nn as nn
import numpy as np
import einops


class Linear(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        # Weight initialization matters!
        # Too large → gradients explode
        # Too small → gradients vanish
        # Scale by 1/sqrt(in_features) keeps variance stable
        scale = 1 / np.sqrt(in_features)

        # Shape: (out_features, in_features) - intentional!
        weight = scale * (2 * t.rand(out_features, in_features) - 1)
        self.weight = nn.Parameter(weight)

        if bias:
            b = scale * (2 * t.rand(out_features) - 1)
            self.bias = nn.Parameter(b)
        else:
            self.bias = None

    def forward(self, x: t.Tensor) -> t.Tensor:
        # output = x @ weight.T + bias
        out = einops.einsum(
            x, self.weight,
            "... in_f, out_f in_f -> ... out_f",
        )
        if self.bias is not None:
            out = out + self.bias
        return out
```
Why (out_features, in_features)?
PyTorch stores weights as (out, in) because the forward pass is x @ W.T.
With input x of shape (batch, in):
- W is (out, in)
- W.T is (in, out)
- x @ W.T is (batch, in) @ (in, out) = (batch, out) ✓
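A quick shape check with the Linear class above (a minimal sketch; the printed shapes are exactly what the math predicts):

```python
x = t.rand(8, 16)           # (batch=8, in_features=16)
layer = Linear(16, 32)

print(layer.weight.shape)   # torch.Size([32, 16]) -- (out_features, in_features)
print(layer(x).shape)       # torch.Size([8, 32])  -- (batch, out_features)
```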
Capstone Connection
In transformers, the Q, K, V projections are Linear layers:
```python
Q = self.W_Q(residual)  # (batch, seq, hidden) → (batch, seq, head_dim * n_heads)
K = self.W_K(residual)
V = self.W_V(residual)
```
When we analyze sycophancy, we'll look at what these projections do:
- Q projection: "What is this position looking for?"
- K projection: "What does this position contain?"
- V projection: "What information should be copied?"
If a model is sycophantic, specific patterns in these weights might cause it to "look for" user preferences rather than truth.
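As a rough sketch of how those projections might be wired up using the Linear class from Part 1 (the module and dimension names here — AttentionProjections, hidden_dim, n_heads, head_dim — are illustrative, not taken from any particular transformer implementation):

```python
class AttentionProjections(nn.Module):
    # Illustrative only: just the three projection layers, no attention math.
    def __init__(self, hidden_dim: int, n_heads: int, head_dim: int):
        super().__init__()
        self.W_Q = Linear(hidden_dim, n_heads * head_dim)
        self.W_K = Linear(hidden_dim, n_heads * head_dim)
        self.W_V = Linear(hidden_dim, n_heads * head_dim)

    def forward(self, residual: t.Tensor):
        # residual: (batch, seq, hidden_dim)
        Q = self.W_Q(residual)  # "What is this position looking for?"
        K = self.W_K(residual)  # "What does this position contain?"
        V = self.W_V(residual)  # "What information should be copied?"
        return Q, K, V
```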
Part 2: The Training Loop
Training is repeated application of:
- Forward pass: compute predictions
- Loss: how wrong are we?
- Backward pass: compute gradients
- Update: adjust weights to be less wrong
The Canonical Loop (Memorize This)
```python
def train(model, loader, epochs=10, lr=0.01):
    model.train()  # Enable training mode
    optimizer = t.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()

    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()      # 1. Clear old gradients
            pred = model(x)            # 2. Forward pass
            loss = criterion(pred, y)  # 3. Compute loss
            loss.backward()            # 4. Compute gradients
            optimizer.step()           # 5. Update weights
        print(f"Epoch {epoch+1}: {loss.item():.4f}")
```
Why optimizer.zero_grad()?
PyTorch accumulates gradients by default. If you don't zero them, gradients from multiple batches add up, which is usually not what you want.
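A minimal sketch of that accumulation behaviour:

```python
w = t.tensor(2.0, requires_grad=True)

(w * 3).backward()
print(w.grad)    # tensor(3.)

(w * 3).backward()
print(w.grad)    # tensor(6.) -- the second gradient was added to the first

w.grad.zero_()   # this is what optimizer.zero_grad() does for every parameter
print(w.grad)    # tensor(0.)
```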
Why loss.backward() before optimizer.step()?
- backward() computes gradients and stores them in param.grad
- step() uses those gradients to update parameters
You can't step before you know which direction to go.
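You can see the ordering directly (a small sketch, reusing the Linear layer from Part 1):

```python
layer = Linear(4, 2)
print(layer.weight.grad)        # None -- no gradients have been computed yet

loss = layer(t.rand(3, 4)).sum()
loss.backward()
print(layer.weight.grad.shape)  # torch.Size([2, 4]) -- now step() has something to use
```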
The Minimal Loop (10 Lines)
```python
def train_minimal(model, loader, epochs=10, lr=0.01):
    optimizer = t.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for x, y in loader:
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        print(f"Epoch {epoch}: {loss:.4f}")
```
Note: zero_grad() can come before or after step() as long as it happens before the next backward().
Part 3: How Training Relates to Sycophancy
Here's the key insight for your capstone:
RLHF uses this same training loop, but with a different loss:
```python
loss = -reward_model(response)
```
The reward model predicts what humans prefer. If humans prefer agreeable responses, the model learns to agree.
The gradient flows backward through the entire transformer. Every attention head, every MLP, every embedding gets nudged toward producing more "preferred" outputs.
This is how sycophancy emerges from training.
Not from a bug. Not from adversarial inputs. From a training objective that rewards agreement more than accuracy.
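As a toy sketch of that objective only — not of RLHF itself, which involves sampling, a learned reward model over text, a KL penalty, and an algorithm like PPO — the same loop shape works with a preference-style loss. Every name here (policy, reward_model, prompts) is a hypothetical stand-in:

```python
# Toy "policy" and frozen "reward model" (both hypothetical stand-ins)
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
reward_model = nn.Sequential(nn.Linear(8, 1))
for p in reward_model.parameters():
    p.requires_grad_(False)  # the reward model is fixed; only the policy is trained

optimizer = t.optim.Adam(policy.parameters(), lr=1e-3)

prompts = t.rand(64, 16)               # stand-in for a batch of tokenized prompts
response = policy(prompts)             # stand-in for generated responses
loss = -reward_model(response).mean()  # maximize predicted "preference"

optimizer.zero_grad()
loss.backward()                        # gradients flow through every policy layer
optimizer.step()
```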
Your Milestone 1 will explore this: train a simple model where the "preference" data favors agreeable responses, and observe what happens.
🎓 Tyla's Exercise
- Implement the Linear layer without looking at the solution.
- Verify it produces the same output as nn.Linear (up to numerical precision); one way to run this check is sketched below.
- Reflection: Why does initialization scale matter? What would happen if we initialized weights as all 1s?
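A sketch of the equivalence check (it copies our parameters into an nn.Linear so the two layers share weights):

```python
ours = Linear(16, 32)
ref = nn.Linear(16, 32)
with t.no_grad():
    ref.weight.copy_(ours.weight)
    ref.bias.copy_(ours.bias)

x = t.rand(8, 16)
assert t.allclose(ours(x), ref(x), atol=1e-6)
print("outputs match")
```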
💻 Aaliyah's Exercise
- Write the training loop from memory (no peeking).
- Train a simple model on MNIST.
- Add validation accuracy printing.
```python
# Starter code
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Your model and training loop here
```
📚 Maneesha's Reflection
The training loop is about shaping behavior through feedback. How is this similar to and different from how humans learn?
RLHF uses human preferences as the reward signal. What happens when human preferences are biased? What happens when they're inconsistent?
If you were designing an AI training process that minimized sycophancy, what would the feedback loop look like?