Optimization: SGD & Momentum
Gradient descent finds the path down the loss landscape. Understanding optimizers is understanding how models learn.
The Core Idea
A loss function measures how wrong our model is. Training minimizes this loss.
Gradient descent: Move in the direction that decreases loss most quickly.
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$
Where:
- $\theta$ = model parameters
- $\eta$ = learning rate
- $\nabla_\theta L$ = gradient of loss with respect to parameters
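To make the update rule concrete, here is a minimal sketch on a toy one-dimensional loss (the quadratic and the learning rate are illustrative choices, not from any real model):

```python
import torch as t

# Toy loss: L(theta) = (theta - 3)^2, minimized at theta = 3
theta = t.tensor(0.0, requires_grad=True)
lr = 0.1  # learning rate (eta)

for step in range(50):
    loss = (theta - 3) ** 2
    loss.backward()                # fills theta.grad with dL/dtheta = 2 * (theta - 3)
    with t.no_grad():
        theta -= lr * theta.grad   # theta_{t+1} = theta_t - eta * grad
        theta.grad.zero_()         # clear the gradient for the next step

print(theta.item())  # converges toward 3.0
```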
Stochastic Gradient Descent (SGD)
True gradient descent computes the gradient over ALL data. Too expensive!
Stochastic gradient descent estimates the gradient from a mini-batch:
```python
for batch_x, batch_y in dataloader:
    # Estimate gradient from mini-batch
    loss = criterion(model(batch_x), batch_y)
    loss.backward()

    # Update parameters
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
            param.grad.zero_()
```
The noise from mini-batches actually helps! It provides regularization and helps escape local minima.
Implementing SGD
```python
import torch as t

class SGD:
    def __init__(self, params, lr: float):
        self.params = list(params)  # hold references to the model's parameter tensors
        self.lr = lr

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

    def step(self):
        with t.no_grad():
            for p in self.params:
                if p.grad is not None:
                    p -= self.lr * p.grad
```
Note: We use `p -= ...` (in-place) instead of `p = p - ...` because the optimizer holds references to the original parameter tensors; rebinding the name would create a new tensor the model never sees.
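A tiny illustration of the difference (toy tensors, no gradients involved):

```python
import torch as t

w = t.tensor([1.0, 2.0])
params = [w]            # the optimizer holds a reference to the same tensor

p = params[0]
p = p - 0.1             # rebinds the local name only; the model's tensor is untouched
print(w)                # tensor([1., 2.])

p = params[0]
p -= 0.1                # in-place: mutates the tensor the model still uses
print(w)                # tensor([0.9000, 1.9000])
```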
The Pathological Curvature Problem
Loss landscapes aren't nice bowls. They have ravines—steep in one direction, gentle in another.
(Figure: cross-section of a ravine. The floor slopes gently toward the minimum, which is the direction we want to move; the walls are steep, and the gradient mostly points across them, so each step goes back and forth across the ravine instead of along it.)
With vanilla SGD:
- Large learning rate: Oscillate across the ravine
- Small learning rate: Crawl slowly along it
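A small sketch of this dilemma, using a made-up quadratic ravine (the loss function and learning rate are illustrative, not from the text):

```python
import torch as t

def ravine(p):
    x, y = p
    return 10 * x**2 + 0.1 * y**2   # steep in x, gentle in y

p = t.tensor([1.0, 5.0], requires_grad=True)
lr = 0.09  # large enough that the steep x-direction overshoots

for step in range(10):
    loss = ravine(p)
    loss.backward()
    with t.no_grad():
        p -= lr * p.grad
        p.grad.zero_()
    print(f"x = {p[0].item():+.3f}   y = {p[1].item():+.3f}")
# x flips sign every step (bouncing across the ravine),
# while y shrinks by only ~2% per step (crawling along it).
```

Shrinking the learning rate stops the bouncing, but makes the already-slow y-direction even slower.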
Momentum
Add a "velocity" term that accumulates past gradients:
$$v_t = \mu v_{t-1} + \nabla_\theta L(\theta_t)$$

$$\theta_{t+1} = \theta_t - \eta v_t$$
Where $\mu$ (momentum coefficient) is typically 0.9.
Intuition: A ball rolling down a hill accumulates speed. It doesn't stop immediately when the slope changes.
```python
class SGDWithMomentum:
    def __init__(self, params, lr: float, momentum: float = 0.9):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        # Initialize velocity for each parameter
        self.v = [t.zeros_like(p) for p in self.params]

    def step(self):
        with t.no_grad():
            for p, v in zip(self.params, self.v):
                if p.grad is not None:
                    v.mul_(self.momentum).add_(p.grad)
                    p -= self.lr * v
```
Visualizing Momentum
On a loss surface with pathological curvature:
```python
def pathological_loss(x, y):
    """Ravine-shaped loss: steep in x, gentle in y."""
    return t.tanh(x)**2 + 0.01 * t.abs(x) + t.sigmoid(y)

# SGD without momentum: oscillates wildly
# SGD with momentum: builds up velocity along the ravine
```
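A sketch of the comparison, reusing the `SGD` and `SGDWithMomentum` classes defined above (the starting point, learning rate, and step count are arbitrary choices for illustration):

```python
def run(optimizer_cls, steps=100, **kwargs):
    xy = t.tensor([2.5, 2.5], requires_grad=True)   # start on the ravine wall
    opt = optimizer_cls([xy], **kwargs)
    path = []
    for _ in range(steps):
        if xy.grad is not None:
            xy.grad.zero_()   # SGDWithMomentum above has no zero_grad(), so clear manually
        loss = pathological_loss(xy[0], xy[1])
        loss.backward()
        opt.step()
        path.append(xy.detach().clone())
    return t.stack(path)

plain = run(SGD, lr=0.02)
heavy = run(SGDWithMomentum, lr=0.02, momentum=0.9)
# Plot the two (x, y) paths: over the same number of steps, the momentum run
# travels noticeably farther along the gentle y-direction
# (on the order of 1 / (1 - momentum) = 10x at steady state).
```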
Momentum:
- Dampens oscillations (gradients that alternate in sign cancel out in the velocity)
- Accelerates along consistent directions
- Helps escape flat regions and local minima
Weight Decay
Regularization by shrinking weights toward zero:
$$\theta_{t+1} = \theta_t - \eta (\nabla_\theta L + \lambda \theta_t)$$
Equivalent to adding $\frac{\lambda}{2}||\theta||^2$ to the loss.
```python
class SGDWithDecay:
    def __init__(self, params, lr: float, weight_decay: float = 0.01):
        self.params = list(params)
        self.lr = lr
        self.weight_decay = weight_decay

    def step(self):
        with t.no_grad():
            for p in self.params:
                if p.grad is not None:
                    # Add weight decay to the gradient
                    p.grad.add_(p, alpha=self.weight_decay)
                    p -= self.lr * p.grad
```
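In practice, weight decay is usually applied selectively through parameter groups. A sketch of one common convention using `torch.optim.SGD` (the model here is just a placeholder):

```python
import torch as t
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# Decay the weight matrices, but not the biases
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = t.optim.SGD(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.1,
    momentum=0.9,
)
```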
Learning Rate Schedules
A fixed learning rate is rarely optimal.

Warm-up: Start small, increase gradually

```python
if step < warmup_steps:
    lr = base_lr * step / warmup_steps
```

Decay: Decrease over time

```python
lr = base_lr * decay_rate ** (step / decay_steps)
```

Cosine annealing: Smooth decrease from the base rate to zero

```python
lr = base_lr * (1 + math.cos(math.pi * step / total_steps)) / 2
```
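These pieces are often combined; a sketch of a linear warm-up followed by cosine annealing (the function name and numbers are illustrative):

```python
import math

def lr_at(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up, then cosine annealing down to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1 + math.cos(math.pi * progress)) / 2

# Applied each step by updating the optimizer's parameter groups:
# for group in optimizer.param_groups:
#     group["lr"] = lr_at(step, base_lr=0.1, warmup_steps=500, total_steps=10_000)
```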
Capstone Connection
RLHF training uses these optimizers:
When fine-tuning a model with human feedback:
```python
# The same gradient descent, but now optimizing for human preference
optimizer.zero_grad()
reward = reward_model(response)
loss = -reward  # Maximize reward = minimize negative reward
loss.backward()
optimizer.step()
```
If the reward model is biased toward agreeable responses, gradient descent will push the model toward sycophancy—step by step, batch by batch.
Understanding optimization = understanding how sycophancy emerges from training.
🎓 Tyla's Exercise
Derive why momentum dampens oscillations. (Hint: What happens when gradients alternate signs?)
Prove that SGD with weight decay is equivalent to L2 regularization (adding $\frac{\lambda}{2}||\theta||^2$ to the loss).
Why do we typically NOT apply weight decay to bias terms?
💻 Aaliyah's Exercise
Implement and compare optimizers:
```python
def train_with_optimizer(model, optimizer_class, **kwargs):
    """
    Train on MNIST for 3 epochs.
    Return the training loss curve and final accuracy.
    """
    pass

# Compare:
# 1. SGD(lr=0.1)
# 2. SGD(lr=0.1, momentum=0.9)
# 3. SGD(lr=0.1, momentum=0.9, weight_decay=0.01)

# Plot loss curves. Which converges fastest?
# Which achieves highest validation accuracy?
```
📚 Maneesha's Reflection
The analogy of a "ball rolling down a hill" for momentum is intuitive but imperfect. Where does the analogy break down?
Learning rate is often the most important hyperparameter. Why do you think finding the right one is so difficult?
If training is "teaching the model," what does the learning rate represent pedagogically? What about momentum?