Optimization: SGD & Momentum

Gradient descent finds the path down the loss landscape. Understanding optimizers is understanding how models learn.


The Core Idea

A loss function measures how wrong our model is. Training minimizes this loss.

Gradient descent: Move in the direction that decreases loss most quickly.

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$

Where:

- $\theta_t$: the model parameters at step $t$
- $\eta$: the learning rate (step size)
- $\nabla_\theta L(\theta_t)$: the gradient of the loss with respect to the parameters
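
For a concrete feel, here is a minimal sketch of a single update on the toy loss $L(\theta) = \theta^2$ (the starting value and learning rate are arbitrary choices, not from the text):

import torch as t

# L(theta) = theta^2, so dL/dtheta = 2 * theta
theta = t.tensor([3.0], requires_grad=True)
lr = 0.1

loss = (theta ** 2).sum()
loss.backward()                 # theta.grad is now 6.0

with t.no_grad():
    theta -= lr * theta.grad    # 3.0 - 0.1 * 6.0 = 2.4
    theta.grad.zero_()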


Stochastic Gradient Descent (SGD)

True gradient descent computes the gradient over ALL data. Too expensive!

Stochastic gradient descent estimates the gradient from a mini-batch:

for batch_x, batch_y in dataloader:
    # Estimate gradient from mini-batch
    loss = criterion(model(batch_x), batch_y)
    loss.backward()

    # Update parameters
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
            param.grad.zero_()

The noise from mini-batches actually helps! It provides regularization and helps escape local minima.


Implementing SGD

import torch as t

class SGD:
    def __init__(self, params, lr: float):
        self.params = list(params)
        self.lr = lr

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()

    def step(self):
        with t.no_grad():
            for p in self.params:
                if p.grad is not None:
                    p -= self.lr * p.grad

Note: We use p -= ... (in-place) instead of p = p - ... because the optimizer holds references to the model's parameter tensors; reassigning p would only rebind the local variable and leave the model's weights untouched.
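
A minimal usage sketch of this class (the model and data here are placeholders, chosen only to exercise the optimizer):

import torch as t
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x, y = t.randn(64, 10), t.randn(64, 1)   # random stand-in data

for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()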


The Pathological Curvature Problem

Loss landscapes aren't nice bowls. They have ravines—steep in one direction, gentle in another.

(Picture a ravine: steep walls on either side, a gently sloping floor running toward the minimum. The direction we want to travel runs along the floor, but the gradient mostly points down the nearest wall, across the ravine instead.)

With vanilla SGD:

  1. The gradient points mostly across the ravine, so updates bounce between the steep walls.
  2. Progress along the gentle direction, the one we actually care about, is slow.
  3. The learning rate has to stay small to keep the oscillations stable, which slows the gentle direction even further.
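
A minimal numeric sketch of this behaviour, using an elongated quadratic as a stand-in for the ravine (the function and constants are illustrative, not from the text):

import torch as t

def ravine(x, y):
    # Steep in x (curvature 10), gentle in y (curvature 0.1)
    return 10 * x**2 + 0.1 * y**2

params = t.tensor([1.0, 5.0], requires_grad=True)
lr = 0.09

for step in range(5):
    loss = ravine(params[0], params[1])
    loss.backward()
    with t.no_grad():
        params -= lr * params.grad
        params.grad.zero_()
    print(step, params.tolist())

# x flips sign every step (oscillating across the ravine),
# while y barely moves (slow progress along it).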


Momentum

Add a "velocity" term that accumulates past gradients:

$$v_t = \mu v_{t-1} + \nabla_\theta L(\theta_t)$$

$$\theta_{t+1} = \theta_t - \eta v_t$$

Where $\mu$ (momentum coefficient) is typically 0.9.

Intuition: A ball rolling down a hill accumulates speed. It doesn't stop immediately when the slope changes.

class SGDWithMomentum:
    def __init__(self, params, lr: float, momentum: float = 0.9):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        # Initialize velocity for each parameter
        self.v = [t.zeros_like(p) for p in self.params]

    def step(self):
        with t.no_grad():
            for p, v in zip(self.params, self.v):
                if p.grad is not None:
                    v.mul_(self.momentum).add_(p.grad)
                    p -= self.lr * v
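
If I am reading the PyTorch defaults correctly (dampening=0, nesterov=False), this update rule matches torch.optim.SGD with momentum; a quick equivalence check, as a sketch:

import torch as t
import torch.nn as nn
import torch.optim as optim

# Two identical parameters: one stepped by our class, one by torch.optim.SGD
p_ours = nn.Parameter(t.tensor([1.0, -2.0]))
p_ref = nn.Parameter(p_ours.detach().clone())

ours = SGDWithMomentum([p_ours], lr=0.1, momentum=0.9)
ref = optim.SGD([p_ref], lr=0.1, momentum=0.9)

for _ in range(3):
    for p in (p_ours, p_ref):
        if p.grad is not None:
            p.grad.zero_()
    (p_ours ** 2).sum().backward()   # same quadratic loss for both copies
    (p_ref ** 2).sum().backward()
    ours.step()
    ref.step()

print(t.allclose(p_ours, p_ref))  # expected: True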

Visualizing Momentum

On a pathological curvature:

def pathological_loss(x, y):
    """Ravine-shaped loss: steep in x, gentle in y"""
    return t.tanh(x)**2 + 0.01 * t.abs(x) + t.sigmoid(y)

# SGD without momentum: oscillates wildly
# SGD with momentum: builds up velocity along the ravine
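
A sketch of that comparison using the pathological_loss defined above (starting point, learning rate, and step count are arbitrary choices):

import torch as t

def run(momentum: float, steps: int = 50):
    params = t.tensor([2.0, 3.0], requires_grad=True)
    v = t.zeros_like(params)
    lr = 0.1
    for _ in range(steps):
        loss = pathological_loss(params[0], params[1])
        loss.backward()
        with t.no_grad():
            v.mul_(momentum).add_(params.grad)
            params -= lr * v
            params.grad.zero_()
    return params.detach()

print("no momentum: ", run(momentum=0.0))
print("momentum 0.9:", run(momentum=0.9))
# The momentum run typically travels much further along y, the gentle direction.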

Momentum:

  1. Dampens oscillations (averaging cancels out)
  2. Accelerates along consistent directions
  3. Helps escape flat regions and local minima

Weight Decay

Regularization by shrinking weights toward zero:

$$\theta_{t+1} = \theta_t - \eta (\nabla_\theta L + \lambda \theta_t)$$

Equivalent to adding $\frac{\lambda}{2}||\theta||^2$ to the loss.

class SGDWithDecay:
    def __init__(self, params, lr: float, weight_decay: float = 0.01):
        self.params = list(params)
        self.lr = lr
        self.weight_decay = weight_decay

    def step(self):
        with t.no_grad():
            for p in self.params:
                if p.grad is not None:
                    # Add weight decay to gradient
                    p.grad.add_(p, alpha=self.weight_decay)
                    p -= self.lr * p.grad
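
A quick numeric check of that equivalence, as a sketch (one parameter, one step, with p^2 standing in for the data loss; values are arbitrary):

import torch as t

lr, wd = 0.1, 0.01

# Route 1: weight decay added to the gradient (as in SGDWithDecay)
p1 = t.tensor([2.0], requires_grad=True)
(p1 ** 2).sum().backward()
with t.no_grad():
    p1 -= lr * (p1.grad + wd * p1)

# Route 2: L2 penalty added to the loss itself
p2 = t.tensor([2.0], requires_grad=True)
((p2 ** 2).sum() + (wd / 2) * (p2 ** 2).sum()).backward()
with t.no_grad():
    p2 -= lr * p2.grad

print(p1, p2)  # the two updates agree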

Learning Rate Schedules

A fixed learning rate is rarely optimal.

Warm-up: Start small, increase gradually

if step < warmup_steps:
    lr = base_lr * step / warmup_steps

Decay: Decrease over time

lr = base_lr * (decay_rate ** (step / decay_steps))

Cosine annealing: Smooth decrease

lr = base_lr * (1 + math.cos(math.pi * step / total_steps)) / 2
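
In practice these pieces are often combined; a minimal sketch of a warm-up-then-cosine schedule (the function name and signature are my own, not a standard API):

import math

def lr_at(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up followed by cosine annealing to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1 + math.cos(math.pi * progress)) / 2

# e.g. with the SGD class above: optimizer.lr = lr_at(step, 0.1, 500, 10_000)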

Capstone Connection

RLHF training uses these same optimizers. When fine-tuning a model with human feedback:

# The same gradient descent, but now optimizing for human preference
optimizer.zero_grad()
reward = reward_model(response)
loss = -reward  # Maximize reward = minimize negative reward
loss.backward()
optimizer.step()

If the reward model is biased toward agreeable responses, gradient descent will push the model toward sycophancy—step by step, batch by batch.

Understanding optimization = understanding how sycophancy emerges from training.


🎓 Tyla's Exercise

  1. Derive why momentum dampens oscillations. (Hint: What happens when gradients alternate signs?)

  2. Prove that SGD with weight decay is equivalent to L2 regularization (adding $\frac{\lambda}{2}||\theta||^2$ to the loss).

  3. Why do we typically NOT apply weight decay to bias terms?


💻 Aaliyah's Exercise

Implement and compare optimizers:

def train_with_optimizer(model, optimizer_class, **kwargs):
    """
    Train on MNIST for 3 epochs.
    Return training loss curve and final accuracy.
    """
    pass

# Compare:
# 1. SGD(lr=0.1)
# 2. SGD(lr=0.1, momentum=0.9)
# 3. SGD(lr=0.1, momentum=0.9, weight_decay=0.01)

# Plot loss curves. Which converges fastest?
# Which achieves highest validation accuracy?

📚 Maneesha's Reflection

  1. The analogy of a "ball rolling down a hill" for momentum is intuitive but imperfect. Where does the analogy break down?

  2. Learning rate is often the most important hyperparameter. Why do you think finding the right one is so difficult?

  3. If training is "teaching the model," what does the learning rate represent pedagogically? What about momentum?