Optimization: Adam & RMSprop
Modern optimizers adapt the learning rate for each parameter. Adam is the default for good reason.
The Problem with a Global Learning Rate
Different parameters need different learning rates:
- Rare features: Need larger updates when they appear
- Common features: Need smaller, stable updates
- Different layers: Different gradient scales
One learning rate doesn't fit all.
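One way to see this concretely (a quick sketch; the two-layer network and random data here are invented for illustration):

import torch as t
import torch.nn as nn

t.manual_seed(0)
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = t.randn(32, 100), t.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient norms often differ substantially across layers
# and between weights and biases
for name, p in model.named_parameters():
    print(f"{name:10s} grad norm = {p.grad.norm().item():.4f}")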
RMSprop: Adaptive Learning Rates
Track the running average of squared gradients:
$$v_t = \beta v_{t-1} + (1-\beta) g_t^2$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} g_t$$
Where $g_t = \nabla_\theta L$ is the gradient.
Key insight: Parameters with consistently large gradients get smaller effective learning rates. Parameters with small gradients get larger updates.
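A toy calculation makes this concrete (the gradient values 10 and 0.1 are invented; think of two parameters that keep receiving gradients of those sizes):

import torch as t

eta, beta, eps = 0.01, 0.99, 1e-8
g = t.tensor([10.0, 0.1])          # one parameter with large gradients, one with small
v = t.zeros(2)

for _ in range(100):               # feed the same gradients repeatedly
    v = beta * v + (1 - beta) * g ** 2

print(eta / (v.sqrt() + eps))      # effective learning rates: ~0.0013 vs ~0.13
print(eta / (v.sqrt() + eps) * g)  # per-step updates end up roughly equal in size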
import torch as t

class RMSprop:
    def __init__(self, params, lr=0.01, beta=0.99, eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.v = [t.zeros_like(p) for p in self.params]

    def step(self):
        with t.no_grad():
            for p, v in zip(self.params, self.v):
                if p.grad is not None:
                    # Update running average of squared gradients
                    v.mul_(self.beta).addcmul_(p.grad, p.grad, value=1 - self.beta)
                    # Update parameters
                    p.addcdiv_(p.grad, v.sqrt() + self.eps, value=-self.lr)
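A usage sketch for the class above, minimizing an invented toy quadratic:

w = t.nn.Parameter(t.tensor([5.0, -3.0]))
opt = RMSprop([w], lr=0.1)

for _ in range(500):
    loss = (w ** 2).sum()          # minimum at w = 0
    loss.backward()
    opt.step()
    w.grad = None                  # this minimal class has no zero_grad()

print(w.data, loss.item())         # w should end up near zero (small oscillation is normal)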
Adam: Best of Both Worlds
Adam combines momentum AND adaptive learning rates:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(momentum)}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(RMSprop)}$$
With bias correction (because $m_0 = v_0 = 0$): $$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
Update: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
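Before implementing it, a quick scalar check (values chosen arbitrarily) shows what the combination buys you: after bias correction, the very first step has magnitude roughly $\eta$ regardless of the gradient's scale.

import math

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for g in (100.0, 0.01):                  # wildly different gradient scales
    m = (1 - beta1) * g                  # m_1
    v = (1 - beta2) * g ** 2             # v_1
    m_hat = m / (1 - beta1 ** 1)         # equals g
    v_hat = v / (1 - beta2 ** 1)         # equals g**2
    print(g, lr * m_hat / (math.sqrt(v_hat) + eps))   # ~0.001 either way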
Implementing Adam
class Adam:
    def __init__(
        self,
        params,
        lr: float = 0.001,
        betas: tuple = (0.9, 0.999),
        eps: float = 1e-8,
        weight_decay: float = 0.0,
    ):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        # State
        self.m = [t.zeros_like(p) for p in self.params]
        self.v = [t.zeros_like(p) for p in self.params]
        self.t = 0

    def step(self):
        self.t += 1
        with t.no_grad():
            for p, m, v in zip(self.params, self.m, self.v):
                if p.grad is None:
                    continue
                g = p.grad
                if self.weight_decay > 0:
                    g = g + self.weight_decay * p
                # Update biased first moment
                m.mul_(self.beta1).add_(g, alpha=1 - self.beta1)
                # Update biased second moment
                v.mul_(self.beta2).addcmul_(g, g, value=1 - self.beta2)
                # Bias correction
                m_hat = m / (1 - self.beta1 ** self.t)
                v_hat = v / (1 - self.beta2 ** self.t)
                # Update parameters
                p.addcdiv_(m_hat, v_hat.sqrt() + self.eps, value=-self.lr)
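A sanity check, assuming the Adam class above with weight_decay=0 and an invented least-squares toy problem; with matching hyperparameters it should track torch.optim.Adam closely:

t.manual_seed(0)
w1 = t.nn.Parameter(t.randn(10))
w2 = t.nn.Parameter(w1.detach().clone())
target = t.randn(10)

ours = Adam([w1], lr=1e-3)
ref = t.optim.Adam([w2], lr=1e-3)

for _ in range(100):
    for w, opt in ((w1, ours), (w2, ref)):
        loss = ((w - target) ** 2).sum()
        loss.backward()
        opt.step()
        w.grad = None

print((w1 - w2).abs().max().item())      # should be tiny (floating-point noise only)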
Why Bias Correction?
At $t=1$:
- $m_1 = (1-\beta_1) g_1$ — much smaller than $g_1$ because $\beta_1 \approx 0.9$
- Without correction, early updates would be too small
The correction factor $\frac{1}{1-\beta^t}$ compensates:
- At $t=1$: $\frac{1}{1-0.9} = 10$, a large boost
- As $t \to \infty$: $\beta^t \to 0$, so the factor approaches $1$ and the correction fades away
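A quick way to see how long each correction actually matters, using the defaults $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (a small standalone check, separate from the optimizer code):

for beta in (0.9, 0.999):
    factors = [1 / (1 - beta ** step) for step in (1, 10, 100, 1000)]
    print(beta, [round(f, 2) for f in factors])
# 0.9   -> [10.0, 1.54, 1.0, 1.0]
# 0.999 -> [1000.0, 100.45, 10.5, 1.58]

The first-moment correction is essentially gone after about fifty steps, but the second-moment correction ($\beta_2 = 0.999$) still matters after a thousand steps.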
Default Hyperparameters
These work surprisingly well across many tasks:
optimizer = Adam(
    model.parameters(),
    lr=0.001,             # Lower than typical SGD learning rates
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,    # typical value; see the AdamW section below for decoupled decay
)
Typical adjustments:
- Learning rate: Often the only thing you tune
- $\beta_1$: Lower (0.5-0.9) for noisy gradients
- Weight decay: Important for generalization
AdamW: Decoupled Weight Decay
Standard Adam adds the weight-decay term to the gradient before the adaptive scaling, which makes it equivalent to L2 regularization and couples the decay strength to the per-parameter learning rates.
AdamW decouples the two, applying the decay directly to the weights as part of the update:
# Adam-style (coupled) weight decay: the decay term gets rescaled
# by the adaptive denominator along with the gradient
g = g + weight_decay * p
p -= lr * m_hat / (sqrt(v_hat) + eps)

# AdamW (decoupled): the decay acts on the weights directly
p -= lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * p)
AdamW often works better in practice. PyTorch provides both.
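As a code-level sketch of the difference, here is how step() of the Adam class above could be turned into a decoupled version (the subclass is illustrative and mirrors the decoupled update, not PyTorch's internals):

class AdamW(Adam):
    def step(self):
        self.t += 1
        with t.no_grad():
            for p, m, v in zip(self.params, self.m, self.v):
                if p.grad is None:
                    continue
                g = p.grad
                # Decoupled weight decay: shrink the weights directly,
                # instead of adding weight_decay * p to the gradient
                p.mul_(1 - self.lr * self.weight_decay)
                m.mul_(self.beta1).add_(g, alpha=1 - self.beta1)
                v.mul_(self.beta2).addcmul_(g, g, value=1 - self.beta2)
                m_hat = m / (1 - self.beta1 ** self.t)
                v_hat = v / (1 - self.beta2 ** self.t)
                p.addcdiv_(m_hat, v_hat.sqrt() + self.eps, value=-self.lr)

The only functional change from the coupled version is that the decay term never passes through the $1/(\sqrt{\hat{v}_t} + \epsilon)$ scaling.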
Comparing Optimizers
# Task: train a ResNet on CIFAR-10 (params = model.parameters()),
# using PyTorch's built-in implementations
from torch.optim import SGD, Adam, AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# SGD: needs careful tuning but can achieve the best final results
optimizer = SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=200)

# Adam: works out of the box
optimizer = Adam(params, lr=1e-3, weight_decay=1e-2)

# For transformers: AdamW is standard
optimizer = AdamW(params, lr=3e-4, betas=(0.9, 0.98), weight_decay=0.01)
Capstone Connection
Adam in RLHF:
Fine-tuning language models in an RLHF pipeline typically uses AdamW with settings like these:
# GPT-style training
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,              # Much lower than pre-training
    betas=(0.9, 0.95),    # Often lower β2 for RL
    weight_decay=0.1,
)
The adaptive learning rate means:
- Parameters that strongly correlate with reward get smaller updates
- Rarely-activated parameters get larger updates when they do fire
This can amplify sycophancy if agreeable features are consistently rewarded.
🎓 Tyla's Exercise
1. Why does Adam use $\sqrt{v}$ instead of just $v$ in the denominator?
2. Prove that the bias correction formula is correct. (Hint: Expand $m_t$ as a function of all past gradients $g_1, ..., g_t$, and compute its expected value.)
3. What happens if $\epsilon = 0$? When would this cause problems?
💻 Aaliyah's Exercise
Compare optimizer behavior:
def visualize_optimization(loss_fn, optimizers, steps=100):
    """
    Given a 2D loss function, visualize the path each optimizer takes.

    loss_fn: takes (x, y) tensors, returns a scalar loss
    optimizers: list of (name, optimizer_class, kwargs)

    Plot the loss surface as a contour plot and overlay the path of
    each optimizer.
    """
    pass

# Test on the Rosenbrock function:
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2
# Which optimizer finds the minimum fastest?
📚 Maneesha's Reflection
1. Adam was published in 2014 and is still the default for most tasks. Why has it been so hard to improve upon?
2. The adaptive learning rate idea can be seen as "personalized instruction rates for each parameter." How would you design a human learning system with this principle?
3. "Just use Adam with lr=0.001" is often good advice. What does this tell us about the relationship between simplicity and robustness in algorithm design?