Optimization: Adam & RMSprop
Modern optimizers adapt the learning rate for each parameter. Adam is the default for good reason.
The Problem with a Global Learning Rate
Different parameters need different learning rates:
- Rare features: Need larger updates when they appear
- Common features: Need smaller, stable updates
- Different layers: Different gradient scales
One learning rate doesn't fit all.
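One way to see this concretely (a quick sketch; the two-layer network and random data here are invented for illustration):

import torch as t
import torch.nn as nn

t.manual_seed(0)
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = t.randn(32, 100), t.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient norms often differ substantially across layers
# and between weights and biases
for name, p in model.named_parameters():
    print(f"{name:10s} grad norm = {p.grad.norm().item():.4f}")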
RMSprop: Adaptive Learning Rates
Track the running average of squared gradients:
$$v_t = \beta v_{t-1} + (1-\beta) g_t^2$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} g_t$$
Where $g_t = \nabla_\theta L$ is the gradient.
Key insight: Parameters with consistently large gradients get smaller effective learning rates. Parameters with small gradients get larger updates.
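A toy calculation makes this concrete (the gradient values 10 and 0.1 are invented; think of two parameters that keep receiving gradients of those sizes):

import torch as t

eta, beta, eps = 0.01, 0.99, 1e-8
g = t.tensor([10.0, 0.1])          # one parameter with large gradients, one with small
v = t.zeros(2)

for _ in range(100):               # feed the same gradients repeatedly
    v = beta * v + (1 - beta) * g ** 2

print(eta / (v.sqrt() + eps))      # effective learning rates: ~0.0013 vs ~0.13
print(eta / (v.sqrt() + eps) * g)  # per-step updates end up roughly equal in size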
import torch as t

class RMSprop:
    def __init__(self, params, lr=0.01, beta=0.99, eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.v = [t.zeros_like(p) for p in self.params]

    def step(self):
        with t.no_grad():
            for p, v in zip(self.params, self.v):
                if p.grad is not None:
                    # Update running average of squared gradients
                    v.mul_(self.beta).addcmul_(p.grad, p.grad, value=1 - self.beta)
                    # Update parameters
                    p.addcdiv_(p.grad, v.sqrt() + self.eps, value=-self.lr)
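A usage sketch for the class above, minimizing an invented toy quadratic:

w = t.nn.Parameter(t.tensor([5.0, -3.0]))
opt = RMSprop([w], lr=0.1)

for _ in range(500):
    loss = (w ** 2).sum()          # minimum at w = 0
    loss.backward()
    opt.step()
    w.grad = None                  # this minimal class has no zero_grad()

print(w.data, loss.item())         # w should end up near zero (small oscillation is normal)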
Adam: Best of Both Worlds
Adam combines momentum AND adaptive learning rates:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(momentum)}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(RMSprop)}$$
With bias correction (because $m_0 = v_0 = 0$): $$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
Update: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
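Before implementing it, a quick scalar check (values chosen arbitrarily) shows what the combination buys you: after bias correction, the very first step has magnitude roughly $\eta$ regardless of the gradient's scale.

import math

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for g in (100.0, 0.01):                  # wildly different gradient scales
    m = (1 - beta1) * g                  # m_1
    v = (1 - beta2) * g ** 2             # v_1
    m_hat = m / (1 - beta1 ** 1)         # equals g
    v_hat = v / (1 - beta2 ** 1)         # equals g**2
    print(g, lr * m_hat / (math.sqrt(v_hat) + eps))   # ~0.001 either way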
Implementing Adam
class Adam:
    def __init__(
        self,
        params,
        lr: float = 0.001,
        betas: tuple = (0.9, 0.999),
        eps: float = 1e-8,
        weight_decay: float = 0.0,
    ):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        # State
        self.m = [t.zeros_like(p) for p in self.params]
        self.v = [t.zeros_like(p) for p in self.params]
        self.t = 0

    def step(self):
        self.t += 1
        with t.no_grad():
            for p, m, v in zip(self.params, self.m, self.v):
                if p.grad is None:
                    continue
                g = p.grad
                if self.weight_decay > 0:
                    g = g + self.weight_decay * p
                # Update biased first moment
                m.mul_(self.beta1).add_(g, alpha=1 - self.beta1)
                # Update biased second moment
                v.mul_(self.beta2).addcmul_(g, g, value=1 - self.beta2)
                # Bias correction
                m_hat = m / (1 - self.beta1 ** self.t)
                v_hat = v / (1 - self.beta2 ** self.t)
                # Update parameters
                p.addcdiv_(m_hat, v_hat.sqrt() + self.eps, value=-self.lr)
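A sanity check, assuming the Adam class above with weight_decay=0 and an invented least-squares toy problem; with matching hyperparameters it should track torch.optim.Adam closely:

t.manual_seed(0)
w1 = t.nn.Parameter(t.randn(10))
w2 = t.nn.Parameter(w1.detach().clone())
target = t.randn(10)

ours = Adam([w1], lr=1e-3)
ref = t.optim.Adam([w2], lr=1e-3)

for _ in range(100):
    for w, opt in ((w1, ours), (w2, ref)):
        loss = ((w - target) ** 2).sum()
        loss.backward()
        opt.step()
        w.grad = None

print((w1 - w2).abs().max().item())      # should be tiny (floating-point noise only)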
Why Bias Correction?
At $t=1$:
- $m_1 = (1-\beta_1) g_1$ — much smaller than $g_1$ because $\beta_1 \approx 0.9$
- Without correction, early updates would be too small
The correction factor $\frac{1}{1-\beta^t}$ compensates:
- At $t=1$: $\frac{1}{1-0.9} = 10$, a large boost
- As $t \to \infty$: $\beta^t \to 0$, so the factor approaches $1$ and the correction fades away
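A quick way to see how long each correction actually matters, using the defaults $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (a small standalone check, separate from the optimizer code):

for beta in (0.9, 0.999):
    factors = [1 / (1 - beta ** step) for step in (1, 10, 100, 1000)]
    print(beta, [round(f, 2) for f in factors])
# 0.9   -> [10.0, 1.54, 1.0, 1.0]
# 0.999 -> [1000.0, 100.45, 10.5, 1.58]

The first-moment correction is essentially gone after about fifty steps, but the second-moment correction ($\beta_2 = 0.999$) still matters after a thousand steps.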
Default Hyperparameters
These work surprisingly well across many tasks:
optimizer = Adam(
    model.parameters(),
    lr=0.001,             # Lower than typical SGD learning rates
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,    # typical value; see the AdamW section below for decoupled decay
)
Typical adjustments:
- Learning rate: Often the only thing you tune
- $\beta_1$: Lower (0.5-0.9) for noisy gradients
- Weight decay: Important for generalization
AdamW: Decoupled Weight Decay
Standard Adam adds the weight-decay term to the gradient before the adaptive scaling, which makes it equivalent to L2 regularization and couples the decay strength to the per-parameter learning rates.
AdamW decouples the two, applying the decay directly to the weights as part of the update:
# Adam-style (coupled) weight decay: the decay term gets rescaled
# by the adaptive denominator along with the gradient
g = g + weight_decay * p
p -= lr * m_hat / (sqrt(v_hat) + eps)

# AdamW (decoupled): the decay acts on the weights directly
p -= lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * p)
AdamW often works better in practice. PyTorch provides both.
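As a code-level sketch of the difference, here is how step() of the Adam class above could be turned into a decoupled version (the subclass is illustrative and mirrors the decoupled update, not PyTorch's internals):

class AdamW(Adam):
    def step(self):
        self.t += 1
        with t.no_grad():
            for p, m, v in zip(self.params, self.m, self.v):
                if p.grad is None:
                    continue
                g = p.grad
                # Decoupled weight decay: shrink the weights directly,
                # instead of adding weight_decay * p to the gradient
                p.mul_(1 - self.lr * self.weight_decay)
                m.mul_(self.beta1).add_(g, alpha=1 - self.beta1)
                v.mul_(self.beta2).addcmul_(g, g, value=1 - self.beta2)
                m_hat = m / (1 - self.beta1 ** self.t)
                v_hat = v / (1 - self.beta2 ** self.t)
                p.addcdiv_(m_hat, v_hat.sqrt() + self.eps, value=-self.lr)

The only functional change from the coupled version is that the decay term never passes through the $1/(\sqrt{\hat{v}_t} + \epsilon)$ scaling.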
Comparing Optimizers
# Task: train a ResNet on CIFAR-10 (params = model.parameters()),
# using PyTorch's built-in implementations
from torch.optim import SGD, Adam, AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# SGD: needs careful tuning but can achieve the best final results
optimizer = SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=200)

# Adam: works out of the box
optimizer = Adam(params, lr=1e-3, weight_decay=1e-2)

# For transformers: AdamW is standard
optimizer = AdamW(params, lr=3e-4, betas=(0.9, 0.98), weight_decay=0.01)
Capstone Connection
Adam in RLHF:
Fine-tuning language models in an RLHF pipeline typically uses AdamW with settings like these:
# GPT-style training
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,              # Much lower than pre-training
    betas=(0.9, 0.95),    # Often lower β2 for RL
    weight_decay=0.1,
)
The adaptive learning rate means:
- Parameters that strongly correlate with reward get smaller updates
- Rarely-activated parameters get larger updates when they do fire
This can amplify sycophancy if agreeable features are consistently rewarded.
🎓 Tyla's Exercise
1. Why does Adam use $\sqrt{v}$ instead of just $v$ in the denominator?
2. Prove that the bias correction formula is correct. (Hint: Expand $m_t$ as a function of all past gradients $g_1, ..., g_t$, and compute its expected value.)
3. What happens if $\epsilon = 0$? When would this cause problems?
💻 Aaliyah's Exercise
Compare optimizer behavior:
def visualize_optimization(loss_fn, optimizers, steps=100):
    """
    Given a 2D loss function, visualize the path each optimizer takes.

    loss_fn: takes (x, y) tensors, returns a scalar loss
    optimizers: list of (name, optimizer_class, kwargs)

    Plot the loss surface as a contour plot and overlay the path of
    each optimizer.
    """
    pass

# Test on the Rosenbrock function:
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2
# Which optimizer finds the minimum fastest?
📚 Maneesha's Reflection
1. Adam was published in 2014 and is still the default for most tasks. Why has it been so hard to improve upon?
2. The adaptive learning rate idea can be seen as "personalized instruction rates for each parameter." How would you design a human learning system with this principle?
3. "Just use Adam with lr=0.001" is often good advice. What does this tell us about the relationship between simplicity and robustness in algorithm design?