# Backpropagation: Computational Graphs
Every neural network is a graph of operations. Backpropagation computes gradients by traversing this graph backward.
## The Chain Rule
If $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$
For a deep network with hidden layers $h_1, h_2$, output $y$, and loss $L$, the same rule chains across layers:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}$$
We compute gradients by chaining local derivatives.
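A quick numeric sanity check of this chaining, using $g(x) = 3x$ and $f(u) = u^2$ as illustrative functions (these are not from the text above, just a minimal example):

```python
# Chain rule check for y = f(g(x)), with g(x) = 3x and f(u) = u**2.
def g(x):
    return 3 * x

def f(u):
    return u ** 2

def dy_dx(x):
    # dy/dg = 2 * g(x), dg/dx = 3: multiply the local derivatives
    return 2 * g(x) * 3

# Compare against a finite-difference estimate of the same derivative
x, eps = 2.0, 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
print(dy_dx(x), numeric)  # both ≈ 36.0
```

The finite-difference estimate agrees with the chained analytic derivative, which is exactly the check you will implement in the gradient-checking exercise below.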
## Computational Graphs

Every computation builds a graph:

```python
import torch as t

x = t.tensor([2.0], requires_grad=True)
y = t.tensor([3.0], requires_grad=True)
z = x * y        # Multiply node
w = z + x        # Add node
loss = w ** 2    # Square node
```
The graph:

```
x ──→ (*) ──→ (+) ──→ (²) ──→ loss
       ↑       ↑
y ─────┘       │
x ─────────────┘
```
PyTorch builds this graph automatically when `requires_grad=True`.
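A minimal sketch of how that tracking propagates (tensor values chosen only for illustration): any operation with at least one `requires_grad=True` input produces a tracked output.

```python
import torch as t

a = t.tensor([2.0])                       # leaf WITHOUT requires_grad
b = t.tensor([3.0], requires_grad=True)   # leaf WITH requires_grad
c = a * b

print(c.requires_grad)        # True: one tracked input is enough
print((a * 2).requires_grad)  # False: no tracked inputs, no graph
```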
## Forward and Backward

**Forward pass:** compute outputs from inputs.

```python
z = x * y        # z = 6
w = z + x        # w = 8
loss = w ** 2    # loss = 64
```

**Backward pass:** compute gradients from the loss back to the inputs.

```python
loss.backward()
print(x.grad)  # How does x affect loss?
print(y.grad)  # How does y affect loss?
```
## Local Gradients

Each operation has a simple local gradient:

| Operation | Forward | Backward |
|---|---|---|
| `c = a + b` | c = a + b | ∂c/∂a = 1, ∂c/∂b = 1 |
| `c = a * b` | c = a * b | ∂c/∂a = b, ∂c/∂b = a |
| `c = a ** 2` | c = a² | ∂c/∂a = 2a |
| `c = relu(a)` | c = max(0, a) | ∂c/∂a = 1 if a > 0 else 0 |
The backward pass multiplies these local gradients along paths.
## Manual Backprop Example

```python
# Forward
x = 2.0
y = 3.0
z = x * y          # 6
w = z + x          # 8
loss = w ** 2      # 64

# Backward
d_loss = 1.0                       # ∂loss/∂loss = 1
d_w = d_loss * 2 * w               # ∂loss/∂w = 2w = 16
d_z = d_w * 1                      # ∂w/∂z = 1 → 16
d_x_from_add = d_w * 1             # ∂w/∂x = 1 → 16
d_x_from_mul = d_z * y             # ∂z/∂x = y → 16 * 3 = 48
d_x = d_x_from_add + d_x_from_mul  # Sum over paths: 16 + 48 = 64
d_y = d_z * x                      # ∂z/∂y = x → 16 * 2 = 32
```
Verify with PyTorch:

```python
x = t.tensor([2.0], requires_grad=True)
y = t.tensor([3.0], requires_grad=True)
loss = (x * y + x) ** 2
loss.backward()
print(x.grad, y.grad)  # tensor([64.]) tensor([32.])
```
## The Backward Graph

PyTorch stores a `grad_fn` on each tensor:

```python
x = t.tensor([2.0], requires_grad=True)
z = x * 2
print(z.grad_fn)  # <MulBackward0 object ...>
w = z + x
print(w.grad_fn)  # <AddBackward0 object ...>
```
Each `grad_fn` knows:
- What operation created this tensor
- How to compute the local gradient
- What tensors to propagate gradients to
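You can see that last point directly: each `grad_fn` exposes a `next_functions` tuple linking it to the `grad_fn`s of its inputs, so the backward graph can be walked by hand. A small sketch, mirroring the snippet above:

```python
import torch as t

x = t.tensor([2.0], requires_grad=True)
z = x * 2
w = z + x

# next_functions links w's AddBackward0 to its inputs' backward nodes:
# typically MulBackward0 (for z) and AccumulateGrad (for the leaf x).
parents = [type(fn).__name__ for fn, _ in w.grad_fn.next_functions]
print(type(w.grad_fn).__name__, "<-", parents)
```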
## Gradient Accumulation

Gradients accumulate (add up) by default:

```python
x = t.tensor([1.0], requires_grad=True)
y = x ** 2
y.backward()
print(x.grad)  # tensor([2.])

z = x ** 3
z.backward()
print(x.grad)  # tensor([5.])  # 2 + 3 = 5 (accumulated!)
```

This is why we call `optimizer.zero_grad()` before each batch.
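A minimal sketch of that reset: clearing a parameter's `.grad` with `grad.zero_()` mimics what `optimizer.zero_grad()` does for each parameter (recent PyTorch versions may set grads to `None` instead of zeroing, but the effect on the next backward pass is the same).

```python
import torch as t

x = t.tensor([1.0], requires_grad=True)
(x ** 2).backward()
(x ** 3).backward()
print(x.grad)    # tensor([5.]) -- the two gradients added up

x.grad.zero_()   # reset, as optimizer.zero_grad() would
(x ** 2).backward()
print(x.grad)    # tensor([2.]) -- a fresh gradient
```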
## Detaching and No-Grad

Stop gradients from flowing:

```python
# detach(): returns a new tensor that shares data but has no grad history
y = x.detach()  # y doesn't track gradients

# torch.no_grad(): context manager for inference
with t.no_grad():
    y = model(x)  # No graph is built, which saves memory
```

Use `no_grad()` during:

- Inference/evaluation
- Weight updates (this is what happens inside `optimizer.step()`)
- Anything you don't want to differentiate through
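A small runnable sketch of the inference case, with a stand-in `t.nn.Linear` as the model (no model is defined in this section, so this name is illustrative):

```python
import torch as t

model = t.nn.Linear(4, 2)  # stand-in model, just for illustration
x = t.randn(1, 4)

with t.no_grad():
    y = model(x)

print(y.requires_grad)  # False: no graph was built
```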
## Capstone Connection

**Gradients in RLHF:** the gradient tells us, "to increase reward, adjust the weights in this direction."

```python
reward = reward_model(response)
loss = -reward  # We want to maximize reward
loss.backward()
# Now param.grad tells us how to increase reward
```
If the reward model rewards agreement, gradients push the model toward sycophancy. Understanding backprop helps you understand how these behaviors are trained into models.
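A runnable toy version of the snippet above, with `reward_model` and `response` replaced by illustrative stand-ins (a single linear layer and a random tensor, not a real reward model):

```python
import torch as t

# Toy stand-ins, chosen only so the snippet runs end to end
reward_model = t.nn.Linear(3, 1)
response = t.randn(1, 3)

reward = reward_model(response).sum()
loss = -reward  # maximizing reward == minimizing -reward
loss.backward()

# Each parameter's .grad now points toward lower -reward (higher reward)
# once the optimizer steps against the gradient.
print(reward_model.weight.grad.shape)
```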
## 🎓 Tyla's Exercise

1. Draw the computational graph for `loss = (x + y) * (x - y)`.
2. Manually compute ∂loss/∂x and ∂loss/∂y for x = 3, y = 2. Verify with PyTorch.
3. Why do gradients "accumulate" by default? Give a use case where this is actually desired.
## 💻 Aaliyah's Exercise

Implement gradient checking:

```python
def numerical_gradient(f, x, epsilon=1e-5):
    """
    Compute the gradient numerically using central finite differences:

        ∂f/∂x ≈ (f(x + ε) - f(x - ε)) / (2ε)

    f: function that takes x and returns a scalar
    x: tensor with requires_grad=False
    Returns: tensor of the same shape as x with numerical gradients
    """
    pass


def check_gradients(f, x, rtol=1e-4):
    """
    Compare analytical gradients (from .backward()) with numerical gradients.
    Raise an AssertionError if they differ by more than rtol.
    """
    pass


# Test on: f(x) = sum(x ** 3)
```
## 📚 Maneesha's Reflection

1. Automatic differentiation was invented decades before deep learning. Why did it take so long to become mainstream?
2. The computational graph is a "record of how we got here." How does this relate to the concept of provenance in data systems?
3. If you were explaining backpropagation to someone with no calculus background, what metaphor would you use?