# Backpropagation: Computational Graphs
Every neural network is a graph of operations. Backpropagation computes gradients by traversing this graph backward.
## The Chain Rule
If $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$
For a deep network with hidden layers $h_1, h_2$, output $y$, and loss $L$, the same rule chains across layers:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}$$
We compute gradients by chaining local derivatives.
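A quick numeric sanity check of this chaining, using $g(x) = 3x$ and $f(u) = u^2$ as illustrative functions (these are not from the text above, just a minimal example):

```python
# Chain rule check for y = f(g(x)), with g(x) = 3x and f(u) = u**2.
def g(x):
    return 3 * x

def f(u):
    return u ** 2

def dy_dx(x):
    # dy/dg = 2 * g(x), dg/dx = 3: multiply the local derivatives
    return 2 * g(x) * 3

# Compare against a finite-difference estimate of the same derivative
x, eps = 2.0, 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
print(dy_dx(x), numeric)  # both ≈ 36.0
```

The finite-difference estimate agrees with the chained analytic derivative, which is exactly the check you will implement in the gradient-checking exercise below.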
## Computational Graphs

Every computation builds a graph:

```python
import torch as t

x = t.tensor([2.0], requires_grad=True)
y = t.tensor([3.0], requires_grad=True)
z = x * y        # Multiply node
w = z + x        # Add node
loss = w ** 2    # Square node
```
The graph:

```
x ──→ (*) ──→ (+) ──→ (²) ──→ loss
       ↑       ↑
y ─────┘       │
x ─────────────┘
```
PyTorch builds this graph automatically when `requires_grad=True`.
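A minimal sketch of how that tracking propagates (tensor values chosen only for illustration): any operation with at least one `requires_grad=True` input produces a tracked output.

```python
import torch as t

a = t.tensor([2.0])                       # leaf WITHOUT requires_grad
b = t.tensor([3.0], requires_grad=True)   # leaf WITH requires_grad
c = a * b

print(c.requires_grad)        # True: one tracked input is enough
print((a * 2).requires_grad)  # False: no tracked inputs, no graph
```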
## Forward and Backward

**Forward pass:** compute outputs from inputs.

```python
z = x * y        # z = 6
w = z + x        # w = 8
loss = w ** 2    # loss = 64
```

**Backward pass:** compute gradients from the loss back to the inputs.

```python
loss.backward()
print(x.grad)  # How does x affect loss?
print(y.grad)  # How does y affect loss?
```
## Local Gradients

Each operation has a simple local gradient:

| Operation | Forward | Backward |
|---|---|---|
| `c = a + b` | c = a + b | ∂c/∂a = 1, ∂c/∂b = 1 |
| `c = a * b` | c = a * b | ∂c/∂a = b, ∂c/∂b = a |
| `c = a ** 2` | c = a² | ∂c/∂a = 2a |
| `c = relu(a)` | c = max(0, a) | ∂c/∂a = 1 if a > 0 else 0 |
The backward pass multiplies these local gradients along paths.
## Manual Backprop Example

```python
# Forward
x = 2.0
y = 3.0
z = x * y          # 6
w = z + x          # 8
loss = w ** 2      # 64

# Backward
d_loss = 1.0                       # ∂loss/∂loss = 1
d_w = d_loss * 2 * w               # ∂loss/∂w = 2w = 16
d_z = d_w * 1                      # ∂w/∂z = 1 → 16
d_x_from_add = d_w * 1             # ∂w/∂x = 1 → 16
d_x_from_mul = d_z * y             # ∂z/∂x = y → 16 * 3 = 48
d_x = d_x_from_add + d_x_from_mul  # Sum over paths: 16 + 48 = 64
d_y = d_z * x                      # ∂z/∂y = x → 16 * 2 = 32
```
Verify with PyTorch:

```python
x = t.tensor([2.0], requires_grad=True)
y = t.tensor([3.0], requires_grad=True)
loss = (x * y + x) ** 2
loss.backward()
print(x.grad, y.grad)  # tensor([64.]) tensor([32.])
```
## The Backward Graph

PyTorch stores a `grad_fn` on each tensor:

```python
x = t.tensor([2.0], requires_grad=True)
z = x * 2
print(z.grad_fn)  # <MulBackward0 object ...>
w = z + x
print(w.grad_fn)  # <AddBackward0 object ...>
```
Each `grad_fn` knows:
- What operation created this tensor
- How to compute the local gradient
- What tensors to propagate gradients to
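You can see that last point directly: each `grad_fn` exposes a `next_functions` tuple linking it to the `grad_fn`s of its inputs, so the backward graph can be walked by hand. A small sketch, mirroring the snippet above:

```python
import torch as t

x = t.tensor([2.0], requires_grad=True)
z = x * 2
w = z + x

# next_functions links w's AddBackward0 to its inputs' backward nodes:
# typically MulBackward0 (for z) and AccumulateGrad (for the leaf x).
parents = [type(fn).__name__ for fn, _ in w.grad_fn.next_functions]
print(type(w.grad_fn).__name__, "<-", parents)
```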
## Gradient Accumulation

Gradients accumulate (add up) by default:

```python
x = t.tensor([1.0], requires_grad=True)
y = x ** 2
y.backward()
print(x.grad)  # tensor([2.])

z = x ** 3
z.backward()
print(x.grad)  # tensor([5.])  # 2 + 3 = 5 (accumulated!)
```

This is why we call `optimizer.zero_grad()` before each batch.
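A minimal sketch of that reset: clearing a parameter's `.grad` with `grad.zero_()` mimics what `optimizer.zero_grad()` does for each parameter (recent PyTorch versions may set grads to `None` instead of zeroing, but the effect on the next backward pass is the same).

```python
import torch as t

x = t.tensor([1.0], requires_grad=True)
(x ** 2).backward()
(x ** 3).backward()
print(x.grad)    # tensor([5.]) -- the two gradients added up

x.grad.zero_()   # reset, as optimizer.zero_grad() would
(x ** 2).backward()
print(x.grad)    # tensor([2.]) -- a fresh gradient
```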
## Detaching and No-Grad

Stop gradients from flowing:

```python
# detach(): returns a new tensor that shares data but has no grad history
y = x.detach()  # y doesn't track gradients

# torch.no_grad(): context manager for inference
with t.no_grad():
    y = model(x)  # No graph is built, which saves memory
```

Use `no_grad()` during:

- Inference/evaluation
- Weight updates (this is what happens inside `optimizer.step()`)
- Anything you don't want to differentiate through
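A small runnable sketch of the inference case, with a stand-in `t.nn.Linear` as the model (no model is defined in this section, so this name is illustrative):

```python
import torch as t

model = t.nn.Linear(4, 2)  # stand-in model, just for illustration
x = t.randn(1, 4)

with t.no_grad():
    y = model(x)

print(y.requires_grad)  # False: no graph was built
```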
## Capstone Connection

**Gradients in RLHF:** the gradient tells us, "to increase reward, adjust the weights in this direction."

```python
reward = reward_model(response)
loss = -reward  # We want to maximize reward
loss.backward()
# Now param.grad tells us how to increase reward
```
If the reward model rewards agreement, gradients push the model toward sycophancy. Understanding backprop helps you understand how these behaviors are trained into models.
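A runnable toy version of the snippet above, with `reward_model` and `response` replaced by illustrative stand-ins (a single linear layer and a random tensor, not a real reward model):

```python
import torch as t

# Toy stand-ins, chosen only so the snippet runs end to end
reward_model = t.nn.Linear(3, 1)
response = t.randn(1, 3)

reward = reward_model(response).sum()
loss = -reward  # maximizing reward == minimizing -reward
loss.backward()

# Each parameter's .grad now points toward lower -reward (higher reward)
# once the optimizer steps against the gradient.
print(reward_model.weight.grad.shape)
```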
## 🎓 Tyla's Exercise

1. Draw the computational graph for `loss = (x + y) * (x - y)`.
2. Manually compute ∂loss/∂x and ∂loss/∂y for x = 3, y = 2. Verify with PyTorch.
3. Why do gradients "accumulate" by default? Give a use case where this is actually desired.
## 💻 Aaliyah's Exercise

Implement gradient checking:

```python
def numerical_gradient(f, x, epsilon=1e-5):
    """
    Compute the gradient numerically using central finite differences:

        ∂f/∂x ≈ (f(x + ε) - f(x - ε)) / (2ε)

    f: function that takes x and returns a scalar
    x: tensor with requires_grad=False
    Returns: tensor of the same shape as x with numerical gradients
    """
    pass


def check_gradients(f, x, rtol=1e-4):
    """
    Compare analytical gradients (from .backward()) with numerical gradients.
    Raise an AssertionError if they differ by more than rtol.
    """
    pass


# Test on: f(x) = sum(x ** 3)
```
## 📚 Maneesha's Reflection

1. Automatic differentiation was invented decades before deep learning. Why did it take so long to become mainstream?
2. The computational graph is a "record of how we got here." How does this relate to the concept of provenance in data systems?
3. If you were explaining backpropagation to someone with no calculus background, what metaphor would you use?