CNNs: Convolutions & Pooling
Convolutions exploit spatial structure. They're why deep learning conquered computer vision.
This chapter builds intuition for how convolutions work and why they matter.
The Problem with Fully Connected Layers
For a 224×224×3 image:
- Input: 150,528 features
- First hidden layer (1024 neurons): 154 million parameters
Problems:
- Memory: Too many parameters
- Overfitting: Model memorizes training data
- No spatial awareness: Adjacent pixels treated the same as distant ones
The Convolution Operation
A convolution slides a small kernel (filter) across the image:
Input (5×5):        Kernel (3×3):      Output (3×3):
[1 2 3 4 5]         [1 0 1]            [? ? ?]
[2 3 4 5 6]     *   [0 1 0]      →     [? ? ?]
[3 4 5 6 7]         [1 0 1]            [? ? ?]
[4 5 6 7 8]
[5 6 7 8 9]
At each position, we compute: $\text{output} = \sum_{i,j} \text{input}_{i,j} \cdot \text{kernel}_{i,j}$
# Top-left output element:
output[0, 0] = (1*1 + 2*0 + 3*1 +
                2*0 + 3*1 + 4*0 +
                3*1 + 4*0 + 5*1)
# = 1 + 3 + 3 + 3 + 5 = 15
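A quick way to check this arithmetic is to run the same example through F.conv2d (a minimal sketch; it just reproduces the diagram above):

import torch as t
import torch.nn.functional as F

# The 5×5 input and 3×3 kernel from the diagram above.
x = t.tensor([[1., 2., 3., 4., 5.],
              [2., 3., 4., 5., 6.],
              [3., 4., 5., 6., 7.],
              [4., 5., 6., 7., 8.],
              [5., 6., 7., 8., 9.]])
kernel = t.tensor([[1., 0., 1.],
                   [0., 1., 0.],
                   [1., 0., 1.]])

# conv2d expects (batch, channels, H, W) inputs and (out_ch, in_ch, kH, kW) weights.
out = F.conv2d(x[None, None], kernel[None, None])
print(out[0, 0])
# tensor([[15., 20., 25.],
#         [20., 25., 30.],
#         [25., 30., 35.]])

(Deep-learning "convolution" is technically cross-correlation: the kernel is not flipped, which matches the element-wise sum above.)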
Why Convolutions Work
Parameter sharing: Same kernel applied everywhere → far fewer parameters
Translation equivariance: If the input shifts, the output shifts the same way
Local connectivity: Each output depends only on a local region of input
# Fully connected: 150,528 × 1024 = 154M parameters
# Convolutional: 3 × 3 × 3 × 64 = 1,728 parameters (for 64 filters)
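To see the gap concretely, here is a rough parameter count using stock PyTorch layers (the exact totals also include bias terms):

import torch.nn as nn

fc = nn.Linear(224 * 224 * 3, 1024)       # fully connected on a flattened image
conv = nn.Conv2d(3, 64, kernel_size=3)    # 64 filters of shape 3×3×3

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print(f"{n_params(fc):,}")    # 154,141,696  (150,528 × 1024 weights + 1,024 biases)
print(f"{n_params(conv):,}")  # 1,792        (1,728 weights + 64 biases)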
Implementing Conv2d
import numpy as np
import torch as t
import torch.nn as nn
import torch.nn.functional as F


class Conv2d(nn.Module):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int,
        stride: int = 1,
        padding: int = 0,
    ):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        # Shape: (out_channels, in_channels, kernel_size, kernel_size)
        # Uniform init in [-k, k] with k = 1/sqrt(fan_in).
        k = 1 / np.sqrt(in_channels * kernel_size * kernel_size)
        weight = k * (2 * t.rand(out_channels, in_channels, kernel_size, kernel_size) - 1)
        self.weight = nn.Parameter(weight)

    def forward(self, x: t.Tensor) -> t.Tensor:
        # x: (batch, in_channels, height, width)
        return F.conv2d(x, self.weight, stride=self.stride, padding=self.padding)
In practice, we delegate the actual sliding-window computation to F.conv2d, because efficient convolution implementations are complex (im2col + matrix multiply, FFT-based, Winograd, etc.).
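A quick sanity check of the module above (a sketch, assuming the class is defined exactly as written; note it has no bias term):

conv = Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
x = t.rand(8, 3, 32, 32)
print(conv(x).shape)  # torch.Size([8, 64, 32, 32])

# Against the reference implementation, sharing the same weights:
ref = nn.Conv2d(3, 64, 3, stride=1, padding=1, bias=False)
ref.weight = conv.weight
print(t.allclose(conv(x), ref(x)))  # True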
Output Size Formula
For input size $H$, kernel size $K$, stride $S$, padding $P$:
$$H_{out} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1$$
Common patterns:
- Same padding: $P = (K-1)/2$ keeps output size = input size (for odd K, stride 1)
- Stride 2: Halves the spatial dimensions
- No padding: Output shrinks by $K-1$ (at stride 1)
# Input: (batch, 3, 32, 32)
# Conv with kernel=3, stride=1, padding=1
# Output: (batch, 64, 32, 32) # Same spatial size
# Conv with kernel=3, stride=2, padding=1
# Output: (batch, 64, 16, 16) # Halved
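The formula is easy to encode directly. Below is a small helper (the name conv_output_size is just for illustration) cross-checked against a real layer:

import torch as t
import torch.nn as nn

def conv_output_size(h: int, k: int, s: int = 1, p: int = 0) -> int:
    """floor((H + 2P - K) / S) + 1"""
    return (h + 2 * p - k) // s + 1

print(conv_output_size(32, k=3, s=1, p=1))  # 32 (same padding)
print(conv_output_size(32, k=3, s=2, p=1))  # 16 (halved)

# Cross-check against an actual layer:
x = t.rand(1, 3, 32, 32)
print(nn.Conv2d(3, 64, 3, stride=2, padding=1)(x).shape)  # torch.Size([1, 64, 16, 16])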
Max Pooling
Pooling reduces spatial dimensions by taking the maximum (or average) over regions:
Input (4×4):        MaxPool (2×2, stride=2):       Output (2×2):
[ 1  2  3  4]                                      [ 6  8]
[ 5  6  7  8]    →  max over each 2×2 region  →    [14 16]
[ 9 10 11 12]
[13 14 15 16]
class MaxPool2d(nn.Module):
    def __init__(self, kernel_size: int, stride: int | None = None):
        super().__init__()
        self.kernel_size = kernel_size
        # Like nn.MaxPool2d, default the stride to the kernel size.
        self.stride = stride if stride is not None else kernel_size

    def forward(self, x: t.Tensor) -> t.Tensor:
        return F.max_pool2d(x, self.kernel_size, self.stride)
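Running the 4×4 example from the diagram through this module reproduces the expected output (a minimal check, assuming the class above is in scope):

x = t.arange(1, 17, dtype=t.float32).reshape(1, 1, 4, 4)
pool = MaxPool2d(kernel_size=2)
print(pool(x)[0, 0])
# tensor([[ 6.,  8.],
#         [14., 16.]])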
A Simple CNN
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_layers = nn.Sequential(
            # (batch, 1, 28, 28) → (batch, 32, 14, 14)
            Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            ReLU(),
            MaxPool2d(2),
            # (batch, 32, 14, 14) → (batch, 64, 7, 7)
            Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            ReLU(),
            MaxPool2d(2),
        )
        self.fc_layers = nn.Sequential(
            nn.Flatten(),  # (batch, 64*7*7)
            Linear(64 * 7 * 7, 128),
            ReLU(),
            Linear(128, 10),
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = self.fc_layers(x)
        return x
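A shape check on MNIST-sized input (a sketch; it assumes the ReLU and Linear modules from earlier chapters are in scope):

model = SimpleCNN()
x = t.rand(16, 1, 28, 28)  # a batch of 16 grayscale 28×28 images
logits = model(x)
print(logits.shape)        # torch.Size([16, 10]): one logit per digit class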
Capstone Connection
Convolutions in Vision Transformers:
Modern vision models (ViT, etc.) often still use convolutions for:
- Patch embedding (a single strided conv)
- Efficient local processing
# ViT patch embedding is just a strided convolution!
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
# (batch, 3, 224, 224) → (batch, 768, 14, 14)
# = 196 patches of dimension 768
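To turn that conv output into a token sequence, ViT-style models typically flatten the 14×14 grid and move the channel dimension last (a sketch in plain torch):

import torch as t
import torch.nn as nn

patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
x = t.rand(2, 3, 224, 224)
grid = patch_embed(x)                     # (2, 768, 14, 14)
tokens = grid.flatten(2).transpose(1, 2)  # (2, 196, 768): 196 patch tokens
print(tokens.shape)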
Understanding convolutions shows you how visual features are encoded, which is relevant when analyzing multimodal models for sycophancy.
🎓 Tyla's Exercise
Derive the output size formula. Why is there a floor operation?
How many parameters in a Conv2d(32, 64, kernel_size=3) layer? Include the bias.
Prove that convolutions are equivariant to translation: if $T$ is a translation operator, then $\text{Conv}(T(x)) = T(\text{Conv}(x))$.
💻 Aaliyah's Exercise
Build a CNN for MNIST:
class MNISTConvNet(nn.Module):
    """
    Target: >98% accuracy on MNIST

    Suggested architecture:
    - Conv2d(1, 32, 3, padding=1) + ReLU + MaxPool(2)
    - Conv2d(32, 64, 3, padding=1) + ReLU + MaxPool(2)
    - Flatten + Linear(64*7*7, 256) + ReLU + Linear(256, 10)
    """
    pass
# Compare training curves: MLP vs CNN
# Which converges faster? Which achieves higher accuracy?
📚 Maneesha's Reflection
Convolutions encode the assumption "nearby pixels are related." When might this assumption hurt?
The progression from fully-connected → CNN → attention → state space models represents evolving assumptions about data structure. What's the trend?
How would you explain convolutions to someone who only understands spreadsheet formulas?