CNNs: Convolutions & Pooling

Convolutions exploit spatial structure. They're why deep learning conquered computer vision.

This chapter builds intuition for how convolutions work and why they matter.


The Problem with Fully Connected Layers

For a 224×224×3 image, flattening gives 150,528 input features; a single fully connected layer with just 1,024 hidden units already needs roughly 154 million weights (the sketch after the list checks the arithmetic).

Problems:

  1. Memory: far too many parameters to store and train
  2. Overfitting: with that much capacity, the model memorizes the training data instead of generalizing
  3. No spatial awareness: adjacent pixels are treated no differently from distant ones
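A minimal sketch of that parameter count, assuming PyTorch's nn.Linear and the 1,024-unit hidden layer mentioned above:

import torch.nn as nn

# One fully connected layer on a flattened 224×224×3 image
fc = nn.Linear(224 * 224 * 3, 1024)
print(f"{sum(p.numel() for p in fc.parameters()):,}")  # 154,141,696 (weights + biases)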

The Convolution Operation

A convolution slides a small kernel (filter) across the image:

Input (5×5):          Kernel (3×3):        Output (3×3):
[1 2 3 4 5]           [1 0 1]              [? ? ?]
[2 3 4 5 6]     *     [0 1 0]      →       [? ? ?]
[3 4 5 6 7]           [1 0 1]              [? ? ?]
[4 5 6 7 8]
[5 6 7 8 9]

At each output position $(m, n)$, we compute: $\text{output}_{m,n} = \sum_{i,j} \text{input}_{m+i,\,n+j} \cdot \text{kernel}_{i,j}$

# Top-left output element:
output[0, 0] = (1*1 + 2*0 + 3*1 +
                2*0 + 3*1 + 4*0 +
                3*1 + 4*0 + 5*1)
# = 1 + 3 + 3 + 3 + 5 = 15
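The same computation can be reproduced with F.conv2d. One caveat: like most deep learning libraries, PyTorch implements cross-correlation (no kernel flip), which matches the sliding-window arithmetic above; here the kernel is symmetric anyway. A minimal sketch:

import torch as t
import torch.nn.functional as F

# The 5×5 input and 3×3 kernel from the diagram above
inp = (t.arange(1, 6).view(1, 5) + t.arange(0, 5).view(5, 1)).float()
kernel = t.tensor([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]])

# F.conv2d expects (batch, channels, H, W) and (out_channels, in_channels, kH, kW)
out = F.conv2d(inp.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3))
print(out[0, 0])
# tensor([[15., 20., 25.],
#         [20., 25., 30.],
#         [25., 30., 35.]])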

Why Convolutions Work

Parameter sharing: Same kernel applied everywhere → far fewer parameters

Translation equivariance: If the input shifts, the output shifts the same way (see the sketch at the end of this section)

Local connectivity: Each output depends only on a local region of input

# Fully connected: 150,528 × 1024 = 154M parameters
# Convolutional: 3 × 3 × 3 × 64 = 1,728 parameters (for 64 filters)
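Translation equivariance is also easy to check empirically. A small sketch, using circular padding so that a wrapped shift commutes with the convolution exactly (with ordinary zero padding the equality holds everywhere except near the borders):

import torch as t
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = t.randn(1, 1, 8, 8)
shift = dict(shifts=(2, 3), dims=(2, 3))

# Convolving the shifted input equals shifting the convolved output
print(t.allclose(conv(t.roll(x, **shift)), t.roll(conv(x), **shift), atol=1e-6))  # True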

Implementing Conv2d

class Conv2d(nn.Module):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int,
        stride: int = 1,
        padding: int = 0,
    ):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

        # Shape: (out_channels, in_channels, kernel_size, kernel_size)
        k = 1 / np.sqrt(in_channels * kernel_size * kernel_size)
        weight = k * (2 * t.rand(out_channels, in_channels, kernel_size, kernel_size) - 1)
        self.weight = nn.Parameter(weight)

    def forward(self, x: t.Tensor) -> t.Tensor:
        # x: (batch, in_channels, height, width)
        return F.conv2d(x, self.weight, stride=self.stride, padding=self.padding)

In practice, we delegate the actual computation to F.conv2d, because implementing convolution efficiently is complex (im2col plus matrix multiplication, FFTs, Winograd, etc.).
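A quick shape check of the module above (a sketch, assuming the chapter's usual import of torch as t):

conv = Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
x = t.randn(8, 3, 32, 32)
print(conv(x).shape)  # torch.Size([8, 64, 32, 32])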


Output Size Formula

For input size $H$, kernel size $K$, stride $S$, padding $P$:

$$H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1$$

Common patterns:

# Input: (batch, 3, 32, 32)
# Conv with kernel=3, stride=1, padding=1
# Output: (batch, 64, 32, 32)  # Same spatial size

# Conv with kernel=3, stride=2, padding=1
# Output: (batch, 64, 16, 16)  # Halved
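When sizing architectures, it helps to wrap the formula in a small helper. A sketch (conv_output_size is a hypothetical name, not a library function):

def conv_output_size(h: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """One spatial dimension: floor((H + 2P - K) / S) + 1."""
    return (h + 2 * padding - kernel) // stride + 1

print(conv_output_size(32, kernel=3, stride=1, padding=1))  # 32
print(conv_output_size(32, kernel=3, stride=2, padding=1))  # 16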

Max Pooling

Pooling reduces spatial dimensions by taking the maximum (or average) over regions:

Input (4×4):          MaxPool (2×2, stride=2):    Output (2×2):
[1 2 3 4]                                         [6  8]
[5 6 7 8]             → max over 2×2 regions →    [14 16]
[9 10 11 12]
[13 14 15 16]

class MaxPool2d(nn.Module):
    def __init__(self, kernel_size: int, stride: int | None = None):
        super().__init__()
        self.kernel_size = kernel_size
        self.stride = stride if stride is not None else kernel_size

    def forward(self, x: t.Tensor) -> t.Tensor:
        return F.max_pool2d(x, self.kernel_size, self.stride)
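Running the 4×4 example above through this module reproduces the diagram (a minimal sketch, assuming the chapter's imports):

pool = MaxPool2d(kernel_size=2)
x = t.arange(1, 17, dtype=t.float32).view(1, 1, 4, 4)
print(pool(x).squeeze())  # tensor([[ 6.,  8.], [14., 16.]])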

A Simple CNN

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_layers = nn.Sequential(
            # (batch, 1, 28, 28) → (batch, 32, 14, 14)
            Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            ReLU(),
            MaxPool2d(2),

            # (batch, 32, 14, 14) → (batch, 64, 7, 7)
            Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            ReLU(),
            MaxPool2d(2),
        )

        self.fc_layers = nn.Sequential(
            nn.Flatten(),  # (batch, 64*7*7)
            Linear(64 * 7 * 7, 128),
            ReLU(),
            Linear(128, 10),
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = self.fc_layers(x)
        return x
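A forward pass on MNIST-shaped input confirms the shapes in the comments (a quick sketch):

model = SimpleCNN()
x = t.randn(16, 1, 28, 28)  # a batch of 16 MNIST-sized images
print(model(x).shape)       # torch.Size([16, 10]), one logit per digit class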

Capstone Connection

Convolutions in Vision Transformers:

Modern vision models (ViT, etc.) often still use convolutions, most visibly for patch embedding:

# ViT patch embedding is just a strided convolution!
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
# (batch, 3, 224, 224) → (batch, 768, 14, 14)
# = 196 patches of dimension 768
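To get the (batch, num_patches, embed_dim) layout that the transformer blocks expect, the conv output is just flattened and transposed. A sketch continuing from patch_embed above:

x = t.randn(2, 3, 224, 224)
patches = patch_embed(x)                     # (2, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (2, 196, 768)
print(tokens.shape)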

Understanding convolutions helps you understand how visual features are encoded—relevant when analyzing multimodal models for sycophancy.


🎓 Tyla's Exercise

  1. Derive the output size formula. Why is there a floor operation?

  2. How many parameters in a Conv2d(32, 64, kernel_size=3) layer? Include the bias.

  3. Prove that convolutions are equivariant to translation: if $T$ is a translation operator, then $\text{Conv}(T(x)) = T(\text{Conv}(x))$.


💻 Aaliyah's Exercise

Build a CNN for MNIST:

class MNISTConvNet(nn.Module):
    """
    Target: >98% accuracy on MNIST

    Suggested architecture:
    - Conv2d(1, 32, 3, padding=1) + ReLU + MaxPool(2)
    - Conv2d(32, 64, 3, padding=1) + ReLU + MaxPool(2)
    - Flatten + Linear(64*7*7, 256) + ReLU + Linear(256, 10)
    """
    pass

# Compare training curves: MLP vs CNN
# Which converges faster? Which achieves higher accuracy?

📚 Maneesha's Reflection

  1. Convolutions encode the assumption "nearby pixels are related." When might this assumption hurt?

  2. The progression from fully-connected → CNN → attention → state space models represents evolving assumptions about data structure. What's the trend?

  3. How would you explain convolutions to someone who only understands spreadsheet formulas?