Prerequisites: Tensor Basics

Before diving into neural networks, you need to think fluently in tensors.

This chapter covers the core PyTorch operations that everything else builds on.


The Foundation: What Are Tensors?

A tensor is a multi-dimensional array. That's it. But this simple abstraction is the building block of all modern deep learning.

Dimensions   Name       Example
0D           Scalar     3.14
1D           Vector     [1, 2, 3]
2D           Matrix     A 28×28 grayscale image
3D           3-tensor   A batch of grayscale images
4D           4-tensor   A batch of RGB images (batch × channels × height × width)

import torch as t

# Scalars
x = t.tensor(3.14)
print(f"Shape: {x.shape}")  # torch.Size([])

# Vectors
v = t.tensor([1, 2, 3])
print(f"Shape: {v.shape}")  # torch.Size([3])

# Matrices
m = t.randn(3, 4)
print(f"Shape: {m.shape}")  # torch.Size([3, 4])

# 4D tensor (batch of RGB images)
imgs = t.randn(32, 3, 28, 28)
print(f"Shape: {imgs.shape}")  # torch.Size([32, 3, 28, 28])

Creating Tensors

Common Creation Functions

# Zeros and ones
zeros = t.zeros(3, 4)
ones = t.ones(3, 4)

# Random tensors
uniform = t.rand(3, 4)      # Uniform [0, 1)
normal = t.randn(3, 4)      # Normal (mean=0, std=1)
integers = t.randint(0, 10, (3, 4))  # Random integers in [0, 10)

# Sequences
arange = t.arange(0, 10)           # [0, 1, 2, ..., 9]
linspace = t.linspace(0, 1, 5)     # [0, 0.25, 0.5, 0.75, 1]

# From existing data (NumPy interop requires NumPy)
import numpy as np

from_list = t.tensor([1, 2, 3])
from_numpy = t.from_numpy(np.array([1, 2, 3]))

The device Parameter

GPUs are essential for deep learning. Every tensor lives on a device:

# Check available device
device = t.device(
    "mps" if t.backends.mps.is_available()
    else "cuda" if t.cuda.is_available()
    else "cpu"
)

# Create tensor on device
x = t.randn(3, 4, device=device)

# Move tensor to device
y = t.randn(3, 4).to(device)

Indexing and Slicing

PyTorch indexing follows NumPy conventions:

x = t.arange(12).reshape(3, 4)
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]])

# Single element
x[1, 2]  # tensor(6)

# Rows and columns
x[0]     # First row: tensor([0, 1, 2, 3])
x[:, 0]  # First column: tensor([0, 4, 8])

# Slicing
x[1:3]        # Rows 1-2
x[:, 1:3]     # Columns 1-2
x[::2]        # Every other row
x[:, ::2]     # Every other column

# Negative indexing
x[-1]         # Last row
x[:, -1]      # Last column

Advanced Indexing

# Boolean indexing
mask = x > 5
x[mask]  # tensor([ 6,  7,  8,  9, 10, 11])

# Integer array indexing
indices = t.tensor([0, 2])
x[indices]  # Rows 0 and 2

Broadcasting

Broadcasting automatically expands tensor shapes to make operations work:

# Scalar + tensor
x = t.ones(3, 4)
y = x + 1  # Broadcasts scalar to (3, 4)

# Vector + matrix
v = t.tensor([1, 2, 3, 4])  # Shape: (4,)
m = t.ones(3, 4)            # Shape: (3, 4)
result = m + v              # v broadcasts to (3, 4)

# The rule: dimensions are compatible if they're equal or one of them is 1
# Shapes are aligned from the right

Broadcasting Rules

  1. Align shapes from the right
  2. Dimensions are compatible if equal or one is 1
  3. Missing dimensions are treated as 1

# Shape (3, 1) + Shape (1, 4) = Shape (3, 4)
a = t.ones(3, 1)
b = t.ones(1, 4)
c = a + b  # Shape: (3, 4)
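
As a quick illustration of how broadcasting replaces explicit loops, here is a minimal sketch that normalizes a batch of random images with per-channel statistics (the sizes and statistics are made up for demonstration):

imgs = t.randn(32, 3, 28, 28)              # batch of RGB images
channel_mean = imgs.mean(dim=(0, 2, 3))    # per-channel mean, shape (3,)
channel_std = imgs.std(dim=(0, 2, 3))      # per-channel std, shape (3,)

# Reshape the stats to (3, 1, 1) so they broadcast over height and width
normalized = (imgs - channel_mean.view(3, 1, 1)) / channel_std.view(3, 1, 1)
print(normalized.shape)  # torch.Size([32, 3, 28, 28])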

Reshaping Operations

view and reshape

x = t.arange(12)

# Reshape to 3x4
y = x.view(3, 4)
z = x.reshape(3, 4)

# Use -1 to infer dimension
w = x.view(3, -1)  # Infers 4 for second dimension

Key difference: view requires contiguous memory; reshape works on any tensor, copying the data if necessary.
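
A minimal sketch of that difference, using a transpose to produce a non-contiguous view:

x = t.arange(6).reshape(2, 3)
y = x.t()                    # transpose returns a non-contiguous view

# y.view(6)                  # would raise a RuntimeError: view needs contiguous memory
z = y.reshape(6)             # works: reshape copies the data when it has to
w = y.contiguous().view(6)   # also works: make the tensor contiguous first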

squeeze and unsqueeze

# Remove dimensions of size 1
x = t.zeros(1, 3, 1, 4)
x.squeeze().shape       # torch.Size([3, 4])
x.squeeze(0).shape      # torch.Size([3, 1, 4])

# Add dimensions of size 1
y = t.zeros(3, 4)
y.unsqueeze(0).shape    # torch.Size([1, 3, 4])
y.unsqueeze(-1).shape   # torch.Size([3, 4, 1])

permute and transpose

x = t.randn(2, 3, 4)

# Reorder all dimensions
y = x.permute(2, 0, 1)  # Shape: (4, 2, 3)

# Swap two dimensions
z = x.transpose(0, 2)   # Shape: (4, 3, 2)

Capstone Connection

Why does this matter for sycophancy detection?

When you analyze attention patterns in transformers, you're working with 4D tensors:

batch, n_heads, seq_len = 8, 12, 64  # illustrative sizes
attention_pattern = t.randn(batch, n_heads, seq_len, seq_len)

To find which heads "look at" user preferences vs. factual content:

  1. Index into specific heads: attention_pattern[:, head_idx]
  2. Average across batches: attention_pattern.mean(dim=0)
  3. Visualize specific positions: attention_pattern[:, :, query_pos, key_pos]
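
Continuing with the illustrative attention_pattern above, here is a rough sketch of those three steps (head_idx, query_pos, and key_pos are arbitrary example values, not taken from a real model):

head_idx = 3
per_head = attention_pattern[:, head_idx]              # one head across the batch: shape (8, 64, 64)

avg_pattern = attention_pattern.mean(dim=0)            # average over the batch: shape (12, 64, 64)

query_pos, key_pos = 5, 0
scores = attention_pattern[:, :, query_pos, key_pos]   # each head's attention from pos 5 to pos 0: shape (8, 12)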

Tensor fluency is invisible infrastructure. Master it now.


🎓 Tyla's Exercise

Without running the code, predict the output shapes:

a = t.randn(4, 1, 3)
b = t.randn(1, 5, 3)

# 1. What is (a + b).shape?
# 2. What is a.squeeze().shape?
# 3. What is b.permute(2, 0, 1).shape?

Then verify. Reflection: How does broadcasting help you avoid explicit loops?


💻 Aaliyah's Exercise

Implement these functions without loops:

def outer_product(a, b):
    """
    a: shape (n,)
    b: shape (m,)
    Returns: shape (n, m) outer product
    Hint: Use unsqueeze and broadcasting
    """
    pass

def batch_matrix_vector(matrices, vectors):
    """
    matrices: shape (batch, n, m)
    vectors: shape (batch, m)
    Returns: shape (batch, n) - each matrix times its vector
    Hint: Use einsum or unsqueeze + broadcasting
    """
    pass

📚 Maneesha's Reflection

  1. Why do you think PyTorch uses "broadcasting" instead of requiring explicit shape matching? What are the pedagogical trade-offs?

  2. The dimension ordering (batch, channels, height, width) is a convention, not a requirement. Why do you think this convention emerged? What would change if we used (batch, height, width, channels)?

  3. How would you explain tensor reshaping to someone who only knows spreadsheets?