Prerequisites: Tensor Basics
Before diving into neural networks, you need to think fluently in tensors.
This chapter covers the core PyTorch operations that everything else builds on.
The Foundation: What Are Tensors?
A tensor is a multi-dimensional array. That's it. But this simple abstraction is the building block of all modern deep learning.
| Dimensions | Name | Example |
|---|---|---|
| 0D | Scalar | 3.14 |
| 1D | Vector | [1, 2, 3] |
| 2D | Matrix | A 28×28 grayscale image |
| 3D | 3-tensor | A single RGB image (channels × height × width) |
| 4D | 4-tensor | A batch of RGB images (batch × channels × height × width) |
import torch as t
# Scalars
x = t.tensor(3.14)
print(f"Shape: {x.shape}") # torch.Size([])
# Vectors
v = t.tensor([1, 2, 3])
print(f"Shape: {v.shape}") # torch.Size([3])
# Matrices
m = t.randn(3, 4)
print(f"Shape: {m.shape}") # torch.Size([3, 4])
# 4D tensor (batch of RGB images)
imgs = t.randn(32, 3, 28, 28)
print(f"Shape: {imgs.shape}") # torch.Size([32, 3, 28, 28])
Creating Tensors
Common Creation Functions
# Zeros and ones
zeros = t.zeros(3, 4)
ones = t.ones(3, 4)
# Random tensors
uniform = t.rand(3, 4) # Uniform [0, 1)
normal = t.randn(3, 4) # Normal (mean=0, std=1)
integers = t.randint(0, 10, (3, 4)) # Random integers
# Sequences
arange = t.arange(0, 10) # [0, 1, 2, ..., 9]
linspace = t.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]
# From existing data (NumPy interop needs numpy imported)
import numpy as np
from_list = t.tensor([1, 2, 3])
from_numpy = t.from_numpy(np.array([1, 2, 3]))
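One detail worth knowing: t.from_numpy shares memory with the source array, while t.tensor copies the data. A small sketch:
import numpy as np

arr = np.array([1, 2, 3])
shared = t.from_numpy(arr)   # shares memory with arr
copied = t.tensor(arr)       # copies the data

arr[0] = 99
print(shared)  # tensor([99,  2,  3]) -- reflects the change to arr
print(copied)  # tensor([1, 2, 3])    -- unaffected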
The device Parameter
GPUs are essential for deep learning. Every tensor lives on a device: the CPU, a CUDA GPU, or Apple Silicon's MPS backend.
# Check available device
device = t.device(
    "mps" if t.backends.mps.is_available()
    else "cuda" if t.cuda.is_available()
    else "cpu"
)
# Create tensor on device
x = t.randn(3, 4, device=device)
# Move tensor to device
y = t.randn(3, 4).to(device)
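Tensors on different devices can't be combined directly; one of them has to be moved first. A minimal sketch continuing from the snippet above (if device resolved to "cpu", the addition simply succeeds):
cpu_tensor = t.randn(3, 4)      # lives on the CPU by default
try:
    z = cpu_tensor + x          # fails if x is on a GPU/MPS device
except RuntimeError as e:
    print(f"Device mismatch: {e}")

# Move results back to the CPU (e.g., before converting to NumPy)
y_cpu = y.cpu()
print(y_cpu.device)             # cpu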
Indexing and Slicing
PyTorch indexing follows NumPy conventions:
x = t.arange(12).reshape(3, 4)
# tensor([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
# Single element
x[1, 2] # tensor(6)
# Rows and columns
x[0] # First row: tensor([0, 1, 2, 3])
x[:, 0] # First column: tensor([0, 4, 8])
# Slicing
x[1:3] # Rows 1-2
x[:, 1:3] # Columns 1-2
x[::2] # Every other row
x[:, ::2] # Every other column
# Negative indexing
x[-1] # Last row
x[:, -1] # Last column
Advanced Indexing
# Boolean indexing
mask = x > 5
x[mask] # tensor([ 6, 7, 8, 9, 10, 11])
# Integer array indexing
indices = t.tensor([0, 2])
x[indices] # Rows 0 and 2
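Boolean masks also work for in-place assignment, which replaces what would otherwise be an explicit loop. A small sketch using the same x:
y = x.clone()
y[y > 5] = 0        # zero out every element greater than 5
# tensor([[0, 1, 2, 3],
#         [4, 5, 0, 0],
#         [0, 0, 0, 0]])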
Broadcasting
Broadcasting automatically expands tensor shapes so that elementwise operations work on tensors of different but compatible shapes:
# Scalar + tensor
x = t.ones(3, 4)
y = x + 1 # Broadcasts scalar to (3, 4)
# Vector + matrix
v = t.tensor([1, 2, 3, 4]) # Shape: (4,)
m = t.ones(3, 4) # Shape: (3, 4)
result = m + v # v broadcasts to (3, 4)
# The rule: dimensions are compatible if they're equal or one of them is 1
# Shapes are aligned from the right
Broadcasting Rules
- Align shapes from the right
- Dimensions are compatible if equal or one is 1
- Missing dimensions are treated as 1
# Shape (3, 1) + Shape (1, 4) = Shape (3, 4)
a = t.ones(3, 1)
b = t.ones(1, 4)
c = a + b # Shape: (3, 4)
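A common use of broadcasting is normalizing a matrix by per-column statistics without a loop. A sketch (the data here is arbitrary):
data = t.randn(100, 4)                   # 100 samples, 4 features
mean = data.mean(dim=0, keepdim=True)    # shape (1, 4)
std = data.std(dim=0, keepdim=True)      # shape (1, 4)
standardized = (data - mean) / std       # broadcasts to (100, 4)
print(standardized.mean(dim=0))          # approximately zero per column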
Reshaping Operations
view and reshape
x = t.arange(12)
# Reshape to 3x4
y = x.view(3, 4)
z = x.reshape(3, 4)
# Use -1 to infer dimension
w = x.view(3, -1) # Infers 4 for second dimension
Key difference: view requires contiguous memory and never copies data; reshape works on any tensor, copying only when it must.
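To see the difference, try flattening a transposed tensor, which is no longer contiguous. A quick sketch:
a = t.arange(12).reshape(3, 4)
b = a.t()                     # transpose: shape (4, 3), non-contiguous
print(b.is_contiguous())      # False

# b.view(12) would raise a RuntimeError here
c = b.reshape(12)             # works: copies the data when it has to
d = b.contiguous().view(12)   # also works: make a contiguous copy first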
squeeze and unsqueeze
# Remove dimensions of size 1
x = t.zeros(1, 3, 1, 4)
x.squeeze().shape # torch.Size([3, 4])
x.squeeze(0).shape # torch.Size([3, 1, 4])
# Add dimensions of size 1
y = t.zeros(3, 4)
y.unsqueeze(0).shape # torch.Size([1, 3, 4])
y.unsqueeze(-1).shape # torch.Size([3, 4, 1])
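In practice, unsqueeze is most useful for setting up broadcasting, for example scaling each row of a matrix by its own factor. A small sketch:
m = t.ones(3, 4)
row_scale = t.tensor([1.0, 2.0, 3.0])   # one factor per row, shape (3,)
scaled = m * row_scale.unsqueeze(1)     # (3, 4) * (3, 1) -> (3, 4)
# Row 0 stays all 1s, row 1 becomes all 2s, row 2 becomes all 3s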
permute and transpose
x = t.randn(2, 3, 4)
# Reorder all dimensions at once
y = x.permute(2, 0, 1) # Shape: (4, 2, 3)
# Swap two dimensions
z = x.transpose(0, 2) # Shape: (4, 3, 2)
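A common real case: matplotlib's imshow expects an image as (height, width, channels), while PyTorch stores it as (channels, height, width). A sketch of the conversion, assuming a single image:
img_chw = t.rand(3, 28, 28)          # channels, height, width
img_hwc = img_chw.permute(1, 2, 0)   # height, width, channels
print(img_hwc.shape)                 # torch.Size([28, 28, 3])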
Capstone Connection
Why does this matter for sycophancy detection?
When you analyze attention patterns in transformers, you're working with 4D tensors:
attention_pattern = t.randn(batch, n_heads, seq_len, seq_len)
To find which heads "look at" user preferences vs. factual content, you combine the operations from this chapter (a sketch follows this list):
- Index into specific heads: `attention_pattern[:, head_idx]`
- Average across batches: `attention_pattern.mean(dim=0)`
- Visualize specific positions: `attention_pattern[:, :, query_pos, key_pos]`
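A minimal sketch putting those together (the shapes and names here are illustrative, not from a real model):
batch, n_heads, seq_len = 8, 12, 64
attention_pattern = t.randn(batch, n_heads, seq_len, seq_len)

head_3 = attention_pattern[:, 3]              # one head: shape (8, 64, 64)
avg_pattern = attention_pattern.mean(dim=0)   # batch average: shape (12, 64, 64)
scores = attention_pattern[:, :, 10, 5]       # attention from query 10 to key 5: shape (8, 12)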
Tensor fluency is invisible infrastructure. Master it now.
🎓 Tyla's Exercise
Without running the code, predict the output shapes:
a = t.randn(4, 1, 3)
b = t.randn(1, 5, 3)
# 1. What is (a + b).shape?
# 2. What is a.squeeze().shape?
# 3. What is b.permute(2, 0, 1).shape?
Then verify. Reflection: How does broadcasting help you avoid explicit loops?
💻 Aaliyah's Exercise
Implement these functions without loops:
def outer_product(a, b):
    """
    a: shape (n,)
    b: shape (m,)
    Returns: shape (n, m) outer product
    Hint: Use unsqueeze and broadcasting
    """
    pass

def batch_matrix_vector(matrices, vectors):
    """
    matrices: shape (batch, n, m)
    vectors: shape (batch, m)
    Returns: shape (batch, n) - each matrix times its vector
    Hint: Use einsum or unsqueeze + broadcasting
    """
    pass
📚 Maneesha's Reflection
Why do you think PyTorch uses "broadcasting" instead of requiring explicit shape matching? What are the pedagogical trade-offs?
The dimension ordering (batch, channels, height, width) is a convention, not a requirement. Why do you think this convention emerged? What would change if we used (batch, height, width, channels)?
How would you explain tensor reshaping to someone who only knows spreadsheets?