Transformers: Tokenization & Embedding
Before a transformer can process text, it must convert words to numbers. This chapter covers how.
The Pipeline
"Hello world" → Tokenizer → [15496, 995] → Embedding → [[0.12, -0.34, ...], [...]]
Text → Token IDs → Vectors (d_model)
Each step is lossy but necessary:
- Tokenization: Text → discrete integers
- Embedding: Integers → continuous vectors
Why Tokenize?
Neural networks need numbers. We could:
- Character-level: 'H', 'e', 'l', 'l', 'o' → 5 tokens
- Word-level: "Hello" → 1 token
- Subword: "Tokenization" → "Token" + "ization" → 2 tokens
Subword tokenization (BPE, WordPiece) balances:
- Vocabulary size (not too large)
- Sequence length (not too long)
- Rare word handling (subwords can combine)
Byte-Pair Encoding (BPE)
GPT-2's tokenizer:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "Hello world"
tokens = tokenizer.encode(text)
print(tokens) # [15496, 995]
decoded = tokenizer.decode(tokens)
print(decoded) # "Hello world"
The vocabulary has 50,257 tokens, each mapped to an integer ID.
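To build intuition for where those merges come from, here is a toy sketch of the BPE training loop on made-up word counts. This is a simplification for illustration only: GPT-2's real tokenizer operates on raw bytes and applies 50,000 learned merge rules.

from collections import Counter

def most_frequent_pair(word_counts):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency
    pairs = Counter()
    for symbols, freq in word_counts.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(word_counts, pair):
    # Rewrite every word, replacing each occurrence of `pair` with one merged symbol
    merged = {}
    for symbols, freq in word_counts.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters, mapped to its frequency
word_counts = {tuple("hello"): 5, tuple("hell"): 3, tuple("help"): 2}
for _ in range(3):
    pair = most_frequent_pair(word_counts)
    word_counts = merge_pair(word_counts, pair)
    print("merged", pair, "->", list(word_counts))

Each merge adds one new symbol to the vocabulary; after enough merges, frequent words become single tokens while rare words stay split into pieces.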
Token Weirdness
Tokenization is full of surprises:
# Leading spaces are part of the token!
tokenizer.encode(" Hello") # [18435] - different from "Hello"
tokenizer.encode("Hello") # [15496]
# Common words get single tokens
tokenizer.encode("the") # [1169]
# Rare words get split
tokenizer.encode("Tokenization") # [30642, 1634] = "Token" + "ization"
# Numbers are weird
tokenizer.encode("1234") # [1065, 2682] = "12" + "34"
This matters for interpretability: the model sees token IDs, not characters.
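A quick way to see the string pieces behind the IDs, using the tokenizer loaded above (in GPT-2's byte-level alphabet, Ġ marks a leading space):

print(tokenizer.tokenize("Tokenization"))   # ['Token', 'ization']
print(tokenizer.tokenize(" Hello"))         # ['ĠHello'] - the space is part of the token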
The Embedding Layer
Each token ID maps to a learned vector:
import torch as t
import torch.nn as nn

class Embedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.W_E = nn.Parameter(t.randn(vocab_size, d_model))

    def forward(self, tokens: t.Tensor) -> t.Tensor:
        # tokens: (batch, seq_len) integers
        # output: (batch, seq_len, d_model) vectors
        return self.W_E[tokens]
The embedding matrix W_E has shape (vocab_size, d_model).
For GPT-2 small: (50257, 768) = 38.6M parameters just for embeddings!
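A quick shape and parameter-count check using the Embedding module above (random weights here, not GPT-2's):

embed = Embedding(vocab_size=50257, d_model=768)
tokens = t.tensor([[15496, 995]])                   # "Hello world" as (batch=1, seq_len=2)
print(embed(tokens).shape)                          # torch.Size([1, 2, 768])
print(sum(p.numel() for p in embed.parameters()))   # 38597376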
Positional Embeddings
Attention has no built-in notion of order: without position information, "The cat sat" and "sat cat The" would look the same to the model.
Solution: Add position information.
class PosEmbedding(nn.Module):
    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.W_pos = nn.Parameter(t.randn(max_seq_len, d_model))

    def forward(self, seq_len: int) -> t.Tensor:
        # Returns: (seq_len, d_model) - one learned vector per position
        return self.W_pos[:seq_len]
The input to the transformer is:
x = token_embed(tokens) + pos_embed(seq_len)
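Putting the two together, a sketch with GPT-2-sized shapes (the (seq_len, d_model) positional embeddings broadcast across the batch dimension):

token_embed = Embedding(vocab_size=50257, d_model=768)
pos_embed = PosEmbedding(max_seq_len=1024, d_model=768)

tokens = t.tensor([[50256, 15496, 995]])     # BOS + "Hello world", shape (batch=1, seq_len=3)
x = token_embed(tokens) + pos_embed(tokens.shape[1])
print(x.shape)                               # torch.Size([1, 3, 768])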
Sinusoidal vs Learned Positions
Original Transformer (sinusoidal):
import numpy as np

def sinusoidal_pos(seq_len: int, d_model: int) -> t.Tensor:
    # Fixed (not learned) position encodings; assumes d_model is even.
    position = t.arange(seq_len).unsqueeze(1)
    div_term = t.exp(t.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = t.zeros(seq_len, d_model)
    pe[:, 0::2] = t.sin(position * div_term)
    pe[:, 1::2] = t.cos(position * div_term)
    return pe
GPT (learned): Just learn the position embeddings as parameters.
Both work. Learned is simpler and usually performs equally well.
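One practical difference, relevant to the exercise below: sinusoidal positions can be generated on the fly for any length, while learned positions only exist up to max_seq_len.

print(sinusoidal_pos(2048, 768).shape)   # torch.Size([2048, 768]) - works for any length
pos_embed = PosEmbedding(max_seq_len=1024, d_model=768)
print(pos_embed(2048).shape)             # torch.Size([1024, 768]) - silently capped at max_seq_len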
The Beginning-of-Sequence (BOS) Token
GPT-2 uses <|endoftext|> (token 50256) as both BOS and EOS:
# TransformerLens prepends BOS by default
model.to_tokens("Hello") # tensor([[50256, 15496]])
# ^BOS ^Hello
Why BOS matters for interpretability:
- Attention patterns need somewhere to "rest"
- The first real token often has unusual patterns
- Always be aware of whether BOS is included! (See the check below.)
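A quick way to check what you are actually feeding the model, assuming a TransformerLens HookedTransformer loaded as model (as in the to_tokens example above):

print(model.to_str_tokens("Hello"))                  # ['<|endoftext|>', 'Hello']
print(model.to_tokens("Hello", prepend_bos=False))   # tensor([[15496]]) - no BOS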
The Unembedding Layer
To get predictions, we need to go back from vectors to tokens:
class Unembed(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.W_U = nn.Parameter(t.randn(d_model, vocab_size))

    def forward(self, x: t.Tensor) -> t.Tensor:
        # x: (batch, seq_len, d_model)
        # output: (batch, seq_len, vocab_size) - logits
        return x @ self.W_U
Interesting: W_U and W_E are often tied (same matrix transposed).
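Here is a sketch of what tying looks like, reusing W_E from the Embedding module above. (GPT-2's released weights tie the two; exactly how that is exposed depends on the framework.)

class TiedUnembed(nn.Module):
    def __init__(self, embed: Embedding):
        super().__init__()
        self.embed = embed   # share the existing embedding, no new weights

    def forward(self, x: t.Tensor) -> t.Tensor:
        # (batch, seq_len, d_model) @ (d_model, vocab_size) -> logits
        return x @ self.embed.W_E.T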
Capstone Connection
Tokenization and sycophancy:
When analyzing model behavior:
- "I" vs " I": These are different tokens! The model treats them differently.
- Name tokenization: "Anthropic" might be one token, "Jai" might be multiple.
- Politeness markers: "please", "thank you" have specific token representations.
A model might have learned associations between certain token patterns and sycophantic responses.
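A couple of quick checks along these lines, using the GPT-2 tokenizer from earlier (exact IDs and splits are left for you to run):

print(tokenizer.encode("I"), tokenizer.encode(" I"))    # two different single-token IDs
print(tokenizer.tokenize("Anthropic"))                  # how many pieces does a name become?
print(tokenizer.tokenize(" please"), tokenizer.tokenize(" thank you"))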
🎓 Tyla's Exercise
GPT-2's vocabulary size is 50,257. Why this specific number? (Hint: It's 256 bytes + 50,000 merges + 1 special token.)
Derive the memory usage of the embedding layer for GPT-2 small (vocab=50257, d_model=768) in bytes, assuming float32 weights.
Why might sinusoidal positions generalize better to longer sequences than learned positions?
💻 Aaliyah's Exercise
Explore tokenization quirks:
def tokenization_exploration():
    """
    1. Find 5 common words that get split into multiple tokens
    2. Find a case where adding a space changes the tokenization dramatically
    3. How is the number "1000000" tokenized vs "1,000,000"?
    4. What's the longest single token in GPT-2's vocabulary?
    """
    pass

def embedding_analysis(model):
    """
    1. Get the embedding vectors for "good" and "bad"
    2. Compute their cosine similarity
    3. Find the 5 nearest neighbors to "good" in embedding space
    4. Visualize a 2D projection of 100 common word embeddings
    """
    pass
📚 Maneesha's Reflection
Tokenization is a lossy compression. What information is lost?
The choice of tokenizer affects what the model can learn. How might a character-level tokenizer lead to different learned representations?
If you were designing a tokenizer for a specific domain (medicine, law), what considerations would you have?