Transformers: Tokenization & Embedding

Before a transformer can process text, it must convert words to numbers. This chapter covers how.


The Pipeline

"Hello world" → Tokenizer → [15496, 995] → Embedding → [[0.12, -0.34, ...], [...]]
    Text          →       Token IDs        →       Vectors (d_model)

Each step is lossy but necessary:

  1. Tokenization: Text → discrete integers
  2. Embedding: Integers → continuous vectors
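
A minimal end-to-end sketch of both steps, using the Hugging Face transformers library (each piece is unpacked in the sections below):

from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2")

ids = tokenizer.encode("Hello world", return_tensors="pt")  # step 1: text -> token IDs, shape (1, 2)
vectors = gpt2.wte(ids)                                      # step 2: IDs -> embedding vectors, shape (1, 2, 768)
print(ids, vectors.shape)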

Why Tokenize?

Neural networks need numbers. We could:

  1. Feed in raw characters: tiny vocabulary, but very long sequences
  2. Feed in whole words: short sequences, but a huge vocabulary and no way to handle unseen words
  3. Feed in subword pieces: a middle ground between the two

Subword tokenization (BPE, WordPiece) balances:

  1. Vocabulary size: tens of thousands of entries rather than millions
  2. Sequence length: common words stay whole, rare words split into a few pieces

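To make the tradeoff concrete, here is a small comparison sketch using the GPT-2 tokenizer (the example sentence is arbitrary):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

sentence = "Interpretability researchers love tokenization quirks"

chars = list(sentence)                 # character-level: one unit per character, very long sequences
words = sentence.split()               # word-level: short, but every new word needs its own vocab entry
subwords = tokenizer.encode(sentence)  # BPE: common words stay whole, rare words split into pieces

print(len(chars), len(words), len(subwords))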

Byte-Pair Encoding (BPE)

GPT-2's tokenizer:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Hello world"
tokens = tokenizer.encode(text)
print(tokens)  # [15496, 995]

decoded = tokenizer.decode(tokens)
print(decoded)  # "Hello world"

The vocabulary has 50,257 tokens, each mapped to an integer ID.

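A quick check, reusing the tokenizer loaded above:

print(tokenizer.vocab_size)  # 50257
print(len(tokenizer))        # 50257 - stock GPT-2 has no extra added tokens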

Token Weirdness

Tokenization is full of surprises:

# Leading spaces are part of the token!
tokenizer.encode(" Hello")  # [18435] - different from "Hello"
tokenizer.encode("Hello")   # [15496]

# Common words get single tokens
tokenizer.encode("the")     # [1169]

# Rare words get split
tokenizer.encode("Tokenization")  # [30642, 1634] = "Token" + "ization"

# Numbers are weird
tokenizer.encode("1234")    # [1065, 2682] = "12" + "34"

This matters for interpretability: the model sees token IDs, not characters.


The Embedding Layer

Each token ID maps to a learned vector:

import torch as t
import torch.nn as nn

class Embedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        # One learned d_model-dimensional vector per vocabulary entry
        self.W_E = nn.Parameter(t.randn(vocab_size, d_model))

    def forward(self, tokens: t.Tensor) -> t.Tensor:
        # tokens: (batch, seq_len) integers
        # output: (batch, seq_len, d_model) vectors
        return self.W_E[tokens]

The embedding matrix W_E has shape (vocab_size, d_model).

For GPT-2 small: (50257, 768) = 38.6M parameters just for embeddings!
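
The arithmetic, spelled out once:

vocab_size, d_model = 50257, 768
print(vocab_size * d_model)  # 38597376, i.e. ~38.6M parameters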


Positional Embeddings

Attention by itself is order-blind: permuting the input tokens just permutes the outputs, so "The cat sat" and "sat cat The" would look identical to the model.

Solution: Add position information.

class PosEmbedding(nn.Module):
    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.W_pos = nn.Parameter(t.randn(max_seq_len, d_model))

    def forward(self, seq_len: int) -> t.Tensor:
        # Returns: (seq_len, d_model)
        return self.W_pos[:seq_len]

The input to the transformer is:

x = token_embed(tokens) + pos_embed(seq_len)
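
A minimal sketch of how the two modules combine, using the Embedding and PosEmbedding classes defined above (with torch imported as t); broadcasting adds the (seq_len, d_model) positions to every sequence in the batch:

token_embed = Embedding(vocab_size=50257, d_model=768)
pos_embed = PosEmbedding(max_seq_len=1024, d_model=768)

tokens = t.randint(0, 50257, (2, 10))    # (batch=2, seq_len=10) token IDs
x = token_embed(tokens) + pos_embed(10)  # (2, 10, 768) + (10, 768) broadcasts over the batch
print(x.shape)                           # torch.Size([2, 10, 768])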

Sinusoidal vs Learned Positions

Original Transformer (sinusoidal):

import numpy as np

def sinusoidal_pos(seq_len, d_model):
    # Even dimensions get sin, odd dimensions get cos, at geometrically spaced frequencies
    position = t.arange(seq_len).unsqueeze(1)
    div_term = t.exp(t.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))

    pe = t.zeros(seq_len, d_model)
    pe[:, 0::2] = t.sin(position * div_term)
    pe[:, 1::2] = t.cos(position * div_term)
    return pe

GPT (learned): Just learn the position embeddings as parameters.

Both work. Learned embeddings are simpler and usually perform equally well.


The Beginning of Sequence Token

GPT-2 uses <|endoftext|> (token 50256) as both BOS and EOS:

# TransformerLens prepends BOS by default
model.to_tokens("Hello")  # tensor([[50256, 15496]])
#                                   ^BOS    ^Hello

Why BOS matters for interpretability:

  1. Attention heads need somewhere to attend. Heads often "rest" on the BOS position when they have nothing useful to do, so it acts as an attention sink.
  2. It gives every prompt a consistent first position, so activations are comparable across prompts.
  3. Forgetting to prepend it (or prepending it twice) silently changes attention patterns and can make results hard to reproduce.

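A quick way to see the difference, assuming a TransformerLens HookedTransformer loaded as model (its to_tokens method takes a prepend_bos flag):

model.to_tokens("Hello")                     # tensor([[50256, 15496]]) - BOS prepended
model.to_tokens("Hello", prepend_bos=False)  # tensor([[15496]])        - no BOS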

The Unembedding Layer

To get predictions, we need to go back from vectors to tokens:

class Unembed(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.W_U = nn.Parameter(t.randn(d_model, vocab_size))

    def forward(self, x: t.Tensor) -> t.Tensor:
        # x: (batch, seq_len, d_model)
        # output: (batch, seq_len, vocab_size) - logits
        return x @ self.W_U

Interesting: W_U and W_E are often tied (the same matrix, transposed). GPT-2 does this, which saves another ~38.6M parameters.

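You can check the tying on the Hugging Face GPT-2 checkpoint; the language-model head shares its weight tensor with the input embedding:

from transformers import GPT2LMHeadModel

hf_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Same underlying storage means embedding and unembedding are literally one matrix
print(hf_model.lm_head.weight.data_ptr() == hf_model.transformer.wte.weight.data_ptr())  # True
print(hf_model.lm_head.weight.shape)  # torch.Size([50257, 768])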

Capstone Connection

Tokenization details matter when you analyze sycophancy-related behavior. Keep in mind:

  1. "I" vs " I": These are different tokens! The model treats them differently.
  2. Name tokenization: "Anthropic" might be one token, "Jai" might be multiple.
  3. Politeness markers: "please", "thank you" have specific token representations.

A model might have learned associations between certain token patterns and sycophantic responses.
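
A small sketch for checking such patterns, reusing the GPT-2 tokenizer from earlier (the example strings are illustrative):

for text in ["I", " I", "please", " please", "thank you", "Anthropic"]:
    ids = tokenizer.encode(text)
    pieces = [tokenizer.decode([i]) for i in ids]
    print(f"{text!r:>12} -> {ids} {pieces}")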


🎓 Tyla's Exercise

  1. GPT-2's vocabulary size is 50,257. Why this specific number? (Hint: It's 256 bytes + 50,000 merges + 1 special token.)

  2. Derive the memory usage of the embedding layer for GPT-2 small (vocab=50257, d_model=768) in bytes.

  3. Why might sinusoidal positions generalize better to longer sequences than learned positions?


💻 Aaliyah's Exercise

Explore tokenization quirks:

def tokenization_exploration():
    """
    1. Find 5 common words that get split into multiple tokens
    2. Find a case where adding a space changes the tokenization dramatically
    3. How is the number "1000000" tokenized vs "1,000,000"?
    4. What's the longest single token in GPT-2's vocabulary?
    """
    pass

def embedding_analysis(model):
    """
    1. Get the embedding vectors for "good" and "bad"
    2. Compute their cosine similarity
    3. Find the 5 nearest neighbors to "good" in embedding space
    4. Visualize a 2D projection of 100 common word embeddings
    """
    pass

📚 Maneesha's Reflection

  1. Tokenization is a lossy compression. What information is lost?

  2. The choice of tokenizer affects what the model can learn. How might a character-level tokenizer lead to different learned representations?

  3. If you were designing a tokenizer for a specific domain (medicine, law), what considerations would you have?