Transformers: Tokenization & Embedding

Before a transformer can process text, it must convert words to numbers. This chapter covers how.


The Pipeline

"Hello world" → Tokenizer → [15496, 995] → Embedding → [[0.12, -0.34, ...], [...]]
    Text          →       Token IDs        →       Vectors (d_model)

Each step is lossy but necessary:

  1. Tokenization: Text → discrete integers
  2. Embedding: Integers → continuous vectors
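
A minimal end-to-end sketch of both steps, using the Hugging Face transformers library (each piece is unpacked in the sections below):

from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2")

ids = tokenizer.encode("Hello world", return_tensors="pt")  # step 1: text -> token IDs, shape (1, 2)
vectors = gpt2.wte(ids)                                      # step 2: IDs -> embedding vectors, shape (1, 2, 768)
print(ids, vectors.shape)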

Why Tokenize?

Neural networks need numbers. We could:

  1. Feed in raw characters: tiny vocabulary, but very long sequences
  2. Feed in whole words: short sequences, but a huge vocabulary and no way to handle unseen words
  3. Feed in subword pieces: a middle ground between the two

Subword tokenization (BPE, WordPiece) balances:

  1. Vocabulary size: tens of thousands of entries rather than millions
  2. Sequence length: common words stay whole, rare words split into a few pieces

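To make the tradeoff concrete, here is a small comparison sketch using the GPT-2 tokenizer (the example sentence is arbitrary):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

sentence = "Interpretability researchers love tokenization quirks"

chars = list(sentence)                 # character-level: one unit per character, very long sequences
words = sentence.split()               # word-level: short, but every new word needs its own vocab entry
subwords = tokenizer.encode(sentence)  # BPE: common words stay whole, rare words split into pieces

print(len(chars), len(words), len(subwords))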

Byte-Pair Encoding (BPE)

GPT-2's tokenizer:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Hello world"
tokens = tokenizer.encode(text)
print(tokens)  # [15496, 995]

decoded = tokenizer.decode(tokens)
print(decoded)  # "Hello world"

The vocabulary has 50,257 tokens, each mapped to an integer ID.

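A quick check, reusing the tokenizer loaded above:

print(tokenizer.vocab_size)  # 50257
print(len(tokenizer))        # 50257 - stock GPT-2 has no extra added tokens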

Token Weirdness

Tokenization is full of surprises:

# Leading spaces are part of the token!
tokenizer.encode(" Hello")  # [18435] - different from "Hello"
tokenizer.encode("Hello")   # [15496]

# Common words get single tokens
tokenizer.encode("the")     # [1169]

# Rare words get split
tokenizer.encode("Tokenization")  # [30642, 1634] = "Token" + "ization"

# Numbers are weird
tokenizer.encode("1234")    # [1065, 2682] = "12" + "34"

This matters for interpretability: the model sees token IDs, not characters.


The Embedding Layer

Each token ID maps to a learned vector:

import torch as t
import torch.nn as nn

class Embedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        # One learned d_model-dimensional vector per vocabulary entry
        self.W_E = nn.Parameter(t.randn(vocab_size, d_model))

    def forward(self, tokens: t.Tensor) -> t.Tensor:
        # tokens: (batch, seq_len) integers
        # output: (batch, seq_len, d_model) vectors
        return self.W_E[tokens]

The embedding matrix W_E has shape (vocab_size, d_model).

For GPT-2 small: (50257, 768) = 38.6M parameters just for embeddings!
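
The arithmetic, spelled out once:

vocab_size, d_model = 50257, 768
print(vocab_size * d_model)  # 38597376, i.e. ~38.6M parameters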


Positional Embeddings

Attention by itself is order-blind: permuting the input tokens just permutes the outputs, so "The cat sat" and "sat cat The" would look identical to the model.

Solution: Add position information.

class PosEmbedding(nn.Module):
    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.W_pos = nn.Parameter(t.randn(max_seq_len, d_model))

    def forward(self, seq_len: int) -> t.Tensor:
        # Returns: (seq_len, d_model)
        return self.W_pos[:seq_len]

The input to the transformer is:

x = token_embed(tokens) + pos_embed(seq_len)
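
A minimal sketch of how the two modules combine, using the Embedding and PosEmbedding classes defined above (with torch imported as t); broadcasting adds the (seq_len, d_model) positions to every sequence in the batch:

token_embed = Embedding(vocab_size=50257, d_model=768)
pos_embed = PosEmbedding(max_seq_len=1024, d_model=768)

tokens = t.randint(0, 50257, (2, 10))    # (batch=2, seq_len=10) token IDs
x = token_embed(tokens) + pos_embed(10)  # (2, 10, 768) + (10, 768) broadcasts over the batch
print(x.shape)                           # torch.Size([2, 10, 768])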

Sinusoidal vs Learned Positions

Original Transformer (sinusoidal):

import numpy as np

def sinusoidal_pos(seq_len, d_model):
    # Even dimensions get sin, odd dimensions get cos, at geometrically spaced frequencies
    position = t.arange(seq_len).unsqueeze(1)
    div_term = t.exp(t.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))

    pe = t.zeros(seq_len, d_model)
    pe[:, 0::2] = t.sin(position * div_term)
    pe[:, 1::2] = t.cos(position * div_term)
    return pe

GPT (learned): Just learn the position embeddings as parameters.

Both work. Learned embeddings are simpler and usually perform equally well.


The Beginning of Sequence Token

GPT-2 uses <|endoftext|> (token 50256) as both BOS and EOS:

# TransformerLens prepends BOS by default
model.to_tokens("Hello")  # tensor([[50256, 15496]])
#                                   ^BOS    ^Hello

Why BOS matters for interpretability:

  1. Attention heads need somewhere to attend. Heads often "rest" on the BOS position when they have nothing useful to do, so it acts as an attention sink.
  2. It gives every prompt a consistent first position, so activations are comparable across prompts.
  3. Forgetting to prepend it (or prepending it twice) silently changes attention patterns and can make results hard to reproduce.

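A quick way to see the difference, assuming a TransformerLens HookedTransformer loaded as model (its to_tokens method takes a prepend_bos flag):

model.to_tokens("Hello")                     # tensor([[50256, 15496]]) - BOS prepended
model.to_tokens("Hello", prepend_bos=False)  # tensor([[15496]])        - no BOS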

The Unembedding Layer

To get predictions, we need to go back from vectors to tokens:

class Unembed(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.W_U = nn.Parameter(t.randn(d_model, vocab_size))

    def forward(self, x: t.Tensor) -> t.Tensor:
        # x: (batch, seq_len, d_model)
        # output: (batch, seq_len, vocab_size) - logits
        return x @ self.W_U

Interesting: W_U and W_E are often tied (the same matrix, transposed). GPT-2 does this, which saves another ~38.6M parameters.

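You can check the tying on the Hugging Face GPT-2 checkpoint; the language-model head shares its weight tensor with the input embedding:

from transformers import GPT2LMHeadModel

hf_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Same underlying storage means embedding and unembedding are literally one matrix
print(hf_model.lm_head.weight.data_ptr() == hf_model.transformer.wte.weight.data_ptr())  # True
print(hf_model.lm_head.weight.shape)  # torch.Size([50257, 768])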

Capstone Connection

Tokenization details matter when you analyze sycophancy-related behavior. Keep in mind:

  1. "I" vs " I": These are different tokens! The model treats them differently.
  2. Name tokenization: "Anthropic" might be one token, "Jai" might be multiple.
  3. Politeness markers: "please", "thank you" have specific token representations.

A model might have learned associations between certain token patterns and sycophantic responses.
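
A small sketch for checking such patterns, reusing the GPT-2 tokenizer from earlier (the example strings are illustrative):

for text in ["I", " I", "please", " please", "thank you", "Anthropic"]:
    ids = tokenizer.encode(text)
    pieces = [tokenizer.decode([i]) for i in ids]
    print(f"{text!r:>12} -> {ids} {pieces}")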


🎓 Tyla's Exercise

  1. GPT-2's vocabulary size is 50,257. Why this specific number? (Hint: It's 256 bytes + 50,000 merges + 1 special token.)

  2. Derive the memory usage of the embedding layer for GPT-2 small (vocab=50257, d_model=768) in bytes.

  3. Why might sinusoidal positions generalize better to longer sequences than learned positions?


💻 Aaliyah's Exercise

Explore tokenization quirks:

def tokenization_exploration():
    """
    1. Find 5 common words that get split into multiple tokens
    2. Find a case where adding a space changes the tokenization dramatically
    3. How is the number "1000000" tokenized vs "1,000,000"?
    4. What's the longest single token in GPT-2's vocabulary?
    """
    pass

def embedding_analysis(model):
    """
    1. Get the embedding vectors for "good" and "bad"
    2. Compute their cosine similarity
    3. Find the 5 nearest neighbors to "good" in embedding space
    4. Visualize a 2D projection of 100 common word embeddings
    """
    pass

📚 Maneesha's Reflection

  1. Tokenization is a lossy compression. What information is lost?

  2. The choice of tokenizer affects what the model can learn. How might a character-level tokenizer lead to different learned representations?

  3. If you were designing a tokenizer for a specific domain (medicine, law), what considerations would you have?