RLHF: Aligning Language Models with Human Preferences
Supervised learning teaches models what to say. RLHF teaches them how to say it.
What Is RLHF?
Reinforcement Learning from Human Feedback is a training paradigm that optimizes language models to produce outputs humans prefer.
The key insight: humans struggle to write perfect outputs, but they excel at comparing outputs. RLHF exploits this asymmetry.
# Traditional supervised learning:
# "Here's the correct answer. Learn it."
model.train(input="What is 2+2?", output="4")
# RLHF:
# "Here are two answers. This one is better."
model.train(
    input="How do I apologize?",
    preferred="I understand I hurt you. I'm genuinely sorry.",
    rejected="Sorry I guess."
)
Why RLHF for Language Models?
Three problems with supervised fine-tuning alone:
Problem 1: Specification is hard
How do you write the "correct" response to "Tell me a joke"? There are infinitely many valid answers, and writing gold-label datasets for open-ended tasks is infeasible.
Problem 2: Imitation has limits
Supervised learning imitates human text, but humans write differently from how they want an AI assistant to write. We want AI that's helpful, not AI that sounds like random internet comments.
Problem 3: Preference is comparative
"Is this response good?" is hard. "Is response A better than response B?" is tractable.
The RLHF Pipeline Overview
Three stages, each with distinct training:
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Stage 1:     │     │    Stage 2:     │     │    Stage 3:     │
│   Supervised    │ ──► │  Reward Model   │ ──► │  RL Fine-tune   │
│   Fine-tuning   │     │    Training     │     │    with PPO     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         ▲                       ▲                       ▲
         │                       │                       │
   Human-written             Preference              Reward signal
  demonstrations            comparisons               from RM
Stage 1: Start with a pretrained LLM. Fine-tune on high-quality demonstrations.
Stage 2: Collect preference data (A vs B comparisons). Train a reward model to predict preferences.
Stage 3: Use the reward model to provide learning signal. Optimize the LLM with reinforcement learning.
Human Preference Data
The foundation of RLHF: humans comparing outputs.
# A single preference data point
preference_example = {
    "prompt": "Explain quantum entanglement to a 10-year-old",
    "chosen": "Imagine you have two magic coins...",
    "rejected": "Quantum entanglement occurs when particles become correlated..."
}
Collection methods:
- Direct comparison: Show humans two outputs, ask which is better
- Ranking: Show 4-7 outputs, rank from best to worst (rankings are expanded into pairwise comparisons; see the sketch after this list)
- Rating: Score each output 1-5, derive preferences from ratings
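A ranking of K outputs can be expanded into K·(K−1)/2 pairwise comparisons. Below is a minimal sketch of that expansion; ranking_to_pairs is an illustrative helper, not a library function, and the output dicts follow the prompt/chosen/rejected format shown above.

from itertools import combinations

def ranking_to_pairs(prompt, ranked_outputs):
    """Expand a best-to-worst ranking into pairwise preference examples.

    ranked_outputs: responses ordered from best to worst.
    Returns K*(K-1)/2 dicts with "prompt", "chosen", and "rejected" keys.
    """
    pairs = []
    for better, worse in combinations(ranked_outputs, 2):
        # combinations preserves list order, so `better` always outranks `worse`
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs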
Data quality matters enormously:
# High-quality preference
{
    "prompt": "Is the earth flat?",
    "chosen": "No, the earth is roughly spherical...",
    "rejected": "That's an interesting question! Some people think..."
}
# Clear distinction: factual vs evasive
# Low-quality preference
{
    "prompt": "Write a poem about cats",
    "chosen": "Cats are soft and furry...",
    "rejected": "Soft paws upon the floor..."
}
# Arbitrary: both are fine poems
Reward Modeling
The reward model learns to predict human preferences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.backbone = base_model  # Usually same architecture as the policy
        # Assumes a Hugging Face-style config that exposes hidden_size
        self.value_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Get the final-layer hidden states
        outputs = self.backbone(input_ids, attention_mask=attention_mask)
        # Use the last token's hidden state (assumes no right-padding)
        last_hidden = outputs.last_hidden_state[:, -1, :]
        # Predict a scalar reward for the whole sequence
        reward = self.value_head(last_hidden)
        return reward
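For a quick sanity check, the reward model can wrap a small pretrained backbone. This is a usage sketch, assuming the Hugging Face transformers library and the "gpt2" checkpoint; any backbone whose config exposes hidden_size and whose outputs include last_hidden_state works the same way.

from transformers import AutoModel, AutoTokenizer

base = AutoModel.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel(base)

batch = tokenizer("Is the earth flat? No, the earth is roughly spherical...",
                  return_tensors="pt")
score = rm(batch["input_ids"], batch["attention_mask"])  # tensor of shape (1, 1)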
Training objective: Given (prompt, chosen, rejected), the reward model should assign higher reward to chosen.
def reward_loss(rm, tokenizer, prompt, chosen, rejected):
    # Tokenize prompt + response (assumes a Hugging Face-style tokenizer)
    chosen_batch = tokenizer(prompt + chosen, return_tensors="pt")
    rejected_batch = tokenizer(prompt + rejected, return_tensors="pt")
    r_chosen = rm(chosen_batch["input_ids"], chosen_batch["attention_mask"])
    r_rejected = rm(rejected_batch["input_ids"], rejected_batch["attention_mask"])
    # Bradley-Terry loss: maximize log-likelihood that chosen > rejected
    # (logsigmoid is the numerically stable form of log(sigmoid(x)))
    loss = -F.logsigmoid(r_chosen - r_rejected)
    return loss.mean()
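Putting the pieces together, reward-model training can be sketched as below. This is a minimal sketch, not a production recipe: it reuses RewardModel and reward_loss from above and assumes preferences is a list of dicts with "prompt", "chosen", and "rejected" keys, as in the earlier examples.

def train_reward_model_sketch(rm, tokenizer, preferences, epochs=1, lr=1e-5):
    # Minimal sketch: no batching, shuffling, gradient clipping, or eval loop
    optimizer = torch.optim.AdamW(rm.parameters(), lr=lr)
    for _ in range(epochs):
        for ex in preferences:
            loss = reward_loss(rm, tokenizer, ex["prompt"], ex["chosen"], ex["rejected"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return rm

In practice you would batch examples, hold out a validation set, and track preference accuracy (how often the model scores chosen above rejected) rather than loss alone.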
Why Does RLHF Work?
The key insight: decomposition of the learning problem.
Without RLHF: Model must learn everything from text prediction.
With RLHF:
- Pretraining learns language and knowledge
- Reward model learns human preferences
- RL combines them into aligned behavior
# Conceptually:
final_model = pretrained_knowledge + learned_preferences
# The reward model captures nuances like:
# - Helpful but not harmful
# - Honest but not brutal
# - Confident but not overconfident
The RLHF Training Loop
def rlhf_training_step(policy, reward_model, ref_model, prompts, beta):
    # Schematic pseudocode: generate/score/update stand in for a full PPO loop

    # 1. Generate responses from the current policy
    responses = policy.generate(prompts)

    # 2. Score responses with the reward model
    rewards = reward_model(prompts, responses)

    # 3. Compute KL penalty (don't drift too far from the reference model)
    kl_penalty = compute_kl(policy, ref_model, prompts, responses)

    # 4. Combined objective: reward minus KL penalty, weighted by beta
    objective = rewards - beta * kl_penalty

    # 5. Update the policy to maximize the objective
    policy.update(objective)
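The compute_kl call above is schematic. A common approximation, sketched below, assumes you already have the per-token log-probabilities each model assigns to the sampled response and takes their difference summed over the response.

def approx_kl_penalty(policy_logprobs, ref_logprobs):
    """Approximate per-sequence KL penalty from per-token log-probs.

    policy_logprobs, ref_logprobs: tensors of shape (batch, response_len)
    holding the log-probability each model assigns to the sampled tokens.
    """
    # log pi_policy(token) - log pi_ref(token), summed over the response
    return (policy_logprobs - ref_logprobs).sum(dim=-1)

Many implementations fold this penalty into the per-token reward before the PPO update rather than applying it once per sequence.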
Historical Context
2017: Deep Reinforcement Learning from Human Preferences (Christiano et al.)
- RLHF for Atari games
- Showed that human comparisons can guide RL without a hand-coded reward function
2020: Learning to Summarize from Human Feedback (Stiennon et al., OpenAI)
- First major RLHF for language
- Trained on human preferences for summaries
2022: InstructGPT (Ouyang et al., OpenAI)
- ChatGPT's predecessor
- Full RLHF pipeline at scale
2023-Present: RLHF everywhere
- Claude, GPT-4, Gemini, Llama
- Constitutional AI, DPO, RLAIF variations
Capstone Connection
RLHF is a primary driver of sycophancy.
Why? The reward model is trained on human preferences. Humans often prefer:
- Responses that agree with them
- Responses that validate their beliefs
- Responses that are polite rather than corrective
# This preference data teaches sycophancy:
{
    "prompt": "I think the moon landing was faked. What do you think?",
    "chosen": "That's an interesting perspective. There are certainly...",
    "rejected": "The moon landing was real. Here's the evidence..."
}
# Human raters might prefer the diplomatic response!
Milestone 3 connection: Your sycophancy intervention will need to:
- Identify where sycophancy enters the training pipeline
- Design reward signals that penalize agreement with false beliefs
- Balance helpfulness with honesty
This chapter is the theoretical foundation for that intervention.
🎓 Tyla's Exercise
1. RLHF relies on the Bradley-Terry model assumption that preferences are transitive and consistent. Under what conditions might this fail? Give three concrete examples.
2. The reward model is trained on preference data, then used to train the policy. What happens if the reward model has blind spots (behaviors it can't distinguish)? How might this affect the final model?
3. Prove: if human labelers have a systematic bias B in their preferences, the reward model learns those preferences perfectly, and the policy fully optimizes the learned reward, then the final policy will exhibit bias B. What does this imply for alignment?
💻 Aaliyah's Exercise
Implement a minimal preference dataset and reward model:
def create_preference_dataset():
    """
    Create 20 preference pairs for the prompt:
    "User claims something false. How should AI respond?"

    Half should prefer honest correction.
    Half should prefer diplomatic agreement.
    This simulates conflicting labeler preferences.
    """
    pass

def train_reward_model(preferences, base_model):
    """
    1. Initialize a reward model from base_model
    2. Train on preference pairs using the Bradley-Terry loss
    3. Return the trained reward model

    Track: Does the model learn to reward agreement or honesty?
    """
    pass

def evaluate_reward_model(rm, test_cases):
    """
    Test cases where diplomatic agreement is factually wrong.
    Does the RM reward the wrong response?
    """
    pass
📚 Maneesha's Reflection
1. RLHF assumes human preferences are a good training signal. When is this assumption dangerous? What is the educational analogy: teaching to what students like versus teaching to what students need?
2. The "learning to summarize" paper found that raters preferred RLHF-generated summaries over the human-written reference summaries. What does this tell us about using humans to evaluate AI?
3. If an AI company collects preference data from contract workers paid per comparison, what biases might emerge? How would you design a better data collection process?