RLHF: Aligning Language Models with Human Preferences

Supervised learning teaches models what to say. RLHF teaches them how to say it.


What Is RLHF?

Reinforcement Learning from Human Feedback is a training paradigm that optimizes language models to produce outputs humans prefer.

The key insight: humans struggle to write perfect outputs, but they excel at comparing outputs. RLHF exploits this asymmetry.

# Traditional supervised learning (conceptual pseudocode, not a real API):
# "Here's the correct answer. Learn it."
model.train(input="What is 2+2?", output="4")

# RLHF:
# "Here are two answers. This one is better."
model.train(
    input="How do I apologize?",
    preferred="I understand I hurt you. I'm genuinely sorry.",
    rejected="Sorry I guess."
)

Why RLHF for Language Models?

Three problems with supervised fine-tuning alone:

Problem 1: Specification is hard

How do you write the "correct" response to "Tell me a joke"? There are countless valid answers, so writing an exhaustive supervised dataset for open-ended tasks is impractical.

Problem 2: Imitation has limits

Supervised learning imitates human text. But the people who wrote that text don't write the way they want an AI assistant to write: we want a model that's consistently helpful, not one that sounds like random internet comments.

Problem 3: Preference is comparative

"Is this response good?" is hard. "Is response A better than response B?" is tractable.


The RLHF Pipeline Overview

Three stages, each with distinct training:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Stage 1:      │     │   Stage 2:      │     │   Stage 3:      │
│   Supervised    │ ──► │   Reward Model  │ ──► │   RL Fine-tune  │
│   Fine-tuning   │     │   Training      │     │   with PPO      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
     ▲                        ▲                        ▲
     │                        │                        │
 Human-written            Preference               Reward signal
 demonstrations           comparisons              from RM

Stage 1: Start with a pretrained LLM. Fine-tune on high-quality demonstrations.

Stage 2: Collect preference data (A vs B comparisons). Train a reward model to predict preferences.

Stage 3: Use the reward model to provide learning signal. Optimize the LLM with reinforcement learning.


Human Preference Data

The foundation of RLHF: humans comparing outputs.

# A single preference data point
preference_example = {
    "prompt": "Explain quantum entanglement to a 10-year-old",
    "chosen": "Imagine you have two magic coins...",
    "rejected": "Quantum entanglement occurs when particles become correlated..."
}

Collection methods:

  1. Direct comparison: Show humans two outputs, ask which is better
  2. Ranking: Show 4-7 outputs, rank from best to worst
  3. Rating: Score each output 1-5, derive preferences from ratings (see the sketch below)
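
For method 3, here is a minimal sketch of turning per-output ratings into pairwise preferences; the field names mirror the example above, and ties are dropped because they carry no preference signal.

from itertools import combinations

def ratings_to_preferences(prompt, rated_outputs):
    """Derive (chosen, rejected) pairs from 1-5 ratings.

    rated_outputs: list of (text, rating) tuples for the same prompt.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(rated_outputs, 2):
        if score_a == score_b:
            continue  # tie: no clear preference, skip
        chosen, rejected = (text_a, text_b) if score_a > score_b else (text_b, text_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs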

Data quality matters enormously:

# High-quality preference
{
    "prompt": "Is the earth flat?",
    "chosen": "No, the earth is roughly spherical...",
    "rejected": "That's an interesting question! Some people think..."
}
# Clear distinction: factual vs evasive

# Low-quality preference
{
    "prompt": "Write a poem about cats",
    "chosen": "Cats are soft and furry...",
    "rejected": "Soft paws upon the floor..."
}
# Arbitrary: both are fine poems

Reward Modeling

The reward model learns to predict human preferences.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.backbone = base_model  # Usually the same architecture as the policy
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Hidden states for every token in the sequence
        outputs = self.backbone(input_ids, attention_mask=attention_mask)

        # Take the hidden state of the last non-padding token in each sequence
        last_token = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_hidden = outputs.last_hidden_state[batch_idx, last_token]

        # Predict a scalar reward
        reward = self.value_head(last_hidden)
        return reward
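
A usage sketch, assuming a Hugging Face transformers backbone; the gpt2 checkpoint is just an example, and any backbone that returns last_hidden_state works the same way.

from transformers import AutoModel, AutoTokenizer

backbone = AutoModel.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel(backbone, hidden_size=backbone.config.hidden_size)

batch = tokenizer("Prompt text... candidate response...", return_tensors="pt")
score = rm(batch["input_ids"], batch["attention_mask"])  # tensor of shape (1, 1)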

Training objective: Given (prompt, chosen, rejected), the reward model should assign higher reward to chosen.

def reward_loss(rm, prompt, chosen, rejected):
    # "prompt + chosen" stands for the tokenized concatenation of prompt and
    # response; string-level concatenation is shown here for readability.
    r_chosen = rm(prompt + chosen)
    r_rejected = rm(prompt + rejected)

    # Bradley-Terry loss: models P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)
    # and minimizes the negative log-likelihood of the human label.
    loss = -torch.log(torch.sigmoid(r_chosen - r_rejected))
    return loss.mean()
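
A minimal training-loop sketch built on reward_loss; it assumes each element of preference_batches is already tokenized into whatever form the reward model accepts, and the name fit_reward_model is arbitrary.

import torch

def fit_reward_model(rm, preference_batches, epochs=1, lr=1e-5):
    # preference_batches yields (prompt, chosen, rejected) triples in the
    # (tokenized) form that rm expects; data loading is elided here.
    optimizer = torch.optim.AdamW(rm.parameters(), lr=lr)
    for _ in range(epochs):
        for prompt, chosen, rejected in preference_batches:
            loss = reward_loss(rm, prompt, chosen, rejected)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return rm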

Why Does RLHF Work?

The key insight: decomposition of the learning problem.

Without RLHF: Model must learn everything from text prediction.

With RLHF:

# Conceptually:
final_model = pretrained_knowledge + learned_preferences

# The reward model captures nuances like:
# - Helpful but not harmful
# - Honest but not brutal
# - Confident but not overconfident

The RLHF Training Loop

def rlhf_training_step(policy, reward_model, ref_model, prompts, beta=0.1):
    # 1. Generate responses from the current policy
    responses = policy.generate(prompts)

    # 2. Score responses with the reward model
    rewards = reward_model(prompts, responses)

    # 3. Compute KL penalty (don't drift too far from the reference model)
    kl_penalty = compute_kl(policy, ref_model, prompts, responses)

    # 4. Combined objective: reward minus KL penalty, weighted by beta
    objective = rewards - beta * kl_penalty

    # 5. Update the policy to maximize the objective (e.g. a PPO step)
    policy.update(objective)
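
The KL term in step 3 is commonly estimated per token from the log-probabilities each model assigns to the sampled response. The sketch below takes those log-probs as inputs rather than the raw models, so its signature differs from the pseudocode above.

import torch

def kl_penalty_from_logprobs(policy_logprobs, ref_logprobs, response_mask):
    """Single-sample KL estimate per sequence.

    policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the sampled
    tokens under the policy and the frozen reference model.
    response_mask: 1.0 for response tokens, 0.0 for prompt/padding positions.
    """
    token_kl = (policy_logprobs - ref_logprobs) * response_mask
    return token_kl.sum(dim=-1)  # one KL estimate per sequence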

Historical Context

2017: Deep RL from Human Preferences (Christiano et al.) shows that preference comparisons over trajectory clips can train RL agents without a hand-written reward function.

2020: Learning to Summarize from Human Feedback (Stiennon et al., OpenAI) applies the same recipe to language models for summarization.

2022: InstructGPT (Ouyang et al., OpenAI) establishes the three-stage SFT, reward model, PPO pipeline described above.

2023-Present: RLHF and close variants become standard practice for aligning deployed chat models.


Capstone Connection

RLHF is a primary driver of sycophancy.

Why? The reward model is trained on human preferences, and human raters often prefer responses that validate them over responses that correct them:

# This preference data teaches sycophancy:
{
    "prompt": "I think the moon landing was faked. What do you think?",
    "chosen": "That's an interesting perspective. There are certainly...",
    "rejected": "The moon landing was real. Here's the evidence..."
}
# Human raters might prefer the diplomatic response!

Milestone 3 connection: Your sycophancy intervention will need to:

  1. Identify where sycophancy enters the training pipeline
  2. Design reward signals that penalize agreement with false beliefs (see the sketch after this list)
  3. Balance helpfulness with honesty
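
For item 2, one conceptual starting point is to shape the reward the policy sees; the honesty check and penalty weight below are placeholders you would need to design and validate yourself.

def shaped_reward(rm_score, agrees_with_false_claim, penalty=1.0):
    # rm_score: scalar from the learned reward model
    # agrees_with_false_claim: bool from an honesty check you define
    # (e.g. comparison against a fact-checked claim set)
    return rm_score - (penalty if agrees_with_false_claim else 0.0)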

This chapter is the theoretical foundation for that intervention.


🎓 Tyla's Exercise

  1. RLHF relies on the Bradley-Terry model assumption that preferences are transitive and consistent. Under what conditions might this fail? Give three concrete examples.

  2. The reward model is trained on preference data, then used to train the policy. What happens if the reward model has blind spots (behaviors it can't distinguish)? How might this affect the final model?

  3. Prove: If human labelers have a bias B in their preferences, and the reward model perfectly learns their preferences, the final policy will exhibit bias B. What does this imply for alignment?


💻 Aaliyah's Exercise

Implement a minimal preference dataset and reward model:

def create_preference_dataset():
    """
    Create 20 preference pairs for the prompt:
    "User claims something false. How should AI respond?"

    Half should prefer honest correction.
    Half should prefer diplomatic agreement.

    This simulates conflicting labeler preferences.
    """
    pass

def train_reward_model(preferences, base_model):
    """
    1. Initialize reward model from base_model
    2. Train on preference pairs using Bradley-Terry loss
    3. Return trained reward model

    Track: Does the model learn to reward agreement or honesty?
    """
    pass

def evaluate_reward_model(rm, test_cases):
    """
    Test cases where diplomatic agreement is factually wrong.
    Does RM reward the wrong response?
    """
    pass

📚 Maneesha's Reflection

  1. RLHF assumes human preferences are a good training signal. When is this assumption dangerous? What's the educational analogy of teaching to what students like vs what students need?

  2. The "learning to summarize" paper found that RLHF models were rated higher than human-written summaries. What does this tell us about using humans to evaluate AI?

  3. If an AI company collects preference data from contract workers paid per comparison, what biases might emerge? How would you design a better data collection process?