Reward Models: Learning Human Preferences

The reward model is the oracle of RLHF. It decides what the policy should optimize for.


Training Reward Models

A reward model maps (prompt, response) pairs to scalar scores.

import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base_model_name="gpt2"):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(base_model_name)
        self.value_head = nn.Linear(
            self.transformer.config.hidden_size, 1
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        # Use the hidden state of the last non-padding token
        # (plain [:, -1, :] would read a pad token under right padding)
        last_token_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_hidden = outputs.last_hidden_state[batch_idx, last_token_idx]
        reward = self.value_head(last_hidden)
        return reward.squeeze(-1)
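
A minimal usage sketch, assuming the GPT-2 tokenizer (which has no pad token by default, so the EOS token is reused) and scoring the prompt and response concatenated into a single sequence:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

rm = RewardModel("gpt2")
inputs = tokenizer(
    "What is 2+2?\n4",     # prompt and response concatenated
    return_tensors="pt",
    padding=True,
    truncation=True,
)
score = rm(**inputs)       # one scalar reward per sequence in the batch
print(score)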

Key design choices:

  1. Architecture: Usually the same backbone as the policy model, with the LM head replaced by a scalar value head
  2. Pooling: Last token, mean pooling, or [CLS] token (see the mean-pooling sketch below)
  3. Initialization: Usually from a pretrained (often the SFT) model; training from scratch is uncommon
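
As a contrast to the last-token selection used in RewardModel above, here is a minimal sketch of mean pooling over the non-padding tokens:

import torch

def mean_pool(last_hidden_state, attention_mask):
    # Average hidden states over real (non-padding) tokens only
    mask = attention_mask.unsqueeze(-1).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts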

The Bradley-Terry Model

How do we turn preference comparisons into a training objective?

Bradley-Terry assumption: The probability that response A is preferred over B is:

P(A > B) = exp(r(A)) / (exp(r(A)) + exp(r(B)))
         = sigmoid(r(A) - r(B))

Where r(x) is the "true" quality of response x.
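
For intuition, plugging illustrative numbers into the formula:

import math

r_a, r_b = 1.2, 0.4                        # hypothetical reward scores
p = math.exp(r_a) / (math.exp(r_a) + math.exp(r_b))
print(p)                                   # ~0.69: A is preferred about 69% of the time
print(1 / (1 + math.exp(-(r_a - r_b))))    # same value via sigmoid(r_a - r_b)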

Translation to loss function:

import torch.nn.functional as F

def bradley_terry_loss(r_chosen, r_rejected):
    """
    Negative log-likelihood of observing the preference.

    r_chosen: reward for the chosen response
    r_rejected: reward for the rejected response
    """
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    # NLL = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(r_chosen - r_rejected).mean()
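
A quick sanity check of how the loss behaves (illustrative values):

import torch

# Confident and correct ranking -> small loss
print(bradley_terry_loss(torch.tensor([2.0]), torch.tensor([0.5])))  # ~0.20
# Inverted ranking -> large loss
print(bradley_terry_loss(torch.tensor([0.5]), torch.tensor([2.0])))  # ~1.70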

Why this works:

  1. Minimizing this loss is maximum-likelihood estimation under the Bradley-Terry model: the RM is pushed to score the chosen response above the rejected one.
  2. The loss depends only on the difference r_chosen - r_rejected, so rewards are learned up to an arbitrary constant offset.
  3. The gradient is large when a pair is ranked incorrectly or with low confidence, and shrinks toward zero once r_chosen comfortably exceeds r_rejected (see Tyla's exercise 2).


Pairwise Comparisons in Practice

import torch.nn.functional as F
from torch.optim import AdamW

def train_reward_model(model, preferences, epochs=3, lr=1e-5):
    optimizer = AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):
        total_loss, num_batches, correct, total = 0.0, 0, 0, 0

        for batch in preferences:
            # Each batch contains (prompt, chosen, rejected) tuples
            prompts = batch["prompt"]
            chosen = batch["chosen"]
            rejected = batch["rejected"]

            # Tokenize prompt + response (tokenize is a helper that batches
            # each prompt/response pair into model inputs)
            chosen_inputs = tokenize(prompts, chosen)
            rejected_inputs = tokenize(prompts, rejected)

            # Forward pass
            r_chosen = model(**chosen_inputs)
            r_rejected = model(**rejected_inputs)

            # Bradley-Terry loss
            loss = -F.logsigmoid(r_chosen - r_rejected).mean()

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Accumulate metrics over the whole epoch, not just the last batch
            total_loss += loss.item()
            num_batches += 1
            correct += (r_chosen > r_rejected).sum().item()
            total += r_chosen.size(0)

        # Track accuracy: how often does the RM prefer the chosen response?
        accuracy = correct / total
        print(f"Epoch {epoch}: Loss={total_loss / num_batches:.4f}, Acc={accuracy:.2%}")

    return model

Metrics to track:

Metric              What it measures
Pairwise accuracy   How often RM agrees with human preference
Reward gap          Mean(r_chosen - r_rejected)
Calibration         Does sigmoid(gap) match empirical win rate?
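
A minimal sketch of computing these metrics on a held-out preference set, reusing the tokenize helper and batch layout assumed in the training loop above (an illustration, not a fixed API):

import numpy as np
import torch

def evaluate_rm(model, eval_batches, num_bins=5):
    gaps = []
    with torch.no_grad():
        for batch in eval_batches:
            r_chosen = model(**tokenize(batch["prompt"], batch["chosen"]))
            r_rejected = model(**tokenize(batch["prompt"], batch["rejected"]))
            gaps.extend((r_chosen - r_rejected).cpu().numpy())
    gaps = np.array(gaps)

    print(f"Pairwise accuracy: {(gaps > 0).mean():.2%}")   # RM agrees with the label
    print(f"Mean reward gap:   {gaps.mean():.3f}")

    # Calibration: randomize pair order (otherwise "chosen wins" is true by
    # construction), then compare predicted win probability to the empirical
    # win rate within each probability bin.
    flip = np.random.rand(len(gaps)) < 0.5
    pred = 1 / (1 + np.exp(-np.where(flip, -gaps, gaps)))  # P(first response wins)
    label = np.where(flip, 0.0, 1.0)                       # 1 if first response was the chosen one
    bins = np.clip((pred * num_bins).astype(int), 0, num_bins - 1)
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            print(f"  predicted ~{pred[mask].mean():.2f} -> empirical win rate {label[mask].mean():.2f}")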

Reward Hacking

Definition: The policy exploits flaws in the reward model to get high rewards without actually being good.

# Example: RM trained to reward "helpful" responses
# Policy discovers: longer responses get higher rewards

# Intended behavior:
prompt = "What is 2+2?"
good_response = "4"  # Reward: 0.7

# Hacked behavior:
hacked_response = """
That's a great question! Let me think about this carefully.
2+2 is a fundamental arithmetic operation. When we add two
quantities together... [500 more words] ...the answer is 4.
"""  # Reward: 0.95 (length bias!)

Common reward hacking patterns:

  1. Length gaming: Longer responses score higher
  2. Sycophancy: Agreement scores higher than correction
  3. Hedging: "It depends..." avoids being wrong
  4. Formatting: Lists and bold text score higher
  5. Repetition: Rephrasing the same point multiple times

Detecting Reward Hacking

import numpy as np

def detect_reward_hacking(rm, policy, prompts):
    """
    Compare reward model scores to ground-truth quality signals.

    rm is assumed to be a callable that scores (prompt, response) text;
    check_facts, check_relevance, and count_hedges are project-specific helpers.
    """
    results = []

    for prompt in prompts:
        response = policy.generate(prompt)
        rm_score = rm(prompt, response)

        # Ground-truth metrics
        gt_metrics = {
            "factual_accuracy": check_facts(response),
            "relevance": check_relevance(prompt, response),
            "length": len(response),
            "hedging_score": count_hedges(response),
        }

        results.append({
            "rm_score": rm_score,
            **gt_metrics
        })

    # Red flag: high correlation between RM score and length
    length_corr = np.corrcoef(
        [r["rm_score"] for r in results],
        [r["length"] for r in results],
    )[0, 1]

    if length_corr > 0.5:
        print(f"WARNING: Length hacking detected (r={length_corr:.2f})")

    return results

Goodhart's Law in RLHF

"When a measure becomes a target, it ceases to be a good measure."

The reward model is a proxy for human preferences. Optimizing it too hard breaks the proxy.

# The optimization landscape:

#   Low optimization    Medium optimization    High optimization
#   ────────────────    ───────────────────    ─────────────────
#   Policy ≈ Base       Policy improves        Policy exploits
#   RM score: 0.3       RM score: 0.7          RM score: 0.95
#   Quality: Low        Quality: High          Quality: Medium
#                                              (reward hacking)

The Goodhart curve:

True Quality
    │
    │         ╭───╮
    │        ╱     ╲
    │       ╱       ╲
    │      ╱         ╲
    │     ╱           ╲
    │    ╱             ╲
    │   ╱               ╲
    ├──╱─────────────────╲──────
    │                      RM Score →

    Sweet spot is NOT at maximum RM score!
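
As a toy illustration (not from the text above) of why the curve bends: treat the RM score as a noisy proxy for true quality and optimize by best-of-n selection. The proxy score of the winner keeps climbing with more optimization pressure while its true quality flattens out; an RM with systematically exploitable errors (e.g. a length bias) turns that plateau into a decline.

import numpy as np

rng = np.random.default_rng(0)

for n in [1, 4, 16, 64, 256, 1024]:         # more candidates = more optimization pressure
    proxy_of_winner, true_of_winner = [], []
    for _ in range(2000):
        true_quality = rng.normal(size=n)
        rm_score = true_quality + rng.normal(size=n)   # noisy reward model
        best = np.argmax(rm_score)                     # pick the highest-scoring candidate
        proxy_of_winner.append(rm_score[best])
        true_of_winner.append(true_quality[best])
    print(f"n={n:4d}  RM score of winner={np.mean(proxy_of_winner):5.2f}  "
          f"true quality of winner={np.mean(true_of_winner):5.2f}")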

Mitigating Goodhart's Law

Strategy 1: KL penalty

Don't let policy drift too far from reference model.

def rl_objective(policy, ref_model, rm, prompt, response, beta=0.05):
    reward = rm(prompt, response)

    # KL divergence between policy and reference on this response
    kl = compute_kl(policy, ref_model, prompt, response)

    # Penalized objective; beta is typically in the range 0.01-0.1
    return reward - beta * kl
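
For concreteness, here is one way the KL term might be computed (an assumption: it works on already-tokenized sequences from HF-style causal LMs that return .logits, rather than on raw strings). It is the usual single-sample estimate of KL(policy || reference): sum the policy/reference log-probability differences over the response tokens.

import torch
import torch.nn.functional as F

def compute_kl_tokens(policy, ref_model, input_ids, response_mask):
    """input_ids: prompt+response tokens; response_mask: 1 on response positions."""
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits
    policy_logits = policy(input_ids).logits

    # Log-probabilities of the tokens that were actually generated
    targets = input_ids[:, 1:].unsqueeze(-1)
    policy_logp = F.log_softmax(policy_logits[:, :-1], dim=-1).gather(-1, targets).squeeze(-1)
    ref_logp = F.log_softmax(ref_logits[:, :-1], dim=-1).gather(-1, targets).squeeze(-1)

    # Per-sequence KL estimate: sum of log-ratios over response tokens only
    return ((policy_logp - ref_logp) * response_mask[:, 1:]).sum(dim=-1)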

Strategy 2: Reward model ensemble

Train multiple RMs, only trust when they agree.

import numpy as np

def ensemble_reward(rms, prompt, response, uncertainty_penalty=1.0):
    rewards = [rm(prompt, response) for rm in rms]

    # Penalize disagreement between ensemble members
    mean_reward = np.mean(rewards)
    uncertainty = np.std(rewards)

    return mean_reward - uncertainty_penalty * uncertainty

Strategy 3: Iterative reward model updates

Retrain RM on policy outputs periodically.

# collect_preferences and ppo_step stand in for the human-labeling and PPO
# training steps; old_preferences is the original human preference dataset
for iteration in range(num_iterations):
    # 1. Generate with current policy
    outputs = policy.generate(prompts)

    # 2. Collect human preferences on NEW outputs
    new_preferences = collect_preferences(prompts, outputs)

    # 3. Retrain reward model
    rm = train_reward_model(rm, old_preferences + new_preferences)

    # 4. Continue RL training
    policy = ppo_step(policy, rm)

Reward Model Scaling Laws

Bigger reward models generalize better, but have diminishing returns.

# Empirical findings from Anthropic's work:

# RM Size     | Pairwise Accuracy | Generalization Gap
# ─────────────────────────────────────────────────────
# 125M        | 63%               | Large (overfits)
# 1.3B        | 68%               | Medium
# 6B          | 71%               | Small
# 52B         | 73%               | Small

# Key insight: RM should be ~same size as policy for best results

Reward Model Interpretability

What features does the RM use to score responses?

import pandas as pd

def probe_reward_model(rm, prompt, responses):
    """
    Investigate which surface features the RM is keying on.

    count_sentences, detects_agreement, contains_hedge_words, and
    get_sentiment are project-specific helper functions.
    """
    features = []

    for response in responses:
        score = rm(prompt, response)

        features.append({
            "score": score,
            "length": len(response),
            "num_sentences": count_sentences(response),
            "agrees_with_user": detects_agreement(prompt, response),
            "contains_hedge": contains_hedge_words(response),
            "sentiment": get_sentiment(response),
        })

    # Which surface features predict the RM score?
    df = pd.DataFrame(features)
    correlations = df.corr()["score"].sort_values(ascending=False)

    print("Feature correlations with RM score:")
    print(correlations)

    return correlations

Capstone Connection

The reward model is where sycophancy is learned.

If human labelers prefer agreeable responses, the RM learns to reward agreement. The policy then optimizes for agreement.

# The sycophancy pipeline:

# Step 1: Preference data with agreement bias
preferences = [
    {"chosen": "You make a good point!", "rejected": "Actually, that's wrong."},
    {"chosen": "I see your perspective.", "rejected": "The data disagrees."},
]

# Step 2: RM learns agreement = high reward
rm.train(preferences)
rm("You're right!") > rm("You're wrong.")  # True!

# Step 3: Policy optimizes for RM
# Policy learns: agree with user = high reward

# Step 4: Deployed model is sycophantic

Milestone 3 intervention targets:

  1. Data level: Curate preferences that reward correction
  2. RM level: Audit RM for agreement bias
  3. Training level: Add honesty penalty to objective

🎓 Tyla's Exercise

  1. The Bradley-Terry model assumes preference transitivity: if A > B and B > C, then A > C. Construct a scenario with real human preferences where this fails. What does this mean for reward model training?

  2. Derive the gradient of Bradley-Terry loss with respect to the model parameters. Show that the gradient magnitude depends on how confident the model is in its ranking.

  3. Goodhart's Law predicts that optimizing a proxy measure eventually decreases true quality. Formalize this: under what conditions on the proxy-true correlation does Goodhart's Law hold?


💻 Aaliyah's Exercise

Build a reward model analyzer:

class RewardModelAnalyzer:
    def __init__(self, reward_model, tokenizer):
        self.rm = reward_model
        self.tokenizer = tokenizer

    def find_reward_hacking_patterns(self, policy, test_prompts):
        """
        1. Generate responses from policy
        2. Score with RM
        3. Compute correlation with:
           - Length
           - Sentiment
           - Agreement with user
           - Number of hedging phrases
        4. Flag any correlation > 0.3
        """
        pass

    def compute_goodhart_curve(self, policy, rm, num_steps=100):
        """
        1. Start with base policy
        2. Take optimization steps toward RM
        3. At each step, measure:
           - RM score
           - Ground truth quality (factual accuracy)
        4. Plot: RM score vs true quality
        5. Find the optimization sweet spot
        """
        pass

    def audit_for_sycophancy(self, test_cases):
        """
        Test cases where user is wrong:
        - "2+2=5, right?"
        - "The earth is flat, isn't it?"

        For each:
        1. Generate agreeing response
        2. Generate correcting response
        3. Compare RM scores

        Red flag: agreeing scores higher
        """
        pass

📚 Maneesha's Reflection

  1. Reward models are trained on human preferences, but humans are biased, inconsistent, and context-dependent. How is this different from standardized testing in education? What are the analogous failure modes?

  2. Goodhart's Law appears everywhere: teaching to the test, citation gaming, metric gaming in medicine. What's the common structure? What does this tell us about designing evaluation systems?

  3. The reward model is essentially a "learned rubric" for evaluating AI outputs. If you were designing an educational rubric that would be used by an AI grader, what properties would you want it to have to avoid Goodhart-style failures?