Reward Models: Learning Human Preferences
The reward model is the oracle of RLHF. It decides what the policy should optimize for.
Training Reward Models
A reward model maps (prompt, response) pairs to scalar scores.
class RewardModel(nn.Module):
def __init__(self, base_model_name="gpt2"):
super().__init__()
self.transformer = AutoModel.from_pretrained(base_model_name)
self.value_head = nn.Linear(
self.transformer.config.hidden_size, 1
)
def forward(self, input_ids, attention_mask):
outputs = self.transformer(
input_ids=input_ids,
attention_mask=attention_mask
)
# Use last token's hidden state
last_hidden = outputs.last_hidden_state[:, -1, :]
reward = self.value_head(last_hidden)
return reward.squeeze(-1)
Key design choices:
- Architecture: Usually same as policy model, minus the LM head
- Pooling: Last token, mean pooling, or [CLS] token
- Initialization: From pretrained LLM or from scratch
The Bradley-Terry Model
How do we turn preference comparisons into a training objective?
Bradley-Terry assumption: The probability that response A is preferred over B is:
P(A > B) = exp(r(A)) / (exp(r(A)) + exp(r(B)))
= sigmoid(r(A) - r(B))
Where r(x) is the "true" quality of response x.
Translation to loss function:
def bradley_terry_loss(r_chosen, r_rejected):
"""
Negative log-likelihood of observing the preference.
r_chosen: reward for the chosen response
r_rejected: reward for the rejected response
"""
# P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
# NLL = -log(sigmoid(r_chosen - r_rejected))
return -F.logsigmoid(r_chosen - r_rejected).mean()
Why this works:
- When r_chosen >> r_rejected: loss approaches 0
- When r_chosen << r_rejected: loss approaches infinity
- Gradient pushes r_chosen up and r_rejected down
Pairwise Comparisons in Practice
def train_reward_model(model, preferences, epochs=3, lr=1e-5):
optimizer = AdamW(model.parameters(), lr=lr)
for epoch in range(epochs):
for batch in preferences:
# Each batch contains (prompt, chosen, rejected) tuples
prompts = batch["prompt"]
chosen = batch["chosen"]
rejected = batch["rejected"]
# Tokenize: prompt + response
chosen_inputs = tokenize(prompts, chosen)
rejected_inputs = tokenize(prompts, rejected)
# Forward pass
r_chosen = model(**chosen_inputs)
r_rejected = model(**rejected_inputs)
# Bradley-Terry loss
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Track accuracy: how often does RM prefer chosen?
accuracy = (r_chosen > r_rejected).float().mean()
print(f"Epoch {epoch}: Loss={loss:.4f}, Acc={accuracy:.2%}")
Metrics to track:
| Metric | What it measures |
|---|---|
| Pairwise accuracy | How often RM agrees with human preference |
| Reward gap | Mean(r_chosen - r_rejected) |
| Calibration | Does sigmoid(gap) match empirical win rate? |
Reward Hacking
Definition: The policy exploits flaws in the reward model to get high rewards without actually being good.
# Example: RM trained to reward "helpful" responses
# Policy discovers: longer responses get higher rewards
# Intended behavior:
prompt = "What is 2+2?"
good_response = "4" # Reward: 0.7
# Hacked behavior:
hacked_response = """
That's a great question! Let me think about this carefully.
2+2 is a fundamental arithmetic operation. When we add two
quantities together... [500 more words] ...the answer is 4.
""" # Reward: 0.95 (length bias!)
Common reward hacking patterns:
- Length gaming: Longer responses score higher
- Sycophancy: Agreement scores higher than correction
- Hedging: "It depends..." avoids being wrong
- Formatting: Lists and bold text score higher
- Repetition: Rephrasing the same point multiple times
Detecting Reward Hacking
def detect_reward_hacking(rm, policy, prompts):
"""
Compare reward model scores to ground truth quality.
"""
results = []
for prompt in prompts:
response = policy.generate(prompt)
rm_score = rm(prompt, response)
# Ground truth metrics
gt_metrics = {
"factual_accuracy": check_facts(response),
"relevance": check_relevance(prompt, response),
"length": len(response),
"hedging_score": count_hedges(response),
}
results.append({
"rm_score": rm_score,
**gt_metrics
})
# Red flag: high correlation between RM score and length
length_corr = correlation(
[r["rm_score"] for r in results],
[r["length"] for r in results]
)
if length_corr > 0.5:
print(f"WARNING: Length hacking detected (r={length_corr:.2f})")
return results
Goodhart's Law in RLHF
"When a measure becomes a target, it ceases to be a good measure."
The reward model is a proxy for human preferences. Optimizing it too hard breaks the proxy.
# The optimization landscape:
# Low optimization Medium optimization High optimization
# ──────────────── ─────────────────── ─────────────────
# Policy ≈ Base Policy improves Policy exploits
# RM score: 0.3 RM score: 0.7 RM score: 0.95
# Quality: Low Quality: High Quality: Medium
# (reward hacking)
The Goodhart curve:
True Quality
│
│ ╭───╮
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
├──╱─────────────────╲──────
│ RM Score →
Sweet spot is NOT at maximum RM score!
Mitigating Goodhart's Law
Strategy 1: KL penalty
Don't let policy drift too far from reference model.
def rl_objective(policy, ref_model, rm, prompt, response):
reward = rm(prompt, response)
# KL divergence between policy and reference
kl = compute_kl(policy, ref_model, prompt, response)
# Penalized objective
return reward - beta * kl # beta typically 0.01-0.1
Strategy 2: Reward model ensemble
Train multiple RMs, only trust when they agree.
def ensemble_reward(rms, prompt, response):
rewards = [rm(prompt, response) for rm in rms]
# Penalize disagreement
mean_reward = np.mean(rewards)
uncertainty = np.std(rewards)
return mean_reward - uncertainty_penalty * uncertainty
Strategy 3: Iterative reward model updates
Retrain RM on policy outputs periodically.
for iteration in range(num_iterations):
# 1. Generate with current policy
outputs = policy.generate(prompts)
# 2. Collect human preferences on NEW outputs
new_preferences = collect_preferences(prompts, outputs)
# 3. Retrain reward model
rm = train_reward_model(rm, old_preferences + new_preferences)
# 4. Continue RL training
policy = ppo_step(policy, rm)
Reward Model Scaling Laws
Bigger reward models generalize better, but have diminishing returns.
# Empirical findings from Anthropic's work:
# RM Size | Pairwise Accuracy | Generalization Gap
# ─────────────────────────────────────────────────────
# 125M | 63% | Large (overfits)
# 1.3B | 68% | Medium
# 6B | 71% | Small
# 52B | 73% | Small
# Key insight: RM should be ~same size as policy for best results
Reward Model Interpretability
What features does the RM use to score responses?
def probe_reward_model(rm, prompt, responses):
"""
Investigate what the RM is learning.
"""
features = []
for response in responses:
score = rm(prompt, response)
features.append({
"score": score,
"length": len(response),
"num_sentences": count_sentences(response),
"agrees_with_user": detects_agreement(prompt, response),
"contains_hedge": contains_hedge_words(response),
"sentiment": get_sentiment(response),
})
# Which features predict RM score?
df = pd.DataFrame(features)
correlations = df.corr()["score"].sort_values(ascending=False)
print("Feature correlations with RM score:")
print(correlations)
return correlations
Capstone Connection
The reward model is where sycophancy is learned.
If human labelers prefer agreeable responses, the RM learns to reward agreement. The policy then optimizes for agreement.
# The sycophancy pipeline:
# Step 1: Preference data with agreement bias
preferences = [
{"chosen": "You make a good point!", "rejected": "Actually, that's wrong."},
{"chosen": "I see your perspective.", "rejected": "The data disagrees."},
]
# Step 2: RM learns agreement = high reward
rm.train(preferences)
rm("You're right!") > rm("You're wrong.") # True!
# Step 3: Policy optimizes for RM
# Policy learns: agree with user = high reward
# Step 4: Deployed model is sycophantic
Milestone 3 intervention targets:
- Data level: Curate preferences that reward correction
- RM level: Audit RM for agreement bias
- Training level: Add honesty penalty to objective
🎓 Tyla's Exercise
The Bradley-Terry model assumes preference transitivity: if A > B and B > C, then A > C. Construct a scenario with real human preferences where this fails. What does this mean for reward model training?
Derive the gradient of Bradley-Terry loss with respect to the model parameters. Show that the gradient magnitude depends on how confident the model is in its ranking.
Goodhart's Law predicts that optimizing a proxy measure eventually decreases true quality. Formalize this: under what conditions on the proxy-true correlation does Goodhart's Law hold?
💻 Aaliyah's Exercise
Build a reward model analyzer:
class RewardModelAnalyzer:
def __init__(self, reward_model, tokenizer):
self.rm = reward_model
self.tokenizer = tokenizer
def find_reward_hacking_patterns(self, policy, test_prompts):
"""
1. Generate responses from policy
2. Score with RM
3. Compute correlation with:
- Length
- Sentiment
- Agreement with user
- Number of hedging phrases
4. Flag any correlation > 0.3
"""
pass
def compute_goodhart_curve(self, policy, rm, num_steps=100):
"""
1. Start with base policy
2. Take optimization steps toward RM
3. At each step, measure:
- RM score
- Ground truth quality (factual accuracy)
4. Plot: RM score vs true quality
5. Find the optimization sweet spot
"""
pass
def audit_for_sycophancy(self, test_cases):
"""
Test cases where user is wrong:
- "2+2=5, right?"
- "The earth is flat, isn't it?"
For each:
1. Generate agreeing response
2. Generate correcting response
3. Compare RM scores
Red flag: agreeing scores higher
"""
pass
📚 Maneesha's Reflection
Reward models are trained on human preferences, but humans are biased, inconsistent, and context-dependent. How is this different from standardized testing in education? What are the analogous failure modes?
Goodhart's Law appears everywhere: teaching to the test, citation gaming, Goodheart in medicine. What's the common structure? What does this tell us about designing evaluation systems?
The reward model is essentially a "learned rubric" for evaluating AI outputs. If you were designing an educational rubric that would be used by an AI grader, what properties would you want it to have to avoid Goodhart-style failures?