Dataset Generation for Evals: Introduction
The quality of your evaluation is bounded by the quality of your dataset. Garbage in, garbage out—but for safety-critical AI systems.
Why Generate Eval Datasets?
Standard benchmarks measure standard capabilities. But you need to measure:
- Specific failure modes — Does your model sycophantically agree with wrong users?
- Edge cases — What happens when the user is confidently incorrect?
- Domain-specific behaviors — Does the coding assistant mention security implications?
Existing benchmarks → "Is this model smart?"
Custom eval datasets → "Does this model fail in ways that matter to us?"
You can't evaluate sycophancy with MMLU. You need targeted data.
Types of Eval Datasets
| Type | Purpose | Example |
|---|---|---|
| MCQ Benchmarks | Quick, scalable behavior measurement | "User claims X (wrong). Does model agree?" |
| Free-form Response | Nuanced behavior analysis | "Explain your reasoning to a user who believes X" |
| Dialogue Scenarios | Multi-turn dynamics | 5-turn conversation where user doubles down |
| Agent Trajectories | Tool use and planning | Coding agent given ambiguous requirements |
| Preference Pairs | Training signal for RLHF | (Sycophantic response, Honest response) |
Each type serves different evaluation needs:
eval_dataset_types = {
"mcq": {
"scalability": "high",
"nuance": "low",
"automation": "easy",
"use_case": "regression testing"
},
"freeform": {
"scalability": "medium",
"nuance": "high",
"automation": "requires LLM judge",
"use_case": "capability assessment"
},
"dialogue": {
"scalability": "low",
"nuance": "very high",
"automation": "complex",
"use_case": "behavioral dynamics"
}
}
Human vs Synthetic Data
Human-generated data:
+ Authentic edge cases humans actually encounter
+ Captures subtle linguistic patterns
+ Ground truth labels are reliable
- Expensive ($0.50-$5+ per item)
- Slow (days to weeks)
- Limited scale (hundreds to low thousands)
LLM-generated (synthetic) data:
+ Cheap (~$0.001 per item)
+ Fast (thousands per hour)
+ Infinite scale
- May not capture real distribution
- Systematic blind spots
- Labels require validation
The practical answer: hybrid approaches.
def hybrid_dataset_generation():
"""
1. Generate synthetic data at scale
2. Filter with heuristics and LLM validation
3. Human review of subset (10-20%)
4. Use human feedback to improve generation
5. Iterate
"""
synthetic = generate_synthetic_samples(n=10000)
filtered = llm_filter(synthetic, threshold=0.8) # ~3000 remain
human_reviewed = sample_for_review(filtered, n=500)
final_dataset = apply_human_corrections(filtered, human_reviewed)
return final_dataset
Quality vs Quantity
More data isn't always better. Consider:
100 high-quality items:
- Each carefully constructed to test specific behavior
- Clear ground truth labels
- Diverse coverage of failure modes
10,000 low-quality items:
- Many near-duplicates
- Ambiguous labels
- Clustered in easy regions of behavior space
def dataset_quality_metrics(dataset):
return {
"diversity": measure_embedding_spread(dataset),
"label_agreement": inter_annotator_agreement(dataset),
"difficulty_distribution": plot_item_difficulty(dataset),
"coverage": behavior_coverage_map(dataset)
}
The quality hierarchy:
1. Construct validity — Does the item actually test what you think?
2. Label accuracy — Is the ground truth actually true?
3. Diversity — Does dataset cover the behavior space?
4. Scale — Is there enough statistical power?
Scale matters only after quality is established.
Data Contamination
The silent killer of eval validity.
Contamination happens when:
- Your test items appear in the model's training data
- Items are derived from public sources the model saw
- Patterns in your dataset match training distribution too closely
# Signs of contamination
def detect_contamination_signals(model, eval_items):
signals = []
for item in eval_items:
# 1. Suspiciously high confidence
if model.confidence(item) > 0.99:
signals.append(("high_confidence", item))
# 2. Verbatim completion
if model.completes_exactly(item.stem, item.answer):
signals.append(("verbatim", item))
# 3. Performance inconsistent with difficulty
if item.expected_difficulty == "hard" and model.correct(item):
if model.incorrect(similar_novel_item):
signals.append(("difficulty_mismatch", item))
return signals
Mitigation strategies:
- Temporal splits — Create items after the model's training cutoff
- Novel scenarios — Use situations that couldn't exist in training data
- Held-out transformations — Systematically modify public items
- Canary tokens — Include detectably unique strings
The Sycophancy Dataset Challenge
For your capstone, you need eval data that tests sycophancy across levels:
sycophancy_dataset_requirements = {
"level_1_chatbot": {
"scenario": "User states false belief confidently",
"measure": "Does model agree or correct?",
"difficulty": "Easy to construct, well-studied"
},
"level_2_coding": {
"scenario": "User requests feature with security implications",
"measure": "Does model mention risks?",
"difficulty": "Requires domain expertise"
},
"level_3_research": {
"scenario": "User presents cherry-picked data",
"measure": "Does model identify problems?",
"difficulty": "Requires subtle scenario design"
},
"level_4_institutional": {
"scenario": "System optimizes for operator's stated goals",
"measure": "Does system question ethics?",
"difficulty": "Very hard to operationalize"
}
}
The harder levels require more sophisticated dataset generation.
Capstone Connection
Your sycophancy evaluation suite needs targeted datasets:
For Milestone 4, you must create:
MCQ Benchmark (100+ items)
- Level 1: Factual disagreement scenarios
- Level 2: Coding scenarios with hidden risks
- Level 3: Research scenarios with statistical traps
Multi-turn Dialogues (20+ scenarios)
- User doubles down on wrong belief
- User applies social pressure
- User claims authority
Agent Trajectories (10+ scenarios)
- Coding agent with ambiguous security requirements
- Research agent with convenient but spurious correlations
This chapter introduces the landscape. The next two chapters cover LLM generation techniques and quality control.
🎓 Tyla's Exercise
Explain the difference between construct validity and face validity for an eval dataset. Which matters more for measuring sycophancy? Why?
Derive an estimate: If you need 95% confidence that your model's sycophancy rate differs from 50% by at least 10 percentage points, how many eval items do you need? (Hint: power analysis)
Why might contamination be harder to detect for behavioral evals (like sycophancy) than for capability evals (like math)? What makes sycophancy evaluation particularly vulnerable?
💻 Aaliyah's Exercise
Build the scaffolding for a sycophancy eval dataset:
from dataclasses import dataclass
from enum import Enum
class SycophancyLevel(Enum):
CHATBOT = 1
CODING = 2
RESEARCH = 3
INSTITUTIONAL = 4
@dataclass
class SycophancyItem:
"""
Complete this dataclass with:
- Unique ID
- Level (from enum)
- Context/scenario description
- User statement (with incorrect belief)
- Expected honest response criteria
- Expected sycophantic response criteria
- Difficulty rating
- Source (human/synthetic)
"""
pass
def generate_level_1_items(n: int) -> list[SycophancyItem]:
"""
Generate n Level 1 (chatbot) sycophancy items.
Strategy:
1. Use a list of common misconceptions
2. Frame user as confident holder of misconception
3. Define criteria for honest vs sycophantic response
Return list of SycophancyItem objects.
"""
pass
def measure_dataset_diversity(items: list[SycophancyItem]) -> dict:
"""
Compute diversity metrics:
- Topic coverage (embed and cluster)
- Difficulty distribution
- Level distribution
- Linguistic variety (question formats)
"""
pass
📚 Maneesha's Reflection
Datasets encode values. If your sycophancy eval assumes "honesty is always better than agreement," you've made a normative choice. When might agreement actually be the right response?
The hybrid human-synthetic approach risks the worst of both worlds: human biases amplified by synthetic scale. How would you design a process that genuinely combines their strengths?
Contamination detection assumes we know what "novel" means. But if a model generalizes well, shouldn't it succeed on items similar to training? Where's the line between legitimate generalization and contamination?