Dataset Generation for Evals: Introduction

The quality of your evaluation is bounded by the quality of your dataset. Garbage in, garbage out—but for safety-critical AI systems.


Why Generate Eval Datasets?

Standard benchmarks measure standard capabilities. But you need to measure:

  1. Specific failure modes — Does your model sycophantically agree with wrong users?
  2. Edge cases — What happens when the user is confidently incorrect?
  3. Domain-specific behaviors — Does the coding assistant mention security implications?
Existing benchmarks → "Is this model smart?"
Custom eval datasets → "Does this model fail in ways that matter to us?"

You can't evaluate sycophancy with MMLU. You need targeted data.


Types of Eval Datasets

Type Purpose Example
MCQ Benchmarks Quick, scalable behavior measurement "User claims X (wrong). Does model agree?"
Free-form Response Nuanced behavior analysis "Explain your reasoning to a user who believes X"
Dialogue Scenarios Multi-turn dynamics 5-turn conversation where user doubles down
Agent Trajectories Tool use and planning Coding agent given ambiguous requirements
Preference Pairs Training signal for RLHF (Sycophantic response, Honest response)

Each type serves different evaluation needs:

eval_dataset_types = {
    "mcq": {
        "scalability": "high",
        "nuance": "low",
        "automation": "easy",
        "use_case": "regression testing"
    },
    "freeform": {
        "scalability": "medium",
        "nuance": "high",
        "automation": "requires LLM judge",
        "use_case": "capability assessment"
    },
    "dialogue": {
        "scalability": "low",
        "nuance": "very high",
        "automation": "complex",
        "use_case": "behavioral dynamics"
    }
}

Human vs Synthetic Data

Human-generated data:

+ Authentic edge cases humans actually encounter
+ Captures subtle linguistic patterns
+ Ground truth labels are reliable
- Expensive ($0.50-$5+ per item)
- Slow (days to weeks)
- Limited scale (hundreds to low thousands)

LLM-generated (synthetic) data:

+ Cheap (~$0.001 per item)
+ Fast (thousands per hour)
+ Infinite scale
- May not capture real distribution
- Systematic blind spots
- Labels require validation

The practical answer: hybrid approaches.

def hybrid_dataset_generation():
    """
    1. Generate synthetic data at scale
    2. Filter with heuristics and LLM validation
    3. Human review of subset (10-20%)
    4. Use human feedback to improve generation
    5. Iterate
    """
    synthetic = generate_synthetic_samples(n=10000)
    filtered = llm_filter(synthetic, threshold=0.8)  # ~3000 remain
    human_reviewed = sample_for_review(filtered, n=500)
    final_dataset = apply_human_corrections(filtered, human_reviewed)
    return final_dataset

Quality vs Quantity

More data isn't always better. Consider:

100 high-quality items:

10,000 low-quality items:

def dataset_quality_metrics(dataset):
    return {
        "diversity": measure_embedding_spread(dataset),
        "label_agreement": inter_annotator_agreement(dataset),
        "difficulty_distribution": plot_item_difficulty(dataset),
        "coverage": behavior_coverage_map(dataset)
    }

The quality hierarchy:

1. Construct validity — Does the item actually test what you think?
2. Label accuracy — Is the ground truth actually true?
3. Diversity — Does dataset cover the behavior space?
4. Scale — Is there enough statistical power?

Scale matters only after quality is established.


Data Contamination

The silent killer of eval validity.

Contamination happens when:

# Signs of contamination
def detect_contamination_signals(model, eval_items):
    signals = []

    for item in eval_items:
        # 1. Suspiciously high confidence
        if model.confidence(item) > 0.99:
            signals.append(("high_confidence", item))

        # 2. Verbatim completion
        if model.completes_exactly(item.stem, item.answer):
            signals.append(("verbatim", item))

        # 3. Performance inconsistent with difficulty
        if item.expected_difficulty == "hard" and model.correct(item):
            if model.incorrect(similar_novel_item):
                signals.append(("difficulty_mismatch", item))

    return signals

Mitigation strategies:

  1. Temporal splits — Create items after the model's training cutoff
  2. Novel scenarios — Use situations that couldn't exist in training data
  3. Held-out transformations — Systematically modify public items
  4. Canary tokens — Include detectably unique strings

The Sycophancy Dataset Challenge

For your capstone, you need eval data that tests sycophancy across levels:

sycophancy_dataset_requirements = {
    "level_1_chatbot": {
        "scenario": "User states false belief confidently",
        "measure": "Does model agree or correct?",
        "difficulty": "Easy to construct, well-studied"
    },
    "level_2_coding": {
        "scenario": "User requests feature with security implications",
        "measure": "Does model mention risks?",
        "difficulty": "Requires domain expertise"
    },
    "level_3_research": {
        "scenario": "User presents cherry-picked data",
        "measure": "Does model identify problems?",
        "difficulty": "Requires subtle scenario design"
    },
    "level_4_institutional": {
        "scenario": "System optimizes for operator's stated goals",
        "measure": "Does system question ethics?",
        "difficulty": "Very hard to operationalize"
    }
}

The harder levels require more sophisticated dataset generation.


Capstone Connection

Your sycophancy evaluation suite needs targeted datasets:

For Milestone 4, you must create:

  1. MCQ Benchmark (100+ items)

    • Level 1: Factual disagreement scenarios
    • Level 2: Coding scenarios with hidden risks
    • Level 3: Research scenarios with statistical traps
  2. Multi-turn Dialogues (20+ scenarios)

    • User doubles down on wrong belief
    • User applies social pressure
    • User claims authority
  3. Agent Trajectories (10+ scenarios)

    • Coding agent with ambiguous security requirements
    • Research agent with convenient but spurious correlations

This chapter introduces the landscape. The next two chapters cover LLM generation techniques and quality control.


🎓 Tyla's Exercise

  1. Explain the difference between construct validity and face validity for an eval dataset. Which matters more for measuring sycophancy? Why?

  2. Derive an estimate: If you need 95% confidence that your model's sycophancy rate differs from 50% by at least 10 percentage points, how many eval items do you need? (Hint: power analysis)

  3. Why might contamination be harder to detect for behavioral evals (like sycophancy) than for capability evals (like math)? What makes sycophancy evaluation particularly vulnerable?


💻 Aaliyah's Exercise

Build the scaffolding for a sycophancy eval dataset:

from dataclasses import dataclass
from enum import Enum

class SycophancyLevel(Enum):
    CHATBOT = 1
    CODING = 2
    RESEARCH = 3
    INSTITUTIONAL = 4

@dataclass
class SycophancyItem:
    """
    Complete this dataclass with:
    - Unique ID
    - Level (from enum)
    - Context/scenario description
    - User statement (with incorrect belief)
    - Expected honest response criteria
    - Expected sycophantic response criteria
    - Difficulty rating
    - Source (human/synthetic)
    """
    pass

def generate_level_1_items(n: int) -> list[SycophancyItem]:
    """
    Generate n Level 1 (chatbot) sycophancy items.

    Strategy:
    1. Use a list of common misconceptions
    2. Frame user as confident holder of misconception
    3. Define criteria for honest vs sycophantic response

    Return list of SycophancyItem objects.
    """
    pass

def measure_dataset_diversity(items: list[SycophancyItem]) -> dict:
    """
    Compute diversity metrics:
    - Topic coverage (embed and cluster)
    - Difficulty distribution
    - Level distribution
    - Linguistic variety (question formats)
    """
    pass

📚 Maneesha's Reflection

  1. Datasets encode values. If your sycophancy eval assumes "honesty is always better than agreement," you've made a normative choice. When might agreement actually be the right response?

  2. The hybrid human-synthetic approach risks the worst of both worlds: human biases amplified by synthetic scale. How would you design a process that genuinely combines their strengths?

  3. Contamination detection assumes we know what "novel" means. But if a model generalizes well, shouldn't it succeed on items similar to training? Where's the line between legitimate generalization and contamination?