Dataset Quality Control

A dataset is only as good as its weakest items. Quality control separates signal from noise.


The Quality Stack

Level 5: Validity — Does dataset measure what you intend?
Level 4: Coverage — Does dataset span the behavior space?
Level 3: Labels — Are ground truth labels accurate?
Level 2: Items — Are individual items well-constructed?
Level 1: Format — Is data properly structured?

Most teams stop at Level 2. Rigorous evaluation requires all five.
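
To make the stack concrete, here is a minimal sketch of checking the levels bottom-up and stopping at the first one that fails. The check functions are placeholders borrowed from code later in this section (validate_format, llm_validate, compute_human_agreement, compute_coverage_score, expert_behavior_check); their return types and the thresholds are illustrative assumptions, not fixed requirements.

def run_quality_stack(dataset: list[dict]) -> dict[str, bool]:
    # Sketch only: the check functions appear later in this section, and the
    # thresholds (0.6, 0.8, 0.9) are illustrative assumptions.
    levels = [
        ("Format",   lambda d: all(validate_format(item)[0] for item in d)),
        ("Items",    lambda d: all(llm_validate(item)["overall_valid"] for item in d)),
        ("Labels",   lambda d: compute_human_agreement(d) > 0.6),
        ("Coverage", lambda d: compute_coverage_score(d) > 0.8),
        ("Validity", lambda d: expert_behavior_check(d) > 0.9),
    ]
    results = {}
    for name, check in levels:
        results[name] = check(dataset)
        if not results[name]:
            break  # higher levels are not meaningful until this one passes
    return results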


Data Quality Dimensions

| Dimension   | Question                                    | Measurement              |
|-------------|---------------------------------------------|--------------------------|
| Accuracy    | Are labels correct?                         | Human agreement          |
| Consistency | Do similar items have consistent labels?    | Pairwise analysis        |
| Clarity     | Is each item unambiguous?                   | Annotator confusion rate |
| Relevance   | Does each item test the intended behavior?  | Expert review            |
| Diversity   | Does the dataset cover the space?           | Embedding analysis       |
| Difficulty  | Is the difficulty distribution appropriate? | Model performance        |

These dimensions can be rolled up into a single quality report:

def compute_quality_report(dataset: list[dict]) -> dict:
    return {
        "accuracy": {
            "human_agreement": compute_human_agreement(dataset),
            "expert_validation_rate": expert_review_sample(dataset, n=50)
        },
        "consistency": {
            "similar_pair_agreement": check_similar_pairs(dataset),
            "label_stability": measure_label_stability(dataset)
        },
        "clarity": {
            "ambiguous_items": count_ambiguous(dataset),
            "avg_annotator_confidence": mean_confidence(dataset)
        },
        "relevance": {
            "on_topic_rate": count_on_topic(dataset),
            "tests_target_behavior": expert_behavior_check(dataset)
        },
        "diversity": {
            "embedding_coverage": compute_coverage_score(dataset),
            "category_balance": category_entropy(dataset)
        },
        "difficulty": {
            "distribution": difficulty_histogram(dataset),
            "calibration": difficulty_vs_performance(dataset)
        }
    }

Human Annotation

When to use human annotators:

  1. Ground-truth labels require judgment that automated checks cannot provide
  2. You need to validate a sample of LLM-generated items before trusting the full set
  3. You are calibrating an automated judge against human preferences
  4. Items flagged as ambiguous or low-confidence need adjudication

Annotation interface design:

from dataclasses import dataclass

@dataclass
class AnnotationTask:
    item_id: str
    item_content: dict
    instructions: str
    response_options: list[str]
    metadata_to_collect: list[str]

def create_sycophancy_annotation_task(item: dict) -> AnnotationTask:
    return AnnotationTask(
        item_id=item["id"],
        item_content={
            "scenario": item["scenario"],
            "user_statement": item["user_statement"],
            "response_a": item["response_a"],
            "response_b": item["response_b"],
        },
        instructions="""
        Read the scenario and user statement.
        Then evaluate both responses.

        Q1: Which response is more HONEST (corrects misconceptions)?
        Q2: Which response is more SYCOPHANTIC (agrees with user)?
        Q3: How confident are you? (1-5)
        Q4: Is this a realistic scenario? (Yes/No/Unsure)
        Q5: Any issues with this item? (free text)
        """,
        response_options=["Response A", "Response B", "Neither", "Both"],
        metadata_to_collect=["confidence", "realism", "issues"]
    )

Annotation protocols:

import random

class AnnotationProtocol:
    def __init__(self, dataset: list[dict], n_annotators: int = 3):
        self.dataset = dataset
        self.n_annotators = n_annotators

    def assign_items(self) -> dict[int, list[dict]]:
        """
        Each item gets one primary annotator, so workloads stay balanced.
        A 10% overlap set is assigned to every annotator for agreement
        calculation.
        """
        assignments = {}
        overlap_items = random.sample(self.dataset, int(len(self.dataset) * 0.1))

        for annotator_id in range(self.n_annotators):
            # Primary assignment: every n-th item, offset by annotator id
            primary = self.dataset[annotator_id::self.n_annotators]
            # Plus overlap items (skip any already in this annotator's slice)
            assignments[annotator_id] = primary + [
                item for item in overlap_items if item not in primary
            ]

        return assignments

    def compute_agreement(self, annotations: dict) -> dict:
        """
        Compute inter-annotator agreement on overlap items.
        """
        overlap_annotations = self.extract_overlap(annotations)
        return {
            "raw_agreement": raw_agreement(overlap_annotations),
            "cohens_kappa": cohens_kappa(overlap_annotations),
            "krippendorffs_alpha": krippendorffs_alpha(overlap_annotations)
        }

Inter-Rater Reliability

Why it matters:

If annotators disagree, either:

  1. The item is ambiguous (fix the item)
  2. The task is unclear (fix the instructions)
  3. Annotators have different standards (calibration needed)

Metrics:

def compute_irr_metrics(annotations: list[list[int]]) -> dict:
    """
    annotations: list of annotator judgments per item
    Each inner list has n_annotators ratings
    """

    # Raw agreement (percentage)
    def raw_agreement(anns):
        return sum(len(set(a)) == 1 for a in anns) / len(anns)

    # Cohen's Kappa (for 2 annotators)
    def cohens_kappa(ann1, ann2):
        observed = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)
        expected = sum(
            (ann1.count(k) / len(ann1)) * (ann2.count(k) / len(ann2))
            for k in set(ann1 + ann2)
        )
        return (observed - expected) / (1 - expected)

    # Fleiss' Kappa (for n annotators)
    def fleiss_kappa(anns, n_categories):
        n_items = len(anns)
        n_raters = len(anns[0])

        # Proportion of all assignments to each category
        p_j = [
            sum(a.count(j) for a in anns) / (n_items * n_raters)
            for j in range(n_categories)
        ]

        # Agreement per item
        P_i = [
            (sum(a.count(j) ** 2 for j in range(n_categories)) - n_raters)
            / (n_raters * (n_raters - 1))
            for a in anns
        ]

        P_bar = sum(P_i) / n_items
        P_e = sum(p ** 2 for p in p_j)

        return (P_bar - P_e) / (1 - P_e)

    # Infer the number of categories (assumes integer labels 0..k-1)
    n_categories = max(rating for item in annotations for rating in item) + 1
    fk = fleiss_kappa(annotations, n_categories)

    return {
        "raw_agreement": raw_agreement(annotations),
        "fleiss_kappa": fk,
        "interpretation": interpret_kappa(fk)
    }

def interpret_kappa(kappa: float) -> str:
    if kappa < 0.20:
        return "Poor agreement - revise task or instructions"
    elif kappa < 0.40:
        return "Fair agreement - identify disagreement sources"
    elif kappa < 0.60:
        return "Moderate agreement - acceptable for some tasks"
    elif kappa < 0.80:
        return "Substantial agreement - good quality"
    else:
        return "Almost perfect agreement - excellent quality"

Disagreement analysis:

import numpy as np

def analyze_disagreements(annotations: dict, items: list[dict]) -> dict:
    """
    Find patterns in disagreements.
    """
    # Look items up by id rather than by list position
    items_by_id = {item["id"]: item for item in items}
    disagreements = []

    for item_id, ratings in annotations.items():
        if len(set(ratings)) > 1:  # Disagreement exists
            disagreements.append({
                "item_id": item_id,
                "ratings": ratings,
                "item": items_by_id[item_id],
                "variance": np.var(ratings)
            })

    # Cluster disagreements by item characteristics
    return {
        "total_disagreements": len(disagreements),
        "by_difficulty": group_by(disagreements, lambda x: x["item"]["difficulty"]),
        "by_category": group_by(disagreements, lambda x: x["item"]["category"]),
        "highest_variance": sorted(disagreements, key=lambda x: x["variance"])[-10:],
        "common_patterns": identify_patterns(disagreements)
    }

Adversarial Examples

Why create adversarial items:

  1. Test robustness of evaluation
  2. Find edge cases models might exploit
  3. Stress-test annotation guidelines

Types of adversarial items:

class AdversarialItemGenerator:
    def generate_boundary_cases(self, item: dict) -> list[dict]:
        """
        Items near the decision boundary.
        """
        return [
            self.make_slightly_wrong(item),  # Misconception is subtle
            self.make_debatable(item),       # Reasonable people might disagree
            self.add_true_elements(item),    # Mix of correct and incorrect
        ]

    def generate_shortcut_exploits(self, item: dict) -> list[dict]:
        """
        Items where surface features could enable cheating.
        """
        return [
            self.swap_response_order(item),  # Test position bias
            self.equalize_length(item),      # Remove length cues
            self.neutralize_tone(item),      # Remove confidence cues
        ]

    def generate_format_attacks(self, item: dict) -> list[dict]:
        """
        Items that might break parsing or classification.
        """
        return [
            self.add_special_characters(item),
            self.add_code_blocks(item),
            self.make_very_long(item),
            self.make_very_short(item),
        ]

Adversarial validation:

def adversarial_validation(dataset: list[dict], model) -> dict:
    """
    Check if model exploits shortcuts.
    """
    results = {
        "position_bias": test_position_bias(dataset, model),
        "length_bias": test_length_correlation(dataset, model),
        "keyword_reliance": test_keyword_removal(dataset, model),
        "paraphrase_stability": test_paraphrase_consistency(dataset, model)
    }

    return results

def test_position_bias(dataset: list[dict], model) -> float:
    """
    Check if model prefers responses in certain positions.
    """
    original_choices = []
    swapped_choices = []

    for item in dataset:
        # Original order
        original_choices.append(model.choose(item))

        # Swapped order
        swapped_item = swap_response_order(item)
        swapped_choices.append(model.choose(swapped_item))

    # If the model has no position bias, it should pick the same underlying
    # response after the swap. Assuming choose() returns an index (0 or 1),
    # the consistent post-swap choice is 1 - s.
    consistency = sum(
        o == (1 - s)
        for o, s in zip(original_choices, swapped_choices)
    ) / len(dataset)

    return consistency  # 1.0 = fully consistent (no detectable position bias)
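
The length check referenced in adversarial_validation can be sketched the same way. This minimal version assumes each item carries response_a and response_b and that model.choose() returns the index of the preferred response (0 or 1), the same assumptions as the position test above:

import numpy as np

def test_length_correlation(dataset: list[dict], model) -> float:
    """
    Correlate the model's preference with which response is longer.
    A correlation near 0 suggests length is not being used as a shortcut.
    """
    choices = []       # 1 if the model picked response_b, else 0
    length_diffs = []  # len(response_b) - len(response_a)

    for item in dataset:
        choices.append(model.choose(item))
        length_diffs.append(
            len(item["response_b"]) - len(item["response_a"])
        )

    # Pearson correlation between "picked B" and "B is longer"
    return float(np.corrcoef(choices, length_diffs)[0, 1])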

Dataset Documentation

Datasheet for your dataset:

DATASHEET_TEMPLATE = """
# Sycophancy Evaluation Dataset v{version}

## Motivation
- **Purpose**: {purpose}
- **Creators**: {creators}
- **Funding**: {funding}

## Composition
- **Total items**: {n_items}
- **Item types**: {item_types}
- **Level distribution**: {level_distribution}
- **Category distribution**: {category_distribution}

## Collection Process
- **Generation method**: {generation_method}
- **Validation process**: {validation_process}
- **Annotator details**: {annotator_details}
- **Time period**: {time_period}

## Quality Metrics
- **Inter-annotator agreement**: {irr}
- **Expert validation rate**: {expert_rate}
- **Adversarial robustness**: {adversarial_results}

## Preprocessing
- **Filtering criteria**: {filtering}
- **Balancing method**: {balancing}
- **Deduplication**: {deduplication}

## Uses
- **Intended use**: {intended_use}
- **Out of scope uses**: {out_of_scope}

## Distribution
- **License**: {license}
- **Access**: {access}
- **Maintenance**: {maintenance}

## Limitations
- **Known biases**: {biases}
- **Coverage gaps**: {gaps}
- **Validity concerns**: {validity_concerns}

## Ethical Considerations
- **Sensitive content**: {sensitive_content}
- **Potential harms**: {potential_harms}
- **Mitigations**: {mitigations}
"""

Model card for your evaluation:

def generate_eval_model_card(dataset: list[dict], results: dict) -> str:
    return f"""
# Sycophancy Evaluation Results

## Dataset
- Items: {len(dataset)}
- Levels covered: {list(set(i['level'] for i in dataset))}
- Quality metrics: IRR={results['irr']:.2f}

## Models Evaluated
{format_model_table(results['models'])}

## Key Findings
{summarize_findings(results)}

## Limitations
- Dataset may not cover all sycophancy manifestations
- Synthetic items may not reflect real user behavior
- Binary classification may miss nuance

## Recommendations
{generate_recommendations(results)}
"""

Quality Control Pipeline

Complete pipeline for your capstone:

class QualityControlPipeline:
    def __init__(self, dataset: list[dict]):
        self.dataset = dataset
        self.quality_report = {}

    def run_full_pipeline(self) -> dict:
        # Stage 1: Format validation
        self.dataset = self.format_validation()

        # Stage 2: Automated quality checks
        self.dataset = self.automated_quality_checks()

        # Stage 3: Human annotation (subset)
        human_results = self.human_annotation_phase()

        # Stage 4: Inter-rater analysis
        irr_results = self.compute_irr(human_results)

        # Stage 5: Disagreement resolution
        self.dataset = self.resolve_disagreements(human_results)

        # Stage 6: Adversarial testing
        adversarial_results = self.adversarial_testing()

        # Stage 7: Final documentation
        self.generate_documentation()

        return {
            "final_dataset": self.dataset,
            "quality_report": self.quality_report,
            "irr": irr_results,
            "adversarial": adversarial_results
        }

    def format_validation(self) -> list[dict]:
        """Remove malformed items."""
        valid = []
        for item in self.dataset:
            is_valid, errors = validate_format(item)
            if is_valid:
                valid.append(item)
            else:
                self.quality_report.setdefault("format_errors", []).append({
                    "item_id": item.get("id"),
                    "errors": errors
                })
        return valid

    def automated_quality_checks(self) -> list[dict]:
        """LLM-based validation."""
        validated = []
        for item in self.dataset:
            result = llm_validate(item)
            item["validation_result"] = result
            if result["overall_valid"]:
                validated.append(item)
        return validated

    def human_annotation_phase(self) -> dict:
        """Send subset to human annotators."""
        sample = random.sample(self.dataset, min(100, len(self.dataset)))
        # In practice, use annotation platform
        return {"sample": sample, "annotations": {}}  # Placeholder

    def adversarial_testing(self) -> dict:
        """Test robustness."""
        generator = AdversarialItemGenerator()
        adversarial_items = []
        for item in self.dataset[:20]:
            adversarial_items.extend(generator.generate_boundary_cases(item))
        return {"n_adversarial": len(adversarial_items)}

Capstone Connection

Your sycophancy evaluation needs rigorous quality control:

Minimum requirements for Milestone 4:

  1. Inter-annotator agreement (Kappa > 0.6) on sample of items
  2. Adversarial testing for position/length bias
  3. Complete datasheet documenting your dataset
  4. Quality metrics computed and reported

# Your capstone quality checklist
capstone_quality_requirements = {
    "format_validation": "100% of items pass schema validation",
    "llm_validation": "80%+ pass LLM quality checks",
    "human_agreement": "Fleiss Kappa > 0.6 on 50+ item sample",
    "adversarial_robustness": "Position bias < 10%, length correlation < 0.3",
    "documentation": "Complete datasheet following template",
    "diversity": "Coverage across all 4 sycophancy levels"
}

🎓 Tyla's Exercise

  1. Prove that Cohen's Kappa corrects for chance agreement. If two annotators randomly assign binary labels with P(label=1) = 0.7, what's their expected raw agreement? What's expected Kappa?

  2. When is Fleiss' Kappa preferable to computing pairwise Cohen's Kappa and averaging? Derive the conditions under which these give different results.

  3. Adversarial validation assumes the model might "cheat" by exploiting shortcuts. But what if shortcuts are legitimate signals? Formalize the distinction between valid and invalid correlations in evaluation data.


💻 Aaliyah's Exercise

Build a complete quality control system:

from dataclasses import dataclass
from typing import Callable

@dataclass
class QualityCheck:
    name: str
    check_fn: Callable[[dict], bool]
    severity: str  # "error" or "warning"

class DatasetQualityController:
    def __init__(self):
        self.checks = []

    def add_check(self, check: QualityCheck):
        self.checks.append(check)

    def validate_item(self, item: dict) -> dict:
        """
        Run all checks on an item.
        Return {passed: bool, errors: [], warnings: []}
        """
        pass

    def validate_dataset(self, dataset: list[dict]) -> dict:
        """
        Run all checks on dataset.
        Return aggregate quality report.
        """
        pass

    def compute_irr(self, annotations: dict[str, list[int]]) -> dict:
        """
        Compute inter-rater reliability metrics.
        annotations: {item_id: [annotator_1_rating, annotator_2_rating, ...]}
        Return {raw_agreement, fleiss_kappa, interpretation}
        """
        pass

    def analyze_disagreements(self, annotations: dict, items: list[dict]) -> dict:
        """
        Find patterns in annotator disagreements.
        Return analysis with recommendations.
        """
        pass

    def run_adversarial_checks(self, dataset: list[dict], model) -> dict:
        """
        Test for position bias, length bias, keyword reliance.
        Return robustness report.
        """
        pass

    def generate_datasheet(self, dataset: list[dict], quality_report: dict) -> str:
        """
        Generate complete documentation for dataset.
        """
        pass

# Define your quality checks
controller = DatasetQualityController()
controller.add_check(QualityCheck(
    name="has_required_fields",
    check_fn=lambda item: all(k in item for k in ["id", "scenario", "user_statement"]),
    severity="error"
))
controller.add_check(QualityCheck(
    name="reasonable_length",
    check_fn=lambda item: 50 < len(item.get("scenario", "")) < 2000,
    severity="warning"
))
# Add more checks...

# Run validation
report = controller.validate_dataset(your_dataset)
print(f"Pass rate: {report['pass_rate']:.1%}")
print(f"Common errors: {report['common_errors']}")

📚 Maneesha's Reflection

  1. Inter-rater reliability measures whether annotators agree, not whether they're correct. High agreement on wrong labels is worse than low agreement. How would you design annotation to surface cases where everyone might be wrong?

  2. Adversarial examples reveal model weaknesses, but they can also reveal evaluation weaknesses. If a model "fails" an adversarial test, when should we fix the model versus fix the evaluation?

  3. Dataset documentation (datasheets, model cards) is meant to improve transparency. But documentation can also be performative—creating an appearance of rigor without the substance. What would genuine accountability for dataset quality look like?