Dataset Quality Control
A dataset is only as good as its weakest items. Quality control separates signal from noise.
The Quality Stack
Level 5: Validity — Does dataset measure what you intend?
Level 4: Coverage — Does dataset span the behavior space?
Level 3: Labels — Are ground truth labels accurate?
Level 2: Items — Are individual items well-constructed?
Level 1: Format — Is data properly structured?
Most teams stop at Level 2. Rigorous evaluation requires all five.
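One way to make the stack operational is as a series of gates checked bottom to top. This is only a sketch, and every check function below is a trivial stand-in for the real validation each level needs:

# Minimal sketch: the five levels as ordered gates, checked bottom to top.
# Every check here is a trivial stand-in; swap in your real logic per level.
QUALITY_STACK = [
    (1, "Format",   lambda ds: all("id" in item and "scenario" in item for item in ds)),
    (2, "Items",    lambda ds: all(len(item.get("scenario", "")) > 50 for item in ds)),
    (3, "Labels",   lambda ds: all("label" in item for item in ds)),
    (4, "Coverage", lambda ds: len({item.get("category") for item in ds}) >= 4),
    (5, "Validity", lambda ds: len(ds) > 0),  # stand-in; real validity needs expert review
]

def highest_level_passed(dataset: list[dict]) -> int:
    """Return the highest contiguous level the dataset clears, starting from Level 1."""
    passed = 0
    for level, _name, check in QUALITY_STACK:
        if not check(dataset):
            break
        passed = level
    return passed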
Data Quality Dimensions
| Dimension | Question | Measurement |
|---|---|---|
| Accuracy | Are labels correct? | Human agreement |
| Consistency | Do similar items have consistent labels? | Pairwise analysis |
| Clarity | Is each item unambiguous? | Annotator confusion rate |
| Relevance | Does item test intended behavior? | Expert review |
| Diversity | Does dataset cover the space? | Embedding analysis |
| Difficulty | Is difficulty distribution appropriate? | Model performance |
def compute_quality_report(dataset: list[dict]) -> dict:
return {
"accuracy": {
"human_agreement": compute_human_agreement(dataset),
"expert_validation_rate": expert_review_sample(dataset, n=50)
},
"consistency": {
"similar_pair_agreement": check_similar_pairs(dataset),
"label_stability": measure_label_stability(dataset)
},
"clarity": {
"ambiguous_items": count_ambiguous(dataset),
"avg_annotator_confidence": mean_confidence(dataset)
},
"relevance": {
"on_topic_rate": count_on_topic(dataset),
"tests_target_behavior": expert_behavior_check(dataset)
},
"diversity": {
"embedding_coverage": compute_coverage_score(dataset),
"category_balance": category_entropy(dataset)
},
"difficulty": {
"distribution": difficulty_histogram(dataset),
"calibration": difficulty_vs_performance(dataset)
}
}
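The helper functions in this report are placeholders. As one concrete example, here is a minimal sketch of `category_entropy` (normalized Shannon entropy over item categories, assuming each item carries a `category` field):

import math
from collections import Counter

def category_entropy(dataset: list[dict]) -> float:
    """Normalized Shannon entropy of the category distribution (1.0 = perfectly balanced)."""
    if not dataset:
        return 0.0
    counts = Counter(item.get("category", "unknown") for item in dataset)
    total = sum(counts.values())
    probs = [count / total for count in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return entropy / max_entropy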
Human Annotation
When to use human annotators:
- Ground truth establishment
- Validation of synthetic data
- Ambiguous case resolution
- Quality spot-checks
Annotation interface design:
from dataclasses import dataclass

@dataclass
class AnnotationTask:
item_id: str
item_content: dict
instructions: str
response_options: list[str]
metadata_to_collect: list[str]
def create_sycophancy_annotation_task(item: dict) -> AnnotationTask:
return AnnotationTask(
item_id=item["id"],
item_content={
"scenario": item["scenario"],
"user_statement": item["user_statement"],
"response_a": item["response_a"],
"response_b": item["response_b"],
},
instructions="""
Read the scenario and user statement.
Then evaluate both responses.
Q1: Which response is more HONEST (corrects misconceptions)?
Q2: Which response is more SYCOPHANTIC (agrees with user)?
Q3: How confident are you? (1-5)
Q4: Is this a realistic scenario? (Yes/No/Unsure)
Q5: Any issues with this item? (free text)
""",
response_options=["Response A", "Response B", "Neither", "Both"],
metadata_to_collect=["confidence", "realism", "issues"]
)
Annotation protocols:
import random

class AnnotationProtocol:
    def __init__(self, dataset: list[dict], n_annotators: int = 3):
        self.dataset = dataset
        self.n_annotators = n_annotators

    def assign_items(self) -> dict[int, list[dict]]:
        """
        Split items across annotators with balanced workloads, plus a shared
        ~10% overlap set that every annotator rates so agreement can be computed.
        """
        assignments = {}
        overlap_items = random.sample(self.dataset, max(1, int(len(self.dataset) * 0.1)))
        overlap_ids = {item["id"] for item in overlap_items}
        for annotator_id in range(self.n_annotators):
            # Primary assignment: a disjoint, balanced slice (minus any shared overlap items)
            primary = [
                item for item in self.dataset[annotator_id::self.n_annotators]
                if item["id"] not in overlap_ids
            ]
            # Plus the shared overlap items
            assignments[annotator_id] = primary + overlap_items
        return assignments
def compute_agreement(self, annotations: dict) -> dict:
"""
Compute inter-annotator agreement on overlap items.
"""
overlap_annotations = self.extract_overlap(annotations)
return {
"raw_agreement": raw_agreement(overlap_annotations),
"cohens_kappa": cohens_kappa(overlap_annotations),
"krippendorffs_alpha": krippendorffs_alpha(overlap_annotations)
}
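A sketch of how the protocol might be driven, assuming `dataset` holds your item dicts and the judgments come back from whatever annotation platform you use:

# Sketch usage; `dataset` is your list of item dicts
protocol = AnnotationProtocol(dataset, n_annotators=3)
assignments = protocol.assign_items()
for annotator_id, assigned in assignments.items():
    print(f"Annotator {annotator_id}: {len(assigned)} items")
# Once annotations are collected from your platform:
# agreement = protocol.compute_agreement(annotations)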
Inter-Rater Reliability
Why it matters:
If annotators disagree, at least one of the following is true:
- The item is ambiguous (fix the item)
- The task is unclear (fix the instructions)
- Annotators have different standards (calibration needed)
Metrics:
def compute_irr_metrics(annotations: list[list[int]], n_categories: int = 3) -> dict:
    """
    annotations: one inner list of annotator ratings per item.
    Ratings are integer category indices in 0..n_categories-1.
    """
# Raw agreement (percentage)
def raw_agreement(anns):
return sum(len(set(a)) == 1 for a in anns) / len(anns)
# Cohen's Kappa (for 2 annotators)
def cohens_kappa(ann1, ann2):
observed = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)
expected = sum(
(ann1.count(k) / len(ann1)) * (ann2.count(k) / len(ann2))
for k in set(ann1 + ann2)
)
return (observed - expected) / (1 - expected)
# Fleiss' Kappa (for n annotators)
def fleiss_kappa(anns, n_categories):
n_items = len(anns)
n_raters = len(anns[0])
# Proportion of all assignments to each category
p_j = [
sum(a.count(j) for a in anns) / (n_items * n_raters)
for j in range(n_categories)
]
# Agreement per item
P_i = [
(sum(a.count(j) ** 2 for j in range(n_categories)) - n_raters)
/ (n_raters * (n_raters - 1))
for a in anns
]
P_bar = sum(P_i) / n_items
P_e = sum(p ** 2 for p in p_j)
return (P_bar - P_e) / (1 - P_e)
    kappa = fleiss_kappa(annotations, n_categories)
    return {
        "raw_agreement": raw_agreement(annotations),
        "fleiss_kappa": kappa,
        "interpretation": interpret_kappa(kappa)
    }
def interpret_kappa(kappa: float) -> str:
if kappa < 0.20:
return "Poor agreement - revise task or instructions"
elif kappa < 0.40:
return "Fair agreement - identify disagreement sources"
elif kappa < 0.60:
return "Moderate agreement - acceptable for some tasks"
elif kappa < 0.80:
return "Substantial agreement - good quality"
else:
return "Almost perfect agreement - excellent quality"
Disagreement analysis:
import numpy as np

def analyze_disagreements(annotations: dict, items: list[dict]) -> dict:
    """
    Find patterns in disagreements.
    annotations: {item_id: [rating_1, rating_2, ...]}
    """
    items_by_id = {item["id"]: item for item in items}
    disagreements = []
    for item_id, ratings in annotations.items():
        if len(set(ratings)) > 1:  # Disagreement exists
            disagreements.append({
                "item_id": item_id,
                "ratings": ratings,
                "item": items_by_id[item_id],
                "variance": float(np.var(ratings))
            })
    # Cluster disagreements by item characteristics
    return {
        "total_disagreements": len(disagreements),
        "by_difficulty": group_by(disagreements, lambda x: x["item"]["difficulty"]),
        "by_category": group_by(disagreements, lambda x: x["item"]["category"]),
        "highest_variance": sorted(disagreements, key=lambda x: x["variance"])[-10:],
        "common_patterns": identify_patterns(disagreements)
    }
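The `group_by` helper above is assumed rather than defined; a minimal version:

from collections import defaultdict
from typing import Callable

def group_by(records: list[dict], key_fn: Callable[[dict], object]) -> dict:
    """Group records into buckets keyed by key_fn(record)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[key_fn(record)].append(record)
    return dict(buckets)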
Adversarial Examples
Why create adversarial items:
- Test robustness of evaluation
- Find edge cases models might exploit
- Stress-test annotation guidelines
Types of adversarial items:
class AdversarialItemGenerator:
def generate_boundary_cases(self, item: dict) -> list[dict]:
"""
Items near the decision boundary.
"""
return [
self.make_slightly_wrong(item), # Misconception is subtle
self.make_debatable(item), # Reasonable people might disagree
self.add_true_elements(item), # Mix of correct and incorrect
]
def generate_shortcut_exploits(self, item: dict) -> list[dict]:
"""
Items where surface features could enable cheating.
"""
return [
self.swap_response_order(item), # Test position bias
self.equalize_length(item), # Remove length cues
self.neutralize_tone(item), # Remove confidence cues
]
def generate_format_attacks(self, item: dict) -> list[dict]:
"""
Items that might break parsing or classification.
"""
return [
self.add_special_characters(item),
self.add_code_blocks(item),
self.make_very_long(item),
self.make_very_short(item),
]
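One of these transforms, `swap_response_order`, is also needed by the position-bias test below. A minimal sketch, assuming the `response_a`/`response_b` schema used earlier (the `preferred` label field is hypothetical):

def swap_response_order(item: dict) -> dict:
    """Return a copy of the item with response_a and response_b swapped."""
    swapped = dict(item)
    swapped["response_a"], swapped["response_b"] = item["response_b"], item["response_a"]
    # If your items carry a ground-truth preference, flip it too (field name is hypothetical)
    if swapped.get("preferred") in ("A", "B"):
        swapped["preferred"] = "B" if swapped["preferred"] == "A" else "A"
    return swapped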
Adversarial validation:
def adversarial_validation(dataset: list[dict], model) -> dict:
"""
Check if model exploits shortcuts.
"""
results = {
"position_bias": test_position_bias(dataset, model),
"length_bias": test_length_correlation(dataset, model),
"keyword_reliance": test_keyword_removal(dataset, model),
"paraphrase_stability": test_paraphrase_consistency(dataset, model)
}
return results
def test_position_bias(dataset: list[dict], model) -> float:
    """
    Check whether the model prefers responses in certain positions.
    Returns the fraction of items where the choice flips with position
    (0.0 means no position bias). Choices are assumed to be encoded as 0/1.
    """
    original_choices = []
    swapped_choices = []
    for item in dataset:
        # Original order
        original_choices.append(model.choose(item))
        # Swapped order
        swapped_item = swap_response_order(item)
        swapped_choices.append(model.choose(swapped_item))
    # With no bias, the same underlying response is chosen regardless of position
    consistency = sum(
        o == (1 - s)  # Account for the swap
        for o, s in zip(original_choices, swapped_choices)
    ) / len(dataset)
    return 1.0 - consistency
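`test_length_correlation` is assumed above; a minimal sketch using the Pearson correlation between the model's choice and the length gap (the 0 = Response A, 1 = Response B encoding is an assumption):

import numpy as np

def test_length_correlation(dataset: list[dict], model) -> float:
    """Correlation between choosing Response B and Response B being longer.
    Values near 0 suggest little length bias; assumes choices are encoded 0 (A) / 1 (B)."""
    choices, length_gaps = [], []
    for item in dataset:
        choices.append(model.choose(item))
        length_gaps.append(len(item["response_b"]) - len(item["response_a"]))
    return float(np.corrcoef(choices, length_gaps)[0, 1])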
Dataset Documentation
Datasheet for your dataset:
DATASHEET_TEMPLATE = """
# Sycophancy Evaluation Dataset v{version}
## Motivation
- **Purpose**: {purpose}
- **Creators**: {creators}
- **Funding**: {funding}
## Composition
- **Total items**: {n_items}
- **Item types**: {item_types}
- **Level distribution**: {level_distribution}
- **Category distribution**: {category_distribution}
## Collection Process
- **Generation method**: {generation_method}
- **Validation process**: {validation_process}
- **Annotator details**: {annotator_details}
- **Time period**: {time_period}
## Quality Metrics
- **Inter-annotator agreement**: {irr}
- **Expert validation rate**: {expert_rate}
- **Adversarial robustness**: {adversarial_results}
## Preprocessing
- **Filtering criteria**: {filtering}
- **Balancing method**: {balancing}
- **Deduplication**: {deduplication}
## Uses
- **Intended use**: {intended_use}
- **Out of scope uses**: {out_of_scope}
## Distribution
- **License**: {license}
- **Access**: {access}
- **Maintenance**: {maintenance}
## Limitations
- **Known biases**: {biases}
- **Coverage gaps**: {gaps}
- **Validity concerns**: {validity_concerns}
## Ethical Considerations
- **Sensitive content**: {sensitive_content}
- **Potential harms**: {potential_harms}
- **Mitigations**: {mitigations}
"""
Model card for your evaluation:
def generate_eval_model_card(dataset: list[dict], results: dict) -> str:
return f"""
# Sycophancy Evaluation Results
## Dataset
- Items: {len(dataset)}
- Levels covered: {list(set(i['level'] for i in dataset))}
- Quality metrics: IRR={results['irr']:.2f}
## Models Evaluated
{format_model_table(results['models'])}
## Key Findings
{summarize_findings(results)}
## Limitations
- Dataset may not cover all sycophancy manifestations
- Synthetic items may not reflect real user behavior
- Binary classification may miss nuance
## Recommendations
{generate_recommendations(results)}
"""
Quality Control Pipeline
Complete pipeline for your capstone:
class QualityControlPipeline:
def __init__(self, dataset: list[dict]):
self.dataset = dataset
self.quality_report = {}
def run_full_pipeline(self) -> dict:
# Stage 1: Format validation
self.dataset = self.format_validation()
# Stage 2: Automated quality checks
self.dataset = self.automated_quality_checks()
# Stage 3: Human annotation (subset)
human_results = self.human_annotation_phase()
# Stage 4: Inter-rater analysis
irr_results = self.compute_irr(human_results)
# Stage 5: Disagreement resolution
self.dataset = self.resolve_disagreements(human_results)
# Stage 6: Adversarial testing
adversarial_results = self.adversarial_testing()
# Stage 7: Final documentation
self.generate_documentation()
return {
"final_dataset": self.dataset,
"quality_report": self.quality_report,
"irr": irr_results,
"adversarial": adversarial_results
}
def format_validation(self) -> list[dict]:
"""Remove malformed items."""
valid = []
for item in self.dataset:
is_valid, errors = validate_format(item)
if is_valid:
valid.append(item)
else:
self.quality_report.setdefault("format_errors", []).append({
"item_id": item.get("id"),
"errors": errors
})
return valid
def automated_quality_checks(self) -> list[dict]:
"""LLM-based validation."""
validated = []
for item in self.dataset:
result = llm_validate(item)
item["validation_result"] = result
if result["overall_valid"]:
validated.append(item)
return validated
def human_annotation_phase(self) -> dict:
"""Send subset to human annotators."""
sample = random.sample(self.dataset, min(100, len(self.dataset)))
# In practice, use annotation platform
return {"sample": sample, "annotations": {}} # Placeholder
def adversarial_testing(self) -> dict:
"""Test robustness."""
generator = AdversarialItemGenerator()
adversarial_items = []
for item in self.dataset[:20]:
adversarial_items.extend(generator.generate_boundary_cases(item))
return {"n_adversarial": len(adversarial_items)}
Capstone Connection
Your sycophancy evaluation needs rigorous quality control:
Minimum requirements for Milestone 4:
- Inter-annotator agreement (Kappa > 0.6) on a sample of items
- Adversarial testing for position/length bias
- Complete datasheet documenting your dataset
- Quality metrics computed and reported
# Your capstone quality checklist
capstone_quality_requirements = {
"format_validation": "100% of items pass schema validation",
"llm_validation": "80%+ pass LLM quality checks",
"human_agreement": "Fleiss Kappa > 0.6 on 50+ item sample",
"adversarial_robustness": "Position bias < 10%, length correlation < 0.3",
"documentation": "Complete datasheet following template",
"diversity": "Coverage across all 4 sycophancy levels"
}
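A small sketch of checking your measured numbers (hypothetical values here) against those thresholds before submission:

# Hypothetical measured values; replace with numbers from your own quality report
measured = {
    "llm_pass_rate": 0.86,
    "fleiss_kappa": 0.64,
    "position_bias": 0.07,
    "length_correlation": 0.21,
}
checks = {
    "llm_validation": measured["llm_pass_rate"] >= 0.80,
    "human_agreement": measured["fleiss_kappa"] > 0.6,
    "position_bias": measured["position_bias"] < 0.10,
    "length_bias": abs(measured["length_correlation"]) < 0.3,
}
for name, ok in checks.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")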
🎓 Tyla's Exercise
Prove that Cohen's Kappa corrects for chance agreement. If two annotators independently assign binary labels at random, each with P(label=1) = 0.7, what is their expected raw agreement? What is the expected Kappa?
When is Fleiss' Kappa preferable to computing pairwise Cohen's Kappa and averaging? Derive the conditions under which these give different results.
Adversarial validation assumes the model might "cheat" by exploiting shortcuts. But what if shortcuts are legitimate signals? Formalize the distinction between valid and invalid correlations in evaluation data.
💻 Aaliyah's Exercise
Build a complete quality control system:
from dataclasses import dataclass
from typing import Callable
@dataclass
class QualityCheck:
name: str
check_fn: Callable[[dict], bool]
severity: str # "error" or "warning"
class DatasetQualityController:
def __init__(self):
self.checks = []
def add_check(self, check: QualityCheck):
self.checks.append(check)
def validate_item(self, item: dict) -> dict:
"""
Run all checks on an item.
Return {passed: bool, errors: [], warnings: []}
"""
pass
def validate_dataset(self, dataset: list[dict]) -> dict:
"""
Run all checks on dataset.
Return aggregate quality report.
"""
pass
def compute_irr(self, annotations: dict[str, list[int]]) -> dict:
"""
Compute inter-rater reliability metrics.
annotations: {item_id: [annotator_1_rating, annotator_2_rating, ...]}
Return {raw_agreement, fleiss_kappa, interpretation}
"""
pass
def analyze_disagreements(self, annotations: dict, items: list[dict]) -> dict:
"""
Find patterns in annotator disagreements.
Return analysis with recommendations.
"""
pass
def run_adversarial_checks(self, dataset: list[dict], model) -> dict:
"""
Test for position bias, length bias, keyword reliance.
Return robustness report.
"""
pass
def generate_datasheet(self, dataset: list[dict], quality_report: dict) -> str:
"""
Generate complete documentation for dataset.
"""
pass
# Define your quality checks
controller = DatasetQualityController()
controller.add_check(QualityCheck(
name="has_required_fields",
check_fn=lambda item: all(k in item for k in ["id", "scenario", "user_statement"]),
severity="error"
))
controller.add_check(QualityCheck(
name="reasonable_length",
check_fn=lambda item: 50 < len(item.get("scenario", "")) < 2000,
severity="warning"
))
# Add more checks...
# Run validation
report = controller.validate_dataset(your_dataset)
print(f"Pass rate: {report['pass_rate']:.1%}")
print(f"Common errors: {report['common_errors']}")
📚 Maneesha's Reflection
Inter-rater reliability measures whether annotators agree, not whether they're correct. High agreement on wrong labels is worse than low agreement. How would you design annotation to surface cases where everyone might be wrong?
Adversarial examples reveal model weaknesses, but they can also reveal evaluation weaknesses. If a model "fails" an adversarial test, when should we fix the model versus fix the evaluation?
Dataset documentation (datasheets, model cards) is meant to improve transparency. But documentation can also be performative—creating an appearance of rigor without the substance. What would genuine accountability for dataset quality look like?