Tampered Experiment Eval

Can frontier LLMs detect methodological flaws injected into experiment designs? The benchmark comprises 28 human-validated items spanning multiple flaw categories.

Pairwise A/B (v1)

4 models

Items as rows, models as columns. Click cells for detailed reasoning.

Triage Detection (v1)

5 models

Items as rows, models as columns. Scored by the V3 semantic judge.

Pairwise A/B (v2)

4 models

MathArena-style: models as rows, items as columns. Compact color grid.

Triage Detection (v2)

5 models

MathArena-style: models as rows, items as columns. Compact color grid.
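
The v1 and v2 views show the same kind of results in transposed layouts: v1 puts items on rows and models on columns, while v2 puts models on rows and items on columns. A minimal sketch of that relationship follows. The record format, the item and model names, and the `build_grid` helper are all hypothetical and for illustration only; they are not the dashboard's actual data or code.

```python
# Hypothetical records: (item_id, model_name, detected). Illustrative values only.
records = [
    ("item-01", "model-a", True),
    ("item-01", "model-b", False),
    ("item-01", "model-c", True),
    ("item-02", "model-a", False),
    ("item-02", "model-b", False),
    # ("item-02", "model-c") is intentionally missing to show an empty cell.
]


def build_grid(records, rows_are_items=True):
    """Return (row_labels, col_labels, cells).

    cells[r][c] is True, False, or None when no result exists for that pair.
    """
    items = sorted({item for item, _, _ in records})
    models = sorted({model for _, model, _ in records})
    lookup = {(item, model): ok for item, model, ok in records}

    if rows_are_items:
        # v1-style layout: items as rows, models as columns.
        cells = [[lookup.get((i, m)) for m in models] for i in items]
        return items, models, cells

    # v2-style layout: models as rows, items as columns.
    cells = [[lookup.get((i, m)) for i in items] for m in models]
    return models, items, cells


v1_rows, v1_cols, v1_cells = build_grid(records, rows_are_items=True)
v2_rows, v2_cols, v2_cells = build_grid(records, rows_are_items=False)
```

Under these assumptions, `v2_cells` is the transpose of `v1_cells`, so both layouts can be rendered from one result set and differ only in which axis is scanned first.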