Tampered Experiment Eval

Can frontier LLMs detect methodological flaws injected into experiment designs? 28 human-validated items across multiple flaw categories.

Pairwise A/B

7 models

Models choose which of two experiment designs is more methodologically sound. 33 items, up to 6 repetitions each.
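Because items can have different numbers of repetitions, one reasonable way to summarize a model's pairwise result is to average within each item first and then across items, so that items with more repetitions do not dominate. The sketch below illustrates that aggregation only; it is not the eval's confirmed scoring code, and the function name `pairwise_accuracy`, the `(item_id, chose_sound)` record shape, and the sample data are all hypothetical.

```python
from collections import defaultdict


def pairwise_accuracy(trials):
    """Macro-averaged accuracy for one model (hypothetical helper).

    trials: iterable of (item_id, chose_sound) pairs, one per repetition,
    where chose_sound is True if the model picked the sound design.
    """
    per_item = defaultdict(list)
    for item_id, chose_sound in trials:
        per_item[item_id].append(1.0 if chose_sound else 0.0)
    if not per_item:
        raise ValueError("no trials to score")
    # Average within each item first, then across items, so an item
    # with 6 repetitions counts the same as an item with 1.
    item_means = [sum(v) / len(v) for v in per_item.values()]
    return sum(item_means) / len(item_means)


# Made-up example data: three items with unequal repetition counts.
trials = [
    ("item-01", True), ("item-01", True), ("item-01", False),
    ("item-02", True),
    ("item-03", False), ("item-03", False),
]
score = pairwise_accuracy(trials)
print(f"{score:.3f}")
```

A plain mean over all trials would weight item-01 three times as heavily as item-02; whether that is desirable depends on how the repetitions were collected.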

Triage Detection

7 models

Models produce 3-tier severity reports on single experiments. Scored by the V3 semantic judge.