MethodSpot — Validated 59

Can frontier LLMs detect methodological flaws injected into experiment designs? 59 human-validated items across multiple flaw categories.

Pairwise A/B

8 models

Models choose which of two experiment designs is more methodologically sound. Repetitions are position-balanced, so each design appears equally often in the A and B slots.
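A minimal sketch of how position-balanced scoring could work, assuming each item is one sound design paired with one flawed design. The `choose` callable, the `"A"`/`"B"` return convention, and the `reps_per_order` parameter are hypothetical names for illustration, not the benchmark's actual harness.

```python
from typing import Callable

# Hypothetical chooser: receives (design shown as A, design shown as B)
# and returns "A" or "B" for the design it judges more sound.
Chooser = Callable[[str, str], str]


def position_balanced_accuracy(
    sound: str, flawed: str, choose: Chooser, reps_per_order: int = 1
) -> float:
    """Fraction of trials where the sound design is picked, with the
    sound design shown equally often in slot A and in slot B."""
    correct = 0
    total = 0
    for _ in range(reps_per_order):
        # Order 1: the sound design occupies slot A.
        if choose(sound, flawed) == "A":
            correct += 1
        # Order 2: the sound design occupies slot B.
        if choose(flawed, sound) == "B":
            correct += 1
        total += 2
    return correct / total


def always_a(design_a: str, design_b: str) -> str:
    """A chooser with pure position bias: it ignores the content."""
    return "A"


def keyword_oracle(design_a: str, design_b: str) -> str:
    """Toy chooser that rejects whichever design mentions 'no control'."""
    return "B" if "no control" in design_a else "A"
```

Under this scheme a chooser that always answers "A" scores exactly chance, which is the point of balancing: position bias alone cannot raise accuracy above 0.5.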

Triage Detection

8 models

Models produce 3-tier severity reports on single experiments. Reports are scored by the V3 semantic judge.
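One way to picture a 3-tier report is as findings grouped by tier, sketched below. The tier labels, the `TriageReport` class, and the `naive_match` substring check are all assumptions made for illustration; in particular, `naive_match` is only a stand-in and does not reproduce the V3 semantic judge.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical tier labels, ordered from most to least severe.
TIERS = ("tier_1", "tier_2", "tier_3")


@dataclass
class TriageReport:
    """Findings a model reported for one experiment, grouped by tier."""
    findings: Dict[str, List[str]] = field(
        default_factory=lambda: {tier: [] for tier in TIERS}
    )


def naive_match(finding: str, flaw_keyword: str) -> bool:
    # Stand-in for a semantic judge: case-insensitive substring match.
    return flaw_keyword.lower() in finding.lower()


def detected_tier(report: TriageReport, flaw_keyword: str) -> Optional[str]:
    """Return the most severe tier that mentions the injected flaw,
    or None if no finding in the report matches it."""
    for tier in TIERS:
        if any(naive_match(f, flaw_keyword) for f in report.findings[tier]):
            return tier
    return None
```

For example, a report that lists "No randomization of treatment assignment" under `tier_2` would return `"tier_2"` for the keyword `"randomization"` and `None` for a flaw it never mentions.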