Pairwise A/B Detection Leaderboard

28 human-validated items. For each item, a model is shown two experiment designs (A and B) and must choose which one contains a methodological flaw. Each model attempts each item 3 times, giving 28 × 3 × 4 = 336 evaluations across the 4 models.
| Model | Accuracy | Correct |
|---|---|---|
| Gemini 3.1 Pro | 95.2% | 80/84 |
| GPT 5.4 | 90.5% | 76/84 |
| Sonnet 4.5 | 88.1% | 74/84 |
| Opus 4.6 | 75.0% | 63/84 |
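The leaderboard figures above follow from the per-item trial counts: each model runs 3 trials on each of the 28 items (84 trials total), and accuracy is correct trials over 84. A minimal sketch of that arithmetic (the helper name and example score list are illustrative, not part of the benchmark code):

```python
def accuracy(trials_correct, trials_per_item=3):
    """Return (correct, total, percent) from a model's per-item correct-trial counts."""
    correct = sum(trials_correct)
    total = len(trials_correct) * trials_per_item
    return correct, total, round(100 * correct / total, 1)

# A model scoring 3/3 on 25 items, 2/3 on two items, and 1/3 on one item
# (the distribution of the top entry above):
scores = [3] * 25 + [2, 2, 1]
print(accuracy(scores))  # -> (80, 84, 95.2)
```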
| # | Item | Flaw type | Gemini 3.1 Pro | GPT 5.4 | Sonnet 4.5 | Opus 4.6 |
|---|---|---|---|---|---|---|
| 1 | Classification with Conceptual Safeguards | strawman_null | 3/3 | 3/3 | 3/3 | 3/3 |
| 2 | Attention as a Hypernetwork | confounds | 3/3 | 3/3 | 3/3 | 1/3 |
| 3 | LLM DNA: Tracing Model Evolution via Functional Representations | confounds | 3/3 | 3/3 | 1/3 | 1/3 |
| 4 | CauKer: classification time series foundation models can be pretrained on synthetic data only | confounds | 3/3 | 3/3 | 3/3 | 3/3 |
| 5 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts | data_leakage | 3/3 | 3/3 | 2/3 | 3/3 |
| 6 | Quantifying Elicitation of Latent Capabilities in Language Models | selection_bias | 3/3 | 3/3 | 3/3 | 3/3 |
| 7 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts | construct_validity | 2/3 | 1/3 | 1/3 | 1/3 |
| 8 | Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training | data_leakage | 3/3 | 3/3 | 3/3 | 3/3 |
| 9 | Strategy Coopetition Explains the Emergence and Transience of In-Context Learning | confounds | 3/3 | 3/3 | 1/3 | 2/3 |
| 10 | Many-shot Jailbreaking | metric_mismatch | 3/3 | 2/3 | 3/3 | 3/3 |
| 11 | Inverse Scaling in Test-Time Compute | selection_bias | 3/3 | 3/3 | 3/3 | 3/3 |
| 12 | Compositional Diffusion with Guided Search for Long-Horizon Planning | data_leakage | 3/3 | 2/3 | 3/3 | 1/3 |
| 13 | TRACE: Your Diffusion Model is Secretly an Instance Edge Detector | data_leakage | 3/3 | 3/3 | 3/3 | 1/3 |
| 14 | Differential Transformer | confounds | 2/3 | 3/3 | 3/3 | 1/3 |
| 15 | Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning | data_leakage | 3/3 | 3/3 | 3/3 | 3/3 |
| 16 | OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability | statistical_misuse | 3/3 | 2/3 | 3/3 | 2/3 |
| 17 | Classification with Conceptual Safeguards | strawman_null | 3/3 | 3/3 | 3/3 | 3/3 |
| 18 | Is it thinking or cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort | data_leakage | 3/3 | 3/3 | 3/3 | 1/3 |
| 19 | Participatory Personalization in Classification | data_leakage | 3/3 | 3/3 | 3/3 | 2/3 |
| 20 | SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents | selection_bias | 3/3 | 3/3 | 3/3 | 1/3 |
| 21 | Adaptive Surrogate Gradients for Sequential Reinforcement Learning in Spiking Neural Networks | confounds | 3/3 | 3/3 | 3/3 | 3/3 |
| 22 | On Scaling Up 3D Gaussian Splatting Training | confounds | 3/3 | 3/3 | 3/3 | 3/3 |
| 23 | Reasoning Models Don't Always Say What They Think | confounds | 3/3 | 3/3 | 3/3 | 3/3 |
| 24 | Sufficient Context: A New Lens on Retrieval Augmented Generation Systems | data_leakage | 3/3 | 3/3 | 3/3 | 3/3 |
| 25 | Inverse Scaling in Test-Time Compute | confounds | 3/3 | 3/3 | 3/3 | 3/3 |
| 26 | Equivalence is All: A Unified View for Self-supervised Graph Learning | confounds | 1/3 | 0/3 | 0/3 | 1/3 |
| 27 | Sabotage Evaluations for Frontier Models | confounds | 3/3 | 3/3 | 3/3 | 3/3 |
| 28 | Many-shot Jailbreaking | circular_reasoning | 3/3 | 3/3 | 3/3 | 3/3 |