We extract experiments from published ML papers and rewrite each as a proposal addressing a research question. A model then injects a methodological flaw into the proposal, producing a tampered variant. Given the research context, the evaluated models choose which of two designs, the original or the tampered one, is more methodologically sound. They are not told that one design contains a flaw.
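A minimal sketch of how one forced-choice rep could be scored under this setup; the `ask_model` helper and the exact prompt wording are illustrative assumptions, not the actual harness.

```python
import random

# Hypothetical helper: stands in for whatever chat-completion call the
# harness actually uses; it is not part of the source.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def score_rep(context: str, original_design: str, tampered_design: str) -> bool:
    """Score one forced-choice rep: show both designs in random order and
    count the rep as correct if the model picks the original (unflawed) one."""
    designs = [("original", original_design), ("tampered", tampered_design)]
    random.shuffle(designs)  # avoid position bias
    prompt = (
        f"Research context:\n{context}\n\n"
        f"Design A:\n{designs[0][1]}\n\n"
        f"Design B:\n{designs[1][1]}\n\n"
        "Which design is more methodologically sound? Answer 'A' or 'B'."
    )
    answer = ask_model(prompt).strip().upper()
    picked = designs[0] if answer.startswith("A") else designs[1]
    return picked[0] == "original"
```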
| Model (thinking budget) | Accuracy | Thinking (tokens) |
|---|---|---|
| Gemini 3.1 Pro (high) | 79% | 852 |
| GPT 5.4 (high) | 69% | 952 |
| GPT 5.2 (high) | 59% | 436 |
| Sonnet 4.6 (60K) | 54% | 1,012 |
| Opus 4.7 (xhigh) | 52% | 202 |
| Sonnet 4.5 (60K) | 52% | 1,879 |
| Opus 4.5 (60K) | 50% | 1,402 |
| Opus 4.6 (64K) | 43% | 687 |

95% bootstrap confidence intervals are computed over the 59 items, not over individual reps.

Per-item breakdown (number of correct choices out of 12 reps per item):

| # | Paper | Injected flaw | Gemini 3.1 Pro | GPT 5.4 | GPT 5.2 | Sonnet 4.6 | Opus 4.7 | Sonnet 4.5 | Opus 4.5 | Opus 4.6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ask a Strong LLM Judge when Your Reward Model is Uncertain | confounds | 12 | 12 | 5 | 7 | 2 | 5 | 3 | 3 |
| 2 | Ask a Strong LLM Judge when Your Reward Model is Uncertain | confounds | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 |
| 3 | Attention as a Hypernetwork | confounds | 12 | 10 | 12 | 3 | 1 | 3 | 1 | 4 |
| 4 | Attention as a Hypernetwork | confounds | 3 | 2 | 0 | 0 | 2 | 0 | 3 | 1 |
| 5 | Auditing Differentially Private Machine Learning: How Private is Private SGD? | data_leakage | 12 | 12 | 11 | 2 | 10 | 1 | 2 | 3 |
| 6 | Classification with Conceptual Safeguards | strawman_null | 12 | 12 | 7 | 9 | 6 | 12 | 8 | 6 |
| 7 | Classification with Conceptual Safeguards | strawman_null | 12 | 12 | 12 | 12 | 6 | 12 | 10 | 10 |
| 8 | Counterfactual Memorization in Neural Language Models | data_leakage | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | Differential Transformer | confounds | 11 | 10 | 10 | 9 | 9 | 10 | 7 | 6 |
| 10 | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | confounds | 2 | 10 | 9 | 6 | 12 | 9 | 7 | 6 |
| 11 | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | confounds | 0 | 6 | 7 | 1 | 8 | 11 | 6 | 6 |
| 12 | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | data_leakage | 12 | 12 | 12 | 7 | 6 | 5 | 1 | 6 |
| 13 | Equivalence is All: A Unified View for Self-supervised Graph Learning | confounds | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 14 | TRACE: Your Diffusion Model is Secretly an Instance Edge Detector | data_leakage | 12 | 12 | 10 | 6 | 0 | 3 | 0 | 5 |
| 15 | Is it thinking or cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort | data_leakage | 12 | 7 | 3 | 2 | 12 | 8 | 7 | 6 |
| 16 | Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training | data_leakage | 12 | 12 | 12 | 6 | 6 | 12 | 6 | 4 |
| 17 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts | data_leakage | 10 | 3 | 2 | 0 | 0 | 2 | 4 | 6 |
| 18 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts | construct_validity | 3 | 0 | 0 | 0 | 0 | 0 | 4 | 5 |
| 19 | LLM DNA: Tracing Model Evolution via Functional Representations | confounds | 2 | 10 | 8 | 5 | 3 | 4 | 6 | 2 |
| 20 | True Self-Supervised Novel View Synthesis is Transferable | construct_validity | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| 21 | Compositional Diffusion with Guided Search for Long-Horizon Planning | data_leakage | 3 | 6 | 6 | 6 | 5 | 2 | 4 | 1 |
| 22 | OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability | statistical_misuse | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| 23 | Hallucination Begins Where Saliency Drops | confounds | 12 | 12 | 8 | 12 | 12 | 12 | 12 | 12 |
| 24 | LLM Fingerprinting via Semantically Conditioned Watermarks | data_leakage | 12 | 12 | 10 | 12 | 12 | 12 | 12 | 7 |
| 25 | LLM Fingerprinting via Semantically Conditioned Watermarks | construct_validity | 12 | 12 | 6 | 10 | 12 | 6 | 11 | 6 |
| 26 | CauKer: classification time series foundation models can be pretrained on synthetic data only | confounds | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| 27 | IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | aggregation_bias | 12 | 8 | 4 | 6 | 1 | 5 | 6 | 0 |
| 28 | IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | metric_mismatch | 12 | 0 | 0 | 1 | 3 | 0 | 6 | 6 |
| 29 | Inverse Scaling in Test-Time Compute | construct_validity | 12 | 11 | 2 | 12 | 11 | 5 | 6 | 9 |
| 30 | Inverse Scaling in Test-Time Compute | confounds | 12 | 12 | 8 | 9 | 6 | 8 | 10 | 6 |
| 31 | Quantifying Elicitation of Latent Capabilities in Language Models | selection_bias | 8 | 8 | 4 | 1 | 1 | 6 | 4 | 4 |
| 32 | On the Effectiveness of Low Frequency Perturbations | confounds | 12 | 2 | 2 | 12 | 0 | 4 | 4 | 5 |
| 33 | On the Effectiveness of Low Frequency Perturbations | confounds | 7 | 5 | 3 | 0 | 0 | 0 | 0 | 0 |
| 34 | On the Effectiveness of Low Frequency Perturbations | confounds | 11 | 6 | 0 | 1 | 3 | 11 | 6 | 6 |
| 35 | Many-shot Jailbreaking | metric_mismatch | 12 | 12 | 8 | 12 | 12 | 12 | 12 | 9 |
| 36 | Many-shot Jailbreaking | circular_reasoning | 12 | 11 | 11 | 8 | 6 | 4 | 6 | 1 |
| 37 | Measuring Forgetting of Memorized Training Examples | confounds | 8 | 5 | 0 | 0 | 7 | 0 | 0 | 1 |
| 38 | MMA Training: Direct Input Space Margin Maximization through Adversarial Training | robustness_theater | 9 | 9 | 12 | 12 | 12 | 0 | 12 | 12 |
| 39 | MMA Training: Direct Input Space Margin Maximization through Adversarial Training | robustness_theater | 12 | 12 | 11 | 4 | 12 | 0 | 1 | 4 |
| 40 | MMA Training: Direct Input Space Margin Maximization through Adversarial Training | selection_bias | 12 | 8 | 10 | 12 | 12 | 12 | 10 | 6 |
| 41 | More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness | confounds | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 |
| 42 | Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models | confounds | 11 | 9 | 8 | 0 | 6 | 3 | 0 | 6 |
| 43 | Participatory Personalization in Classification | data_leakage | 12 | 12 | 12 | 8 | 6 | 3 | 0 | 0 |
| 44 | A Probabilistic Perspective on Unlearning and Alignment for Large Language Models | confounds | 1 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
| 45 | Reasoning Models Don't Always Say What They Think | confounds | 12 | 12 | 12 | 12 | 8 | 10 | 9 | 7 |
| 46 | Scalable Extraction of Training Data from (Production) Language Models | data_leakage | 8 | 12 | 12 | 12 | 7 | 10 | 12 | 12 |
| 47 | On the Sensitivity of Adversarial Robustness to Input Data Distributions | confounds | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48 | On the Sensitivity of Adversarial Robustness to Input Data Distributions | confounds | 12 | 0 | 3 | 7 | 7 | 3 | 4 | 3 |
| 49 | SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents | selection_bias | 12 | 12 | 12 | 7 | 11 | 12 | 12 | 0 |
| 50 | SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents | selection_bias | 12 | 10 | 11 | 12 | 12 | 12 | 12 | 6 |
| 51 | SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues | data_leakage | 6 | 12 | 12 | 8 | 8 | 12 | 12 | 12 |
| 52 | Students Parrot Their Teachers: Membership Inference on Model Distillation | circular_reasoning | 12 | 11 | 12 | 6 | 12 | 7 | 6 | 0 |
| 53 | Students Parrot Their Teachers: Membership Inference on Model Distillation | construct_validity | 12 | 12 | 12 | 1 | 6 | 9 | 8 | 5 |
| 54 | Sufficient Context: A New Lens on Retrieval Augmented Generation Systems | data_leakage | 12 | 12 | 12 | 12 | 5 | 12 | 6 | 6 |
| 55 | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | selection_bias | 12 | 11 | 9 | 10 | 7 | 11 | 9 | 7 |
| 56 | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | confounds | 12 | 1 | 3 | 7 | 5 | 3 | 6 | 6 |
| 57 | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | strawman_null | 12 | 6 | 1 | 8 | 0 | 0 | 0 | 1 |
| 58 | ValueNet: A New Dataset for Human Value Driven Dialogue System | data_leakage | 12 | 12 | 12 | 11 | 12 | 12 | 11 | 12 |
| 59 | Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning | data_leakage | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 9 |
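The note above says the 95% intervals are bootstrapped over the 59 items rather than over individual reps. A minimal sketch of one way to do that, assuming 12 reps per item as in the table; the function name, resample count, and percentile method are illustrative choices, not the harness's actual code.

```python
import numpy as np

def item_bootstrap_ci(correct_per_item, reps_per_item=12, n_boot=10_000, seed=0):
    """95% bootstrap CI for accuracy, resampling items (not individual reps)
    with replacement and averaging per-item accuracies in each resample."""
    rng = np.random.default_rng(seed)
    item_acc = np.asarray(correct_per_item, dtype=float) / reps_per_item
    boot_means = [
        rng.choice(item_acc, size=item_acc.size, replace=True).mean()
        for _ in range(n_boot)
    ]
    return item_acc.mean(), np.percentile(boot_means, [2.5, 97.5])

# Example with the Opus 4.6 column from the per-item table above.
opus_46 = [3, 0, 4, 1, 3, 6, 10, 0, 6, 6, 6, 6, 0, 5, 6, 4, 6, 5, 2, 12,
           1, 12, 12, 7, 6, 12, 0, 6, 9, 6, 4, 5, 0, 6, 9, 1, 1, 12, 4, 6,
           0, 6, 0, 0, 7, 12, 0, 3, 0, 6, 12, 0, 5, 6, 7, 6, 1, 12, 9]
mean_acc, (lo, hi) = item_bootstrap_ci(opus_46)
print(f"accuracy = {mean_acc:.0%}, 95% CI = [{lo:.0%}, {hi:.0%}]")
```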