MethodSpot — Validated 59

We extract experiments from published ML papers and rewrite them as proposals for a research question. A model injects a methodological flaw into the proposal. Given the research context, models choose which of two designs — original vs tampered — is more methodologically sound. Models are not told that one design contains a flaw.

59 items 8 models 40 papers 12 reps per item Human-validated flaws

Gemini 3.1 Pro (high)

79.2%

[69.9, 87.7]

GPT 5.4 (high)

68.8%

[58.9, 78.1]

GPT 5.2 (high)

58.8%

[48.7, 68.8]

Sonnet 4.6 (60K)

53.7%

[43.9, 63.0]

Opus 4.7 (xhigh)

52.3%

[42.5, 62.0]

Sonnet 4.5 (60K)

52.0%

[42.1, 61.9]

Opus 4.5 (60K)

49.9%

[40.7, 59.2]

Opus 4.6 (64K)

42.7%

[34.6, 51.0]

CIs are bootstrapped over 59 items. Models with overlapping CIs may not be statistically distinguishable.

Model (thinking budget)	Accuracy	Item CI (?)Bootstrap CI over items (n=33). Per-item accuracy averaged over 12 reps, then resampled 10k times. Measures uncertainty from which items are in the set.	Run CI (?)t-interval over 12 shuffled runs. Each run takes one random rep per item (mixed positions). Measures evaluation stability / reproducibility on this fixed benchmark.	Position Bias	Thinking Tokens
Gemini 3.1 Pro (high)	79.2%	[70.1%, 87.4%]	[78.1%, 80.4%]	A: 81.6% [73,90] · B: 76.8% [66,86]	852 med · 877 mean
GPT 5.4 (high)	68.8%	[59.3%, 78.2%]	[66.3%, 71.3%]	A: 78.5% [69,88] · B: 59.0% [48,69] ▲19	952 med · 1,117 mean
GPT 5.2 (high)	58.8%	[48.7%, 68.8%]	[55.9%, 61.6%]	A: 68.4% [57,78] · B: 49.2% [38,60] ▲19	436 med · 470 mean
Sonnet 4.6 (60K)	53.7%	[44.1%, 63.4%]	[51.3%, 56.0%]	A: 63.8% [53,75] · B: 43.5% [33,54] ▲20	1,012 med · 1,247 mean
Opus 4.7 (xhigh)	52.3%	[42.7%, 62.0%]	[49.7%, 54.8%]	A: 66.9% [56,77] · B: 37.6% [27,49] ▲29	202 med · 265 mean
Sonnet 4.5 (60K)	52.0%	[42.1%, 61.9%]	[49.4%, 54.6%]	A: 52.0% [42,62] · B: 52.0% [41,63]	1,879 med · 2,210 mean
Opus 4.5 (60K)	49.9%	[40.8%, 58.9%]	[47.7%, 52.0%]	A: 66.1% [56,76] · B: 33.6% [23,44] ▲32	1,402 med · 1,755 mean
Opus 4.6 (64K)	42.7%	[34.6%, 50.8%]	[40.4%, 44.9%]	A: 65.5% [55,76] · B: 19.8% [11,29] ▲46	687 med · 1,003 mean
	0 3,000 6,000 9,000

12/12

9-11

4-8

1-3

0/12

Model (thinking budget)	Accuracy (95% CI) (?)Bootstrap CI over 59 items, not over individual reps.	Thinking (tokens)	1 Ask a Strong LLM Judge when Your Reward Model is Uncertain confounds	2 Ask a Strong LLM Judge when Your Reward Model is Uncertain confounds	3 Attention as a Hypernetwork confounds	4 Attention as a Hypernetwork confounds	5 Auditing Differentially Private Machine Learning: How Private is Private SGD? data_leakage	6 Classification with Conceptual Safeguards strawman_null	7 Classification with Conceptual Safeguards strawman_null	8 Counterfactual Memorization in Neural Language Models data_leakage	9 Differential Transformer confounds	10 Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning confounds	11 Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning confounds	12 Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning data_leakage	13 Equivalence is All: A Unified View for Self-supervised Graph Learning confounds	14 TRACE: Your Diffusion Model is Secretly an Instance Edge Detector data_leakage	15 Is it thinking or cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort data_leakage	16 Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training data_leakage	17 Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts data_leakage	18 Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts construct_validity	19 LLM DNA: Tracing Model Evolution via Functional Representations confounds	20 True Self-Supervised Novel View Synthesis is Transferable construct_validity	21 Compositional Diffusion with Guided Search for Long-Horizon Planning data_leakage	22 OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability statistical_misuse	23 Hallucination Begins Where Saliency Drops confounds	24 LLM Fingerprinting via Semantically Conditioned Watermarks data_leakage	25 LLM Fingerprinting via Semantically Conditioned Watermarks construct_validity	26 CauKer: classification time series foundation models can be pretrained on synthetic data only confounds	27 IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning aggregation_bias	28 IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning metric_mismatch	29 Inverse Scaling in Test-Time Compute construct_validity	30 Inverse Scaling in Test-Time Compute confounds	31 Quantifying Elicitation of Latent Capabilities in Language Models selection_bias	32 On the Effectiveness of Low Frequency Perturbations confounds	33 On the Effectiveness of Low Frequency Perturbations confounds	34 On the Effectiveness of Low Frequency Perturbations confounds	35 Many-shot Jailbreaking metric_mismatch	36 Many-shot Jailbreaking circular_reasoning	37 Measuring Forgetting of Memorized Training Examples confounds	38 MMA Training: Direct Input Space Margin Maximization through Adversarial Training robustness_theater	39 MMA Training: Direct Input Space Margin Maximization through Adversarial Training robustness_theater	40 MMA Training: Direct Input Space Margin Maximization through Adversarial Training selection_bias	41 More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness confounds	42 Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models confounds	43 Participatory Personalization in Classification data_leakage	44 A Probabilistic Perspective on Unlearning and Alignment for Large Language Models confounds	45 Reasoning Models Don't Always Say What They Think confounds	46 Scalable Extraction of Training Data from (Production) Language Models data_leakage	47 On the Sensitivity of Adversarial Robustness to Input Data Distributions confounds	48 On the Sensitivity of Adversarial Robustness to Input Data Distributions confounds	49 SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents selection_bias	50 SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents selection_bias	51 SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues data_leakage	52 Students Parrot Their Teachers: Membership Inference on Model Distillation circular_reasoning	53 Students Parrot Their Teachers: Membership Inference on Model Distillation construct_validity	54 Sufficient Context: A New Lens on Retrieval Augmented Generation Systems data_leakage	55 Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models selection_bias	56 Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models confounds	57 Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models strawman_null	58 ValueNet: A New Dataset for Human Value Driven Dialogue System data_leakage	59 Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning data_leakage
Gemini 3.1 Pro (high)	79%	852	12	0	12	3	12	12	12	12	11	2	0	12	2	12	12	12	10	3	2	12	3	12	12	12	12	12	12	12	12	12	8	12	7	11	12	12	8	9	12	12	0	11	12	1	12	8	0	12	12	12	6	12	12	12	12	12	12	12	12
GPT 5.4 (high)	69%	952	12	0	10	2	12	12	12	0	10	10	6	12	2	12	7	12	3	0	10	12	6	12	12	12	12	12	8	0	11	12	8	2	5	6	12	11	5	9	12	8	0	9	12	0	12	12	0	0	12	10	12	11	12	12	11	1	6	12	12
GPT 5.2 (high)	59%	436	5	0	12	0	11	7	12	0	10	9	7	12	0	10	3	12	2	0	8	12	6	12	8	10	6	12	4	0	2	8	4	2	3	0	8	11	0	12	11	10	0	8	12	0	12	12	0	3	12	11	12	12	12	12	9	3	1	12	12
Sonnet 4.6 (60K)	54%	1,012	7	3	3	0	2	9	12	0	9	6	1	7	0	6	2	6	0	0	5	12	6	12	12	12	10	12	6	1	12	9	1	12	0	1	12	8	0	12	4	12	1	0	8	2	12	12	0	7	7	12	8	6	1	12	10	7	8	11	12
Opus 4.7 (xhigh)	52%	202	2	0	1	2	10	6	6	0	9	12	8	6	0	0	12	6	0	0	3	12	5	12	12	12	12	12	1	3	11	6	1	0	0	3	12	6	7	12	12	12	0	6	6	0	8	7	0	7	11	12	8	12	6	5	7	5	0	12	12
Sonnet 4.5 (60K)	52%	1,879	5	1	3	0	1	12	12	0	10	9	11	5	1	3	8	12	2	0	4	12	2	12	12	12	6	12	5	0	5	8	6	4	0	11	12	4	0	0	0	12	2	3	3	1	10	10	0	3	12	12	12	7	9	12	11	3	0	12	12
Opus 4.5 (60K)	50%	1,402	3	0	1	3	2	8	10	0	7	7	6	1	0	0	7	6	4	4	6	12	4	12	12	12	11	12	6	6	6	10	4	4	0	6	12	6	0	12	1	10	1	0	0	0	9	12	0	4	12	12	12	6	8	6	9	6	0	11	12
Opus 4.6 (64K)	43%	687	3	0	4	1	3	6	10	0	6	6	6	6	0	5	6	4	6	5	2	12	1	12	12	7	6	12	0	6	9	6	4	5	0	6	9	1	1	12	4	6	0	6	0	0	7	12	0	3	0	6	12	0	5	6	7	6	1	12	9
		0 3,000 6,000 9,000

Position bias statistic: (A-choices when orig=A + A-choices when orig=B − N/2) / (N/2). Zero means no bias. Positive values (red) indicate the model prefers whichever design is in position A, regardless of content. Negative values (blue) indicate B-preference. Cells with |bias| < 0.15 are left blank.

A-preference

No bias

B-preference

Orig=A acc

Orig=B acc

Model (thinking budget)

Orig=A / Orig=B (?)Blue dot = accuracy when original in position A. Orange dot = accuracy when original in position B. Bootstrap CIs over 59 items.

Gemini 3.1 Pro (high)

82%77%

+0.2

-0.2

+0.3

+0.5

+0.3

+0.5

-0.3

-0.8

+0.2

-0.5

+0.2

+0.7

+1.0

GPT 5.4 (high)

79%59%

+0.3

+1.0

+0.3

+0.2

+0.5

+0.3

+1.0

+0.7

+0.2

+0.7

+0.3

+0.8

+0.7

+0.2

+0.8

+0.5

+0.3

+0.5

+0.3

+0.2

-0.2

+1.0

GPT 5.2 (high)

68%49%

+0.5

-0.2

+0.8

+0.3

+0.2

+0.3

+0.2

+0.3

+0.7

+0.3

+0.7

+0.3

+1.0

+0.3

+0.7

+0.3

+0.5

+0.7

+0.2

+0.3

+0.7

+0.2

+0.5

+0.2

Sonnet 4.6 (60K)

64%44%

-0.5

+0.2

+0.3

+0.5

+1.0

+0.2

+0.8

+1.0

+0.8

+1.0

+0.3

+1.0

+0.2

+0.5

-0.2

+0.7

+0.2

-0.3

+0.8

-0.2

+0.7

-0.7

+0.2

+0.3

+0.8

+0.7

+0.2

Opus 4.7 (xhigh)

67%38%

-0.3

+0.2

+0.3

+1.0

+0.5

+0.7

+0.3

+1.0

+0.5

+0.8

+0.2

+0.5

+0.2

+1.0

+0.2

+0.5

+1.0

+0.8

+1.0

-0.3

+0.5

+0.2

+0.7

+1.0

+0.8

+0.5

+0.8

Sonnet 4.5 (60K)

52%52%

-0.8

-0.2

+0.2

-0.5

+0.2

-0.2

-0.7

+0.3

+0.7

+0.2

-0.2

+0.7

+0.3

-0.2

+0.3

+0.5

-0.2

+0.3

+0.5

-0.8

-0.2

-0.5

Opus 4.5 (60K)

66%34%

-0.2

+0.2

+0.5

+0.3

+0.7

+0.3

+0.8

+0.7

+0.2

+0.8

+1.0

+0.7

+0.2

+0.7

+1.0

+0.7

+0.3

+0.7

+1.0

+0.2

+0.3

+0.2

+0.7

+1.0

+0.7

+1.0

+0.5

+1.0

-0.2

Opus 4.6 (64K)

66%20%

-0.2

+0.7

+0.2

+0.5

+1.0

+0.3

+1.0

+0.8

+1.0

+0.7

+1.0

+0.8

+0.3

+0.2

+0.8

+1.0

-0.5

+1.0

+0.3

+0.8

+1.0

+0.5

+0.2

+0.7

+1.0

+0.8

+0.5

+1.0

+0.8

+1.0

+0.8

+1.0

+0.2

+0.5

Click a cell in the grid above to view model responses and reasoning.

MethodSpot — Validated 59 — Pairwise