MethodSpot — Validated 59

This benchmark uses the same injected flaws as the pairwise task but poses a different one: given only the tampered experiment (no clean comparison), models list all flaws and partition them by severity (Critical / Notable / Minor). An LLM judge (Sonnet 4.6) then checks whether the injected flaw appears in the model's Critical or Notable tiers; a flaw listed only under Minor does not count as detected.
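
The scoring rule above can be sketched as follows. This is a minimal illustration, not the benchmark's code: `JudgeVerdict`, `matched_tier`, and `detection_rate` are hypothetical names, and the semantic matching performed by the LLM judge is assumed to have already produced the tier in which the injected flaw was found (or `None` if it was not found at all).

```python
from dataclasses import dataclass
from typing import List, Optional

# Tiers that count as a detection, per the task description.
COUNTED_TIERS = {"Critical", "Notable"}


@dataclass
class JudgeVerdict:
    """Hypothetical record of one judged rep."""
    # Tier in which the judge matched the injected flaw, or None if unmatched.
    matched_tier: Optional[str]


def is_detected(verdict: JudgeVerdict) -> bool:
    """A rep is a hit only if the flaw landed in Critical or Notable."""
    return verdict.matched_tier in COUNTED_TIERS


def detection_rate(verdicts: List[JudgeVerdict]) -> float:
    """Fraction of judged reps that count as detections."""
    return sum(is_detected(v) for v in verdicts) / len(verdicts)


reps = [
    JudgeVerdict("Critical"),
    JudgeVerdict("Notable"),
    JudgeVerdict("Minor"),   # listed, but too low a tier to count
    JudgeVerdict(None),      # not listed at all
]
rate = detection_rate(reps)  # 2 of 4 reps count
```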

59 items · 8 models · 40 papers · 6 reps per item · V3 semantic judge (Sonnet 4.6)

Model (thinking budget)	Detection	95% CI
GPT 5.4 (high)	97.5%	[95.2, 98.7]
GPT 5.2 (high)	96.3%	[93.8, 97.8]
Opus 4.7 (xhigh)	95.8%	[90.5, 98.2]
Gemini 3.1 Pro (high)	91.8%	[88.5, 94.2]
Opus 4.6 (64K)	90.7%	[87.2, 93.3]
Sonnet 4.6 (60K)	90.1%	[86.6, 92.8]
Opus 4.5 (60K)	87.3%	[83.4, 90.4]
Sonnet 4.5 (60K)	85.6%	[81.6, 88.9]

Item CI: Bootstrap CI over items (n=33). Per-item accuracy averaged over 12 reps, then resampled 10k times. Measures uncertainty from which items are in the set.
Run CI: t-interval over 12 shuffled runs. Each run takes one random rep per item (mixed positions). Measures evaluation stability / reproducibility on this fixed benchmark.

Model (thinking budget)	Detection	Item CI	Run CI	Thinking Tokens
GPT 5.4 (high)	97.5%	[94.1%, 99.7%]	[96.1%, 98.8%]	3,493 med · 3,495 mean
GPT 5.2 (high)	96.3%	[92.9%, 98.9%]	[94.2%, 98.4%]	821 med · 840 mean
Opus 4.7 (xhigh)	95.8%	[89.8%, 100.0%]	[93.8%, 97.7%]	719 med · 766 mean
Gemini 3.1 Pro (high)	91.8%	[85.9%, 96.6%]	[89.1%, 94.5%]	954 med · 937 mean
Opus 4.6 (64K)	90.7%	[84.5%, 96.0%]	[89.4%, 90.9%]	1,785 med · 2,166 mean
Sonnet 4.6 (60K)	90.1%	[83.9%, 95.8%]	[86.8%, 93.5%]	2,937 med · 3,603 mean
Opus 4.5 (60K)	87.3%	[80.5%, 93.2%]	[84.2%, 90.4%]	2,666 med · 2,856 mean
Sonnet 4.5 (60K)	85.6%	[79.9%, 91.0%]	[81.9%, 89.3%]	5,556 med · 5,442 mean
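
The two interval types described in the column notes can be sketched as below. This is an illustration under stated assumptions, not the page's actual pipeline. It uses the per-item hit counts from the GPT 5.2 (high) row of the grid (59 items, 6 reps), whereas the notes quote n=33 and 12 reps, so the numbers it produces are not expected to match the table. The function names, the fixed seeds, and the hard-coded critical value `t_crit=2.201` (two-sided 95%, 11 degrees of freedom) are choices made for this sketch.

```python
import random
import statistics

REPS = 6
N_ITEMS = 59
# GPT 5.2 (high) grid row: hits out of 6 per item; unlisted items scored 6/6.
MISSES = {7: 5, 18: 4, 20: 5, 23: 3, 25: 5, 42: 5, 57: 2}
hits = [MISSES.get(i, REPS) for i in range(1, N_ITEMS + 1)]


def item_bootstrap_ci(hits, reps, n_boot=10_000, seed=0):
    """Percentile bootstrap over items: resample per-item accuracies."""
    rng = random.Random(seed)
    acc = [h / reps for h in hits]
    n = len(acc)
    means = sorted(
        statistics.fmean(rng.choices(acc, k=n)) for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]


def run_t_ci(hits, reps, n_runs=12, t_crit=2.201, seed=0):
    """t-interval over runs; each run draws one random rep per item."""
    rng = random.Random(seed)
    run_acc = []
    for _ in range(n_runs):
        # Picking one of `reps` reps at random is a hit with probability h/reps.
        draws = [1 if rng.randrange(reps) < h else 0 for h in hits]
        run_acc.append(statistics.fmean(draws))
    mean = statistics.fmean(run_acc)
    half = t_crit * statistics.stdev(run_acc) / n_runs ** 0.5
    return mean - half, mean + half


point = sum(hits) / (N_ITEMS * REPS)
item_lo, item_hi = item_bootstrap_ci(hits, REPS)
run_lo, run_hi = run_t_ci(hits, REPS)
print(f"detection {point:.1%}")
print(f"item CI [{item_lo:.1%}, {item_hi:.1%}]")
print(f"run CI  [{run_lo:.1%}, {run_hi:.1%}]")
```

The item bootstrap varies which items are in the set, so it widens when a few hard items dominate the misses. The run interval holds the item set fixed and varies only which rep is drawn, so it reflects run-to-run noise.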

Grid legend (detections per item): 6/6 · 2-4 · 0/6

Detection (95% CI): Wilson CI over 59 items.

Item	Paper	Flaw type
1	Ask a Strong LLM Judge when Your Reward Model is Uncertain	confounds
2	Ask a Strong LLM Judge when Your Reward Model is Uncertain	confounds
3	Attention as a Hypernetwork	confounds
4	Attention as a Hypernetwork	confounds
5	Auditing Differentially Private Machine Learning: How Private is Private SGD?	data_leakage
6	Classification with Conceptual Safeguards	strawman_null
7	Classification with Conceptual Safeguards	strawman_null
8	Counterfactual Memorization in Neural Language Models	data_leakage
9	Differential Transformer	confounds
10	Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning	confounds
11	Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning	confounds
12	Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning	data_leakage
13	Equivalence is All: A Unified View for Self-supervised Graph Learning	confounds
14	TRACE: Your Diffusion Model is Secretly an Instance Edge Detector	data_leakage
15	Is it thinking or cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort	data_leakage
16	Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training	data_leakage
17	Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts	data_leakage
18	Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts	construct_validity
19	LLM DNA: Tracing Model Evolution via Functional Representations	confounds
20	True Self-Supervised Novel View Synthesis is Transferable	construct_validity
21	Compositional Diffusion with Guided Search for Long-Horizon Planning	data_leakage
22	OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability	statistical_misuse
23	Hallucination Begins Where Saliency Drops	confounds
24	LLM Fingerprinting via Semantically Conditioned Watermarks	data_leakage
25	LLM Fingerprinting via Semantically Conditioned Watermarks	construct_validity
26	CauKer: classification time series foundation models can be pretrained on synthetic data only	confounds
27	IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning	aggregation_bias
28	IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning	metric_mismatch
29	Inverse Scaling in Test-Time Compute	construct_validity
30	Inverse Scaling in Test-Time Compute	confounds
31	Quantifying Elicitation of Latent Capabilities in Language Models	selection_bias
32	On the Effectiveness of Low Frequency Perturbations	confounds
33	On the Effectiveness of Low Frequency Perturbations	confounds
34	On the Effectiveness of Low Frequency Perturbations	confounds
35	Many-shot Jailbreaking	metric_mismatch
36	Many-shot Jailbreaking	circular_reasoning
37	Measuring Forgetting of Memorized Training Examples	confounds
38	MMA Training: Direct Input Space Margin Maximization through Adversarial Training	robustness_theater
39	MMA Training: Direct Input Space Margin Maximization through Adversarial Training	robustness_theater
40	MMA Training: Direct Input Space Margin Maximization through Adversarial Training	selection_bias
41	More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness	confounds
42	Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models	confounds
43	Participatory Personalization in Classification	data_leakage
44	A Probabilistic Perspective on Unlearning and Alignment for Large Language Models	confounds
45	Reasoning Models Don't Always Say What They Think	confounds
46	Scalable Extraction of Training Data from (Production) Language Models	data_leakage
47	On the Sensitivity of Adversarial Robustness to Input Data Distributions	confounds
48	On the Sensitivity of Adversarial Robustness to Input Data Distributions	confounds
49	SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents	selection_bias
50	SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents	selection_bias
51	SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues	data_leakage
52	Students Parrot Their Teachers: Membership Inference on Model Distillation	circular_reasoning
53	Students Parrot Their Teachers: Membership Inference on Model Distillation	construct_validity
54	Sufficient Context: A New Lens on Retrieval Augmented Generation Systems	data_leakage
55	Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models	selection_bias
56	Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models	confounds
57	Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models	strawman_null
58	ValueNet: A New Dataset for Human Value Driven Dialogue System	data_leakage
59	Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning	data_leakage

Model (thinking budget)	Detection (95% CI)	Thinking (tokens)	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48	49	50	51	52	53	54	55	56	57	58	59
GPT 5.4 (high)	97%	3,493	4	1	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	5	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	5	6	6
GPT 5.2 (high)	96%	821	6	6	6	6	6	6	5	6	6	6	6	6	6	6	6	6	6	4	6	5	6	6	3	6	5	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	5	6	6	6	6	6	6	6	6	6	6	6	6	6	6	2	6	6
Opus 4.7 (xhigh)	96%	719	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	0	2	0	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	1	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2
Gemini 3.1 Pro (high)	92%	954	0	4	6	6	6	6	6	6	6	6	5	6	6	6	6	6	6	6	6	6	6	6	5	5	0	6	6	4	6	6	6	5	6	6	6	6	6	6	6	6	6	6	6	6	6	3	6	6	6	6	3	6	6	6	5	4	6	6	6
Opus 4.6 (64K)	91%	1,785	2	2	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	5	6	2	6	6	1	5	1	6	6	2	6	6	6	6	6	6	6	6	4	6	6	6	6	4	6	6	6	6	6	6	5	6	6	6	6	6	6	6	5	6	6
Sonnet 4.6 (60K)	90%	2,937	4	3	6	6	6	6	6	6	6	6	6	4	6	6	6	6	5	2	6	4	6	6	0	6	1	6	6	1	6	6	6	6	6	6	6	6	3	6	6	6	6	4	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6
Opus 4.5 (60K)	87%	2,666	6	1	5	6	6	6	6	6	6	6	6	4	6	6	6	6	6	5	6	4	6	6	0	3	0	6	6	4	6	6	6	3	6	5	6	6	2	5	6	6	6	2	6	6	6	6	5	4	6	6	6	6	6	6	6	6	5	6	6
Sonnet 4.5 (60K)	86%	5,556	6	4	4	6	3	6	6	6	6	6	6	6	6	6	6	6	5	3	6	5	6	5	1	5	1	6	6	3	5	6	5	6	6	6	5	5	2	4	4	6	6	3	5	5	5	4	6	6	6	6	6	6	6	6	6	6	3	6	5
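
The detection percentage and a Wilson score interval can be recomputed from a grid row, as in the sketch below. It rests on one assumption that goes beyond the page's wording: every item-by-rep judgment is pooled into a single binomial sample (59 items × 6 reps = 354 trials for GPT 5.4). The function name `wilson_ci` and the sparse `MISSES` encoding of the row are illustrative choices.

```python
import math

REPS = 6
N_ITEMS = 59
# GPT 5.4 (high) grid row: hits out of 6 per item; unlisted items scored 6/6.
MISSES = {1: 4, 2: 1, 18: 5, 57: 5}
hits = sum(MISSES.get(i, REPS) for i in range(1, N_ITEMS + 1))
trials = N_ITEMS * REPS


def wilson_ci(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials ** 2)
    ) / denom
    return center - half, center + half


rate = hits / trials
lo, hi = wilson_ci(hits, trials)
print(f"{hits}/{trials} = {rate:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
# prints "345/354 = 97.5%, 95% CI [95.2%, 98.7%]"
```

Unlike a normal-approximation interval, the Wilson interval stays inside [0, 1] and remains usable when the rate is close to 100%, as it is for the top models here.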


MethodSpot — Validated 59 — Triage