MethodSpot — Validated 59 — Triage

Same injected flaws as pairwise, but a different task: given only the tampered experiment (no clean comparison), models list all flaws and partition them by severity (Critical / Notable / Minor). An LLM judge (Sonnet 4.6) then checks whether the injected flaw appears in the model's Critical or Notable tiers.
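The scoring rule described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the report layout (a dict of tier name to flaw descriptions) and the `matches_injected` callable, which stands in for the LLM judge, are both assumptions made for the example.

```python
from typing import Callable, Dict, List

# Only these tiers count toward detection, per the task description.
COUNTED_TIERS = ("Critical", "Notable")


def is_detected(
    report: Dict[str, List[str]],
    matches_injected: Callable[[str], bool],
) -> bool:
    """Return True if a flaw in a counted tier matches the injected flaw.

    `matches_injected` is a hypothetical stand-in for the semantic judge.
    """
    return any(
        matches_injected(flaw)
        for tier in COUNTED_TIERS
        for flaw in report.get(tier, [])
    )


# Hypothetical triage report produced by a model.
report = {
    "Critical": ["Test set overlaps with the tuning data"],
    "Notable": ["Only one random seed reported"],
    "Minor": ["Baseline uses a different learning rate"],
}


def keyword_judge(flaw: str) -> bool:
    # Naive keyword check used only for illustration; the page uses an LLM judge.
    return "learning rate" in flaw.lower()


# The matching flaw sits in the Minor tier, so it does not count as detected.
print(is_detected(report, keyword_judge))
```

Under this rule a model that notices the injected flaw but files it as Minor gets no credit, which is what separates triage from plain flaw listing.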

59 items | 8 models | 40 papers | 6 reps per item | V3 semantic judge (Sonnet 4.6)
Model (thinking budget) | Detection | 95% CI
GPT 5.4 (high) | 97.5% | [95.2, 98.7]
GPT 5.2 (high) | 96.3% | [93.8, 97.8]
Opus 4.7 (xhigh) | 95.8% | [90.5, 98.2]
Gemini 3.1 Pro (high) | 91.8% | [88.5, 94.2]
Opus 4.6 (64K) | 90.7% | [87.2, 93.3]
Sonnet 4.6 (60K) | 90.1% | [86.6, 92.8]
Opus 4.5 (60K) | 87.3% | [83.4, 90.4]
Sonnet 4.5 (60K) | 85.6% | [81.6, 88.9]
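For readers who want to check the intervals, here is a minimal sketch of the Wilson score interval. It assumes the interval is computed over pooled trials (59 items x 6 reps = 354) with z = 1.96. The page does not state this pooling outright, but it appears consistent with the reported numbers. The count of 345 successes is the sum of the GPT 5.4 per-item cells in the grid.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    )
    return center - half_width, center + half_width


# Assumption: 354 pooled trials, 345 of them judged as detections.
low, high = wilson_interval(345, 354)
print(f"{345 / 354:.1%} [{low:.1%}, {high:.1%}]")  # roughly 97.5% [95.2%, 98.7%]
```

The Wilson interval is asymmetric near 100%, which is why the upper bounds above sit closer to the point estimate than the lower bounds do.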
Cell legend (detections per item): 6/6 | 5 | 2-4 | 1 | 0/6
Grid columns: Model (thinking budget) | Detection (95% CI; Wilson CI over 59 items) | Thinking (tokens) | one cell per item
Item | Paper | Flaw type
1 | Ask a Strong LLM Judge when Your Reward Model is Uncertain | confounds
2 | Ask a Strong LLM Judge when Your Reward Model is Uncertain | confounds
3 | Attention as a Hypernetwork | confounds
4 | Attention as a Hypernetwork | confounds
5 | Auditing Differentially Private Machine Learning: How Private is Private SGD? | data_leakage
6 | Classification with Conceptual Safeguards | strawman_null
7 | Classification with Conceptual Safeguards | strawman_null
8 | Counterfactual Memorization in Neural Language Models | data_leakage
9 | Differential Transformer | confounds
10 | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | confounds
11 | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | confounds
12 | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | data_leakage
13 | Equivalence is All: A Unified View for Self-supervised Graph Learning | confounds
14 | TRACE: Your Diffusion Model is Secretly an Instance Edge Detector | data_leakage
15 | Is it thinking or cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort | data_leakage
16 | Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training | data_leakage
17 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts | data_leakage
18 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts | construct_validity
19 | LLM DNA: Tracing Model Evolution via Functional Representations | confounds
20 | True Self-Supervised Novel View Synthesis is Transferable | construct_validity
21 | Compositional Diffusion with Guided Search for Long-Horizon Planning | data_leakage
22 | OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability | statistical_misuse
23 | Hallucination Begins Where Saliency Drops | confounds
24 | LLM Fingerprinting via Semantically Conditioned Watermarks | data_leakage
25 | LLM Fingerprinting via Semantically Conditioned Watermarks | construct_validity
26 | CauKer: classification time series foundation models can be pretrained on synthetic data only | confounds
27 | IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | aggregation_bias
28 | IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | metric_mismatch
29 | Inverse Scaling in Test-Time Compute | construct_validity
30 | Inverse Scaling in Test-Time Compute | confounds
31 | Quantifying Elicitation of Latent Capabilities in Language Models | selection_bias
32 | On the Effectiveness of Low Frequency Perturbations | confounds
33 | On the Effectiveness of Low Frequency Perturbations | confounds
34 | On the Effectiveness of Low Frequency Perturbations | confounds
35 | Many-shot Jailbreaking | metric_mismatch
36 | Many-shot Jailbreaking | circular_reasoning
37 | Measuring Forgetting of Memorized Training Examples | confounds
38 | MMA Training: Direct Input Space Margin Maximization through Adversarial Training | robustness_theater
39 | MMA Training: Direct Input Space Margin Maximization through Adversarial Training | robustness_theater
40 | MMA Training: Direct Input Space Margin Maximization through Adversarial Training | selection_bias
41 | More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness | confounds
42 | Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models | confounds
43 | Participatory Personalization in Classification | data_leakage
44 | A Probabilistic Perspective on Unlearning and Alignment for Large Language Models | confounds
45 | Reasoning Models Don't Always Say What They Think | confounds
46 | Scalable Extraction of Training Data from (Production) Language Models | data_leakage
47 | On the Sensitivity of Adversarial Robustness to Input Data Distributions | confounds
48 | On the Sensitivity of Adversarial Robustness to Input Data Distributions | confounds
49 | SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents | selection_bias
50 | SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents | selection_bias
51 | SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues | data_leakage
52 | Students Parrot Their Teachers: Membership Inference on Model Distillation | circular_reasoning
53 | Students Parrot Their Teachers: Membership Inference on Model Distillation | construct_validity
54 | Sufficient Context: A New Lens on Retrieval Augmented Generation Systems | data_leakage
55 | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | selection_bias
56 | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | confounds
57 | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | strawman_null
58 | ValueNet: A New Dataset for Human Value Driven Dialogue System | data_leakage
59 | Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning | data_leakage
Model (thinking budget) | Detection | Thinking (tokens) | Detections per item (items 1-59, in order)
GPT 5.4 (high) | 97% | 3,493 | 4 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 6 6
GPT 5.2 (high) | 96% | 821 | 6 6 6 6 6 6 5 6 6 6 6 6 6 6 6 6 6 4 6 5 6 6 3 6 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 2 6 6
Opus 4.7 (xhigh) | 96% | 719 | 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Gemini 3.1 Pro (high) | 92% | 954 | 0 4 6 6 6 6 6 6 6 6 5 6 6 6 6 6 6 6 6 6 6 6 5 5 0 6 6 4 6 6 6 5 6 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 3 6 6 6 5 4 6 6 6
Opus 4.6 (64K) | 91% | 1,785 | 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 6 2 6 6 1 5 1 6 6 2 6 6 6 6 6 6 6 6 4 6 6 6 6 4 6 6 6 6 6 6 5 6 6 6 6 6 6 6 5 6 6
Sonnet 4.6 (60K) | 90% | 2,937 | 4 3 6 6 6 6 6 6 6 6 6 4 6 6 6 6 5 2 6 4 6 6 0 6 1 6 6 1 6 6 6 6 6 6 6 6 3 6 6 6 6 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Opus 4.5 (60K) | 87% | 2,666 | 6 1 5 6 6 6 6 6 6 6 6 4 6 6 6 6 6 5 6 4 6 6 0 3 0 6 6 4 6 6 6 3 6 5 6 6 2 5 6 6 6 2 6 6 6 6 5 4 6 6 6 6 6 6 6 6 5 6 6
Sonnet 4.5 (60K) | 86% | 5,556 | 6 4 4 6 3 6 6 6 6 6 6 6 6 6 6 6 5 3 6 5 6 5 1 5 1 6 6 3 5 6 5 6 6 6 5 5 2 4 4 6 6 3 5 5 5 4 6 6 6 6 6 6 6 6 6 6 3 6 5
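The per-item cells can be rolled up into the headline detection rate. The sketch below uses the GPT 5.2 (high) row and assumes each cell is a count of detections out of 6 reps, as the "6 reps per item" figure suggests. The function name and the `max_hits` cutoff for flagging hard items are illustrative choices, not part of the page.

```python
from typing import List, Tuple

# GPT 5.2 (high) cells copied from the grid, ten items per line.
GPT_5_2_CELLS = " ".join([
    "6 6 6 6 6 6 5 6 6 6",
    "6 6 6 6 6 6 6 4 6 5",
    "6 6 3 6 5 6 6 6 6 6",
    "6 6 6 6 6 6 6 6 6 6",
    "6 5 6 6 6 6 6 6 6 6",
    "6 6 6 6 6 6 2 6 6",
])


def summarize_row(cells: str, reps: int = 6, max_hits: int = 3) -> Tuple[float, List[int]]:
    """Return (detection rate, 1-based item numbers with at most `max_hits` detections).

    Assumes every cell is a detection count out of `reps` runs.
    """
    counts = [int(tok) for tok in cells.split()]
    if any(c < 0 or c > reps for c in counts):
        raise ValueError("cell outside 0..reps")
    rate = sum(counts) / (len(counts) * reps)
    hard_items = [i for i, c in enumerate(counts, start=1) if c <= max_hits]
    return rate, hard_items


rate, hard_items = summarize_row(GPT_5_2_CELLS)
print(f"{rate:.1%}", hard_items)
```

The `reps` parameter matters: a row whose cells never exceed 2 would need a different value, so it is worth checking the maximum cell before reusing the default.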