Triage Detection Leaderboard

28 human-validated items. Models produce 3-tier severity reports on single experiments, and a V3 semantic judge scores detection quality. Each model is evaluated twice per item (56 evaluations per model), for 280 total evaluations across 5 models.

| Model | Detection rate | Detected | E | V | NF |
|---|---|---|---|---|---|
| Sonnet 4.6 | 91.1% | 51/56 | 49 | 2 | 5 |
| Opus 4.6 | 96.4% | 54/56 | 48 | 6 | 2 |
| Gemini 3.1 Pro | 92.9% | 52/56 | 44 | 8 | 4 |
| Sonnet 4.5 | 91.1% | 51/56 | 45 | 6 | 5 |
| Gemini Flash | 85.7% | 48/56 | 42 | 6 | 8 |
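
The headline percentages follow directly from the per-model tallies: detected evaluations are E plus V, and the denominator is E plus V plus NF. A minimal Python sketch of that arithmetic, using the counts reported above (the `Tally` class and `format_row` helper are illustrative names, not part of the leaderboard tooling):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Per-model counts as reported in the summary (E, V, NF)."""
    e: int
    v: int
    nf: int

    @property
    def detected(self) -> int:
        # An evaluation counts as detected when it is labelled E or V.
        return self.e + self.v

    @property
    def total(self) -> int:
        return self.e + self.v + self.nf

    @property
    def rate(self) -> float:
        return self.detected / self.total


# Counts copied from the summary above.
TALLIES = {
    "Sonnet 4.6": Tally(e=49, v=2, nf=5),
    "Opus 4.6": Tally(e=48, v=6, nf=2),
    "Gemini 3.1 Pro": Tally(e=44, v=8, nf=4),
    "Sonnet 4.5": Tally(e=45, v=6, nf=5),
    "Gemini Flash": Tally(e=42, v=6, nf=8),
}


def format_row(name: str, tally: Tally) -> str:
    """Render one summary line, e.g. 'Opus 4.6: 96.4% (54/56)'."""
    return f"{name}: {tally.rate:.1%} ({tally.detected}/{tally.total})"


if __name__ == "__main__":
    for name, tally in TALLIES.items():
        print(format_row(name, tally))
```

Summing `total` over the five models gives the 280 evaluations quoted in the introduction.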

| # | Item [tag] | Sonnet 4.6 | Opus 4.6 | Gemini 3.1 Pro | Sonnet 4.5 | Gemini Flash |
|---|---|---|---|---|---|---|
| 1 | Classification with Conceptual Safeguards [strawman_null] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 2 | Attention as a Hypernetwork [confounds] | EE 2/2 | EE 2/2 | EE 2/2 | E 1/2 | E 1/2 |
| 3 | LLM DNA: Tracing Model Evolution via Functional Representations [confounds] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 4 | CauKer: classification time series foundation models can be pretrained on synthetic data only [confounds] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 5 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [data_leakage] | EE 2/2 | VE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 6 | Quantifying Elicitation of Latent Capabilities in Language Models [selection_bias] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | VE 2/2 |
| 7 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [construct_validity] | 0/2 | VV 2/2 | VV 2/2 | V 1/2 | VV 2/2 |
| 8 | Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training [data_leakage] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 9 | Strategy Coopetition Explains the Emergence and Transience of In-Context Learning [confounds] | EE 2/2 | VE 2/2 | VV 2/2 | E 1/2 | EV 2/2 |
| 10 | Many-shot Jailbreaking [metric_mismatch] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 11 | Inverse Scaling in Test-Time Compute [selection_bias] | EE 2/2 | EE 2/2 | EE 2/2 | EV 2/2 | EE 2/2 |
| 12 | Compositional Diffusion with Guided Search for Long-Horizon Planning [data_leakage] | EE 2/2 | EE 2/2 | EE 2/2 | VV 2/2 | EE 2/2 |
| 13 | TRACE: Your Diffusion Model is Secretly an Instance Edge Detector [data_leakage] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 14 | Differential Transformer [confounds] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 15 | Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning [data_leakage] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 16 | OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability [statistical_misuse] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | E 1/2 |
| 17 | Classification with Conceptual Safeguards [strawman_null] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 18 | Is it thinking or cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort [data_leakage] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 19 | Participatory Personalization in Classification [data_leakage] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 20 | SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents [selection_bias] | EE 2/2 | VV 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 21 | Adaptive Surrogate Gradients for Sequential Reinforcement Learning in Spiking Neural Networks [confounds] | 0/2 | 0/2 | 0/2 | V 1/2 | 0/2 |
| 22 | On Scaling Up 3D Gaussian Splatting Training [confounds] | V 1/2 | EE 2/2 | 0/2 | E 1/2 | 0/2 |
| 23 | Reasoning Models Don't Always Say What They Think [confounds] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 24 | Sufficient Context: A New Lens on Retrieval Augmented Generation Systems [data_leakage] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 25 | Inverse Scaling in Test-Time Compute [confounds] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 |
| 26 | Equivalence is All: A Unified View for Self-supervised Graph Learning [confounds] | EV 2/2 | EE 2/2 | VV 2/2 | VE 2/2 | V 1/2 |
| 27 | Sabotage Evaluations for Frontier Models [confounds] | EE 2/2 | EE 2/2 | VV 2/2 | EE 2/2 | EV 2/2 |
| 28 | Many-shot Jailbreaking [circular_reasoning] | EE 2/2 | EE 2/2 | EE 2/2 | EE 2/2 | E 1/2 |
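
Each grid cell carries one letter (E or V) per detected evaluation, followed by a detected/total fraction. A cell with fewer than two letters has the remainder undetected. The sketch below shows how a model's column could be rolled up into its E/V/NF totals. It assumes, consistent with the reported totals, that every missing letter counts as one NF. The function names and the hand-transcribed `SONNET_46_CELLS` list are illustrative, not part of the leaderboard tooling.

```python
from collections import Counter

RUNS_PER_ITEM = 2  # each cell in the grid reports n/2


def tally_cell(letters: str, runs: int = RUNS_PER_ITEM) -> Counter:
    """Count E, V and NF for one cell such as 'EE', 'VE', 'V' or ''."""
    if len(letters) > runs or set(letters) - {"E", "V"}:
        raise ValueError(f"unexpected cell code: {letters!r}")
    counts = Counter(letters)
    # Assumption: an evaluation with no letter was not detected (NF).
    counts["NF"] = runs - len(letters)
    return counts


def tally_column(cells: list[str]) -> Counter:
    """Sum the per-cell counts over one model's column."""
    total = Counter()
    for cell in cells:
        total.update(tally_cell(cell))
    return total


# Sonnet 4.6 column, items 1 to 28, transcribed from the grid;
# '' stands for a 0/2 cell with no letters.
SONNET_46_CELLS = (
    ["EE"] * 6        # items 1-6
    + [""]            # item 7
    + ["EE"] * 13     # items 8-20
    + ["", "V"]       # items 21-22
    + ["EE"] * 3      # items 23-25
    + ["EV"]          # item 26
    + ["EE"] * 2      # items 27-28
)

if __name__ == "__main__":
    totals = tally_column(SONNET_46_CELLS)
    print(f"E:{totals['E']} V:{totals['V']} NF:{totals['NF']}")
```

Run on the Sonnet 4.6 column, this reproduces the summary counts for that model (E:49 V:2 NF:5).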