This task uses the same injected flaws as the pairwise task but asks a different question: given only the tampered experiment (no clean comparison), each model lists all the flaws it finds and sorts them by severity (Critical / Notable / Minor). An LLM judge (Sonnet 4.6) then checks whether the injected flaw appears in the model's Critical or Notable tier.
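The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: `ListedFlaw`, `is_detected`, and `detection_rate` are hypothetical names, and the judge's verdict is assumed to arrive as a boolean per listed flaw.

```python
from dataclasses import dataclass

# Tiers that count toward detection, per the task description.
COUNTED_TIERS = {"critical", "notable"}


@dataclass
class ListedFlaw:
    description: str
    tier: str                # "critical", "notable", or "minor"
    matches_injected: bool   # assumed judge verdict for this entry


def is_detected(flaws: list[ListedFlaw]) -> bool:
    """True if an entry the judge matched to the injected flaw is in a counted tier."""
    return any(f.matches_injected and f.tier.lower() in COUNTED_TIERS for f in flaws)


def detection_rate(runs: list[list[ListedFlaw]]) -> float:
    """Fraction of runs in which the injected flaw was detected."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(is_detected(run) for run in runs) / len(runs)


# Same flaw, filed under different tiers (made-up example entries).
hit = [ListedFlaw("Test set overlaps training data", "notable", True)]
miss = [ListedFlaw("Test set overlaps training data", "minor", True)]
print(detection_rate([hit, miss]))  # 0.5
```

Under this reading, a model that notices the injected flaw but files it under Minor gets no credit, so the task measures severity judgment as well as recall.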
Detection CIs are Wilson intervals over 59 items. Columns 1 to 59 of the results table correspond to the following items:

| # | Paper | Injected flaw |
|---|---|---|
| 1 | Ask a Strong LLM Judge when Your Reward Model is Uncertain | confounds |
| 2 | Ask a Strong LLM Judge when Your Reward Model is Uncertain | confounds |
| 3 | Attention as a Hypernetwork | confounds |
| 4 | Attention as a Hypernetwork | confounds |
| 5 | Auditing Differentially Private Machine Learning: How Private is Private SGD? | data_leakage |
| 6 | Classification with Conceptual Safeguards | strawman_null |
| 7 | Classification with Conceptual Safeguards | strawman_null |
| 8 | Counterfactual Memorization in Neural Language Models | data_leakage |
| 9 | Differential Transformer | confounds |
| 10 | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | confounds |
| 11 | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | confounds |
| 12 | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | data_leakage |
| 13 | Equivalence is All: A Unified View for Self-supervised Graph Learning | confounds |
| 14 | TRACE: Your Diffusion Model is Secretly an Instance Edge Detector | data_leakage |
| 15 | Is it thinking or cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort | data_leakage |
| 16 | Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training | data_leakage |
| 17 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts | data_leakage |
| 18 | Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts | construct_validity |
| 19 | LLM DNA: Tracing Model Evolution via Functional Representations | confounds |
| 20 | True Self-Supervised Novel View Synthesis is Transferable | construct_validity |
| 21 | Compositional Diffusion with Guided Search for Long-Horizon Planning | data_leakage |
| 22 | OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability | statistical_misuse |
| 23 | Hallucination Begins Where Saliency Drops | confounds |
| 24 | LLM Fingerprinting via Semantically Conditioned Watermarks | data_leakage |
| 25 | LLM Fingerprinting via Semantically Conditioned Watermarks | construct_validity |
| 26 | CauKer: classification time series foundation models can be pretrained on synthetic data only | confounds |
| 27 | IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | aggregation_bias |
| 28 | IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | metric_mismatch |
| 29 | Inverse Scaling in Test-Time Compute | construct_validity |
| 30 | Inverse Scaling in Test-Time Compute | confounds |
| 31 | Quantifying Elicitation of Latent Capabilities in Language Models | selection_bias |
| 32 | On the Effectiveness of Low Frequency Perturbations | confounds |
| 33 | On the Effectiveness of Low Frequency Perturbations | confounds |
| 34 | On the Effectiveness of Low Frequency Perturbations | confounds |
| 35 | Many-shot Jailbreaking | metric_mismatch |
| 36 | Many-shot Jailbreaking | circular_reasoning |
| 37 | Measuring Forgetting of Memorized Training Examples | confounds |
| 38 | MMA Training: Direct Input Space Margin Maximization through Adversarial Training | robustness_theater |
| 39 | MMA Training: Direct Input Space Margin Maximization through Adversarial Training | robustness_theater |
| 40 | MMA Training: Direct Input Space Margin Maximization through Adversarial Training | selection_bias |
| 41 | More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness | confounds |
| 42 | Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models | confounds |
| 43 | Participatory Personalization in Classification | data_leakage |
| 44 | A Probabilistic Perspective on Unlearning and Alignment for Large Language Models | confounds |
| 45 | Reasoning Models Don't Always Say What They Think | confounds |
| 46 | Scalable Extraction of Training Data from (Production) Language Models | data_leakage |
| 47 | On the Sensitivity of Adversarial Robustness to Input Data Distributions | confounds |
| 48 | On the Sensitivity of Adversarial Robustness to Input Data Distributions | confounds |
| 49 | SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents | selection_bias |
| 50 | SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents | selection_bias |
| 51 | SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues | data_leakage |
| 52 | Students Parrot Their Teachers: Membership Inference on Model Distillation | circular_reasoning |
| 53 | Students Parrot Their Teachers: Membership Inference on Model Distillation | construct_validity |
| 54 | Sufficient Context: A New Lens on Retrieval Augmented Generation Systems | data_leakage |
| 55 | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | selection_bias |
| 56 | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | confounds |
| 57 | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | strawman_null |
| 58 | ValueNet: A New Dataset for Human Value Driven Dialogue System | data_leakage |
| 59 | Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning | data_leakage |

| Model (thinking budget) | Detection (95% CI) | Thinking (tokens) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT 5.4 (high) | 97% | 3,493 | 4 | 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 6 |
| GPT 5.2 (high) | 96% | 821 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 4 | 6 | 5 | 6 | 6 | 3 | 6 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 2 | 6 | 6 |
| Opus 4.7 (xhigh) | 96% | 719 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Gemini 3.1 Pro (high) | 92% | 954 | 0 | 4 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 5 | 0 | 6 | 6 | 4 | 6 | 6 | 6 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 3 | 6 | 6 | 6 | 6 | 3 | 6 | 6 | 6 | 5 | 4 | 6 | 6 | 6 |
| Opus 4.6 (64K) | 91% | 1,785 | 2 | 2 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 2 | 6 | 6 | 1 | 5 | 1 | 6 | 6 | 2 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 4 | 6 | 6 | 6 | 6 | 4 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 6 |
| Sonnet 4.6 (60K) | 90% | 2,937 | 4 | 3 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 4 | 6 | 6 | 6 | 6 | 5 | 2 | 6 | 4 | 6 | 6 | 0 | 6 | 1 | 6 | 6 | 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 3 | 6 | 6 | 6 | 6 | 4 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| Opus 4.5 (60K) | 87% | 2,666 | 6 | 1 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 4 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 4 | 6 | 6 | 0 | 3 | 0 | 6 | 6 | 4 | 6 | 6 | 6 | 3 | 6 | 5 | 6 | 6 | 2 | 5 | 6 | 6 | 6 | 2 | 6 | 6 | 6 | 6 | 5 | 4 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 6 | 6 |
| Sonnet 4.5 (60K) | 86% | 5,556 | 6 | 4 | 4 | 6 | 3 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 3 | 6 | 5 | 6 | 5 | 1 | 5 | 1 | 6 | 6 | 3 | 5 | 6 | 5 | 6 | 6 | 6 | 5 | 5 | 2 | 4 | 4 | 6 | 6 | 3 | 5 | 5 | 5 | 4 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 3 | 6 | 5 |
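The Detection column is reported with a 95% Wilson interval. For readers unfamiliar with it, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. It assumes a plain count of successes out of `n` trials; how the page aggregates repeated runs per item into its interval is not stated, and `wilson_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative counts only (not taken from the table): 57 detections out of 59.
low, high = wilson_interval(57, 59)
print(f"{57 / 59:.1%} detection, 95% CI [{low:.1%}, {high:.1%}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly near 100%, which is where most of the rates in this table sit.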