Models choose which of two experiment designs is more methodologically sound. Repetitions are position-balanced, so each design appears in each position equally often.
Models produce 3-tier severity reports on single experiments. Reports are scored by the V3 semantic judge.
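
One way the position balancing in the pairwise task could be set up is sketched below. This is an illustration only, not the actual harness: the function names `position_balanced_trials` and `unswap`, the trial dictionary layout, and the "first"/"second" answer format are all assumptions. The sketch alternates presentation order across repetitions and maps a positional answer back to the underlying design, so that a preference for one position does not favor either design.

```python
def position_balanced_trials(design_a, design_b, reps):
    """Build `reps` trials, alternating which design is shown first.

    Hypothetical helper: with an even `reps`, each design occupies
    each position exactly reps / 2 times.
    """
    trials = []
    for i in range(reps):
        swapped = i % 2 == 1  # odd-numbered reps present the pair in reverse order
        first, second = (design_b, design_a) if swapped else (design_a, design_b)
        trials.append({"first": first, "second": second, "swapped": swapped})
    return trials


def unswap(choice, swapped):
    """Map a positional answer ("first" or "second") back to design "a" or "b"."""
    picked_first = choice == "first"
    # In a swapped trial, the first position holds design b.
    return "a" if picked_first != swapped else "b"


if __name__ == "__main__":
    for trial in position_balanced_trials("design A text", "design B text", reps=4):
        print(trial["swapped"], "|", trial["first"], "vs", trial["second"])
```

With an odd number of repetitions the split is off by one, so an even count is the simpler choice if exact balance matters.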