Models choose which of two experiment designs is more methodologically sound. 33 items, up to 6 repetitions each.
Models produce 3-tier severity reports on single experiments. Scored by the V3 semantic judge.
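
For the pairwise task, items can have different numbers of repetitions, so one option is to average within each item first and then across items. The sketch below assumes that scheme. The source does not say how repetitions are aggregated, and `per_item_accuracy`, `macro_accuracy`, and the `(item_id, is_correct)` record shape are hypothetical names used only for illustration.

```python
from collections import defaultdict


def per_item_accuracy(records):
    """Return the fraction of correct choices for each item.

    `records` is an iterable of (item_id, is_correct) pairs, one pair per
    repetition. This record shape is an assumption, not the source's format.
    """
    totals = defaultdict(lambda: [0, 0])  # item_id -> [correct, attempts]
    for item_id, is_correct in records:
        totals[item_id][0] += int(is_correct)
        totals[item_id][1] += 1
    return {item: correct / attempts for item, (correct, attempts) in totals.items()}


def macro_accuracy(records):
    """Average the per-item accuracies so every item carries equal weight."""
    per_item = per_item_accuracy(records)
    if not per_item:
        raise ValueError("no records to score")
    return sum(per_item.values()) / len(per_item)
```

Averaging per item first means an item with 6 repetitions counts no more than an item with 2. Pooling all repetitions into one fraction would instead weight items by how often they were run.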