New run

Pick a test set and an agent. Every case is scored by an LLM-as-judge at temperature 0.

Test set

No test sets yet. Create one.

Agent

No agents yet. Create one.

Judge model

Scores agent responses 1–5 against the expected behavior.
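
As a rough illustration of what a run does, the sketch below scores each case with a judge called at temperature 0 and parses a 1–5 score from its reply. It is not this tool's implementation: `Case`, `build_judge_prompt`, `parse_score`, `run_eval`, and the `agent` and `judge` callables are all hypothetical names chosen for the example, and the prompt wording is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One test-set entry (hypothetical shape)."""
    prompt: str
    expected_behavior: str


def build_judge_prompt(case: Case, response: str) -> str:
    # Assumed wording; a real judge prompt would likely be more detailed.
    return (
        "Rate the agent response against the expected behavior "
        "on a scale of one (worst) to five (best). "
        "Reply with the score as a single digit.\n"
        f"Expected behavior: {case.expected_behavior}\n"
        f"Agent response: {response}"
    )


def parse_score(judge_output: str) -> int:
    # Take the first standalone digit in the range 1-5.
    match = re.search(r"\b([1-5])\b", judge_output)
    if match is None:
        raise ValueError(f"no 1-5 score found in: {judge_output!r}")
    return int(match.group(1))


def run_eval(
    cases: list[Case],
    agent: Callable[[str], str],
    judge: Callable[..., str],
) -> list[int]:
    scores = []
    for case in cases:
        response = agent(case.prompt)
        # Temperature 0 keeps the judge's output as repeatable as possible.
        verdict = judge(build_judge_prompt(case, response), temperature=0.0)
        scores.append(parse_score(verdict))
    return scores
```

Here `agent` and `judge` are plain callables, so stubs can stand in for real model calls when trying the flow out.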