New run

Pick a test set and an agent. Every case is scored by an LLM-as-judge at temperature 0.

No test sets yet. Create one.

No agents yet. Create one.

The judge scores each agent response from 1 to 5 against the expected behavior.
