New run
Pick a test set and an agent. Every case is scored by an LLM-as-judge at temperature 0.
No test sets yet. Create one.
No agents yet. Create one.
Scores agent responses 1–5 against the expected behavior.