Measure what your prompts actually do.

Define test sets and agents, score each run with an LLM-as-judge, and diff the results between runs. Ship prompts with numbers, not vibes.

Get started

Step 1

Test set

Inputs + expected behaviors.

Step 2

Agent

A prompt + model.

Step 3

Run

Score the agent.

Step 4

Compare

Diff two runs.
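
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not this tool's actual API: the names `Case`, `Agent`, `run`, `compare`, `fake_generate`, and `keyword_judge` are all hypothetical. No model is called. `fake_generate` stands in for the model, and `keyword_judge` stands in for the LLM-as-judge by checking whether an expected phrase appears in the output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    """One test-set entry: an input plus the behavior we expect to see."""
    input: str
    expected: str


@dataclass(frozen=True)
class Agent:
    """A prompt paired with a model name (the model is not called in this sketch)."""
    name: str
    prompt: str
    model: str


def fake_generate(agent: Agent, user_input: str) -> str:
    # Hypothetical stand-in for a model call: echoes the input and
    # apologizes only when the prompt asks for it.
    reply = f"You said: {user_input}"
    if "apologize" in agent.prompt.lower():
        reply = "Sorry about that. " + reply
    return reply


def keyword_judge(case: Case, output: str) -> float:
    # Stand-in for an LLM-as-judge: 1.0 if the expected phrase appears, else 0.0.
    return 1.0 if case.expected.lower() in output.lower() else 0.0


def run(agent: Agent, test_set: List[Case],
        generate: Callable[[Agent, str], str],
        judge: Callable[[Case, str], float]) -> Dict[str, float]:
    # Step 3: score the agent on every case, keyed by the case input.
    return {c.input: judge(c, generate(agent, c.input)) for c in test_set}


def compare(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    # Step 4: per-case score change for the cases present in both runs.
    return {k: candidate[k] - baseline[k] for k in baseline if k in candidate}


def mean(scores: Dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)


# Step 1: test set. Step 2: two agents that differ only in their prompt.
test_set = [Case("My order is late", "sorry"), Case("What are your hours?", "hours")]
v1 = Agent("v1", "Answer briefly.", "example-model")
v2 = Agent("v2", "Apologize first, then answer briefly.", "example-model")

baseline = run(v1, test_set, fake_generate, keyword_judge)
candidate = run(v2, test_set, fake_generate, keyword_judge)
diff = compare(baseline, candidate)

print(f"v1 mean: {mean(baseline):.2f}, v2 mean: {mean(candidate):.2f}")
for case_input, delta in diff.items():
    print(f"{delta:+.1f}  {case_input}")
```

Keying scores by case input keeps the diff per case, so a prompt change that helps one input and hurts another stays visible instead of being averaged away. In a real setup the judge would be a second model call with its own rubric, which this sketch does not attempt to model.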

Recent runs

No runs yet. Complete the steps above to start measuring.