Simulation Studio is regression testing for your agent. Instead of checking one prompt at a time, it replays a whole set of scenarios through the live agent, scores each reply, and shows you a single result you can trust. Its most important job is the baseline diff: it compares this run to an earlier one and flags any scenario that used to pass but now fails — a regression — so a small wording change can't quietly break something that was working.

It's a safe test

A simulation runs your scenarios in a test setting against your draft (or a published version). It never touches live customer conversations, so you can run it as often as you like while you tune the agent.

Scenarios and suites

A scenario is a realistic customer message paired with what a good answer should contain — the same regression tests you create on the Evals tab. A suite is simply a named group of scenarios, so you can run the right set for the moment:

Smoke — a small set of everyday questions to check nothing is obviously broken.
Regression — the cases that have failed before, kept as a guard so they never regress again.
Adversarial — tricky or edge-case messages that have tripped the agent up.
Custom — any grouping that matches how your team works.

You don't have to use suites — you can always run All scenarios. Suites just let you run a focused battery quickly.

Run a simulation

Open the agent and go to Simulations.
Pick the scenarios to run — All scenarios, or a specific suite.
Choose what to run against: your current draft, or the published version that's live today.
Optionally pick a baseline — an earlier run to compare against — then choose Run simulation.
Wait for the run to finish. Convoship sends each scenario through the agent and scores the reply.

Read the results

Each run opens with a headline pass rate, then a row per scenario you can expand to see the exact prompt, the agent's reply, and why it passed or failed. Every reply is scored on three things:

Accuracy — did the reply meet the expected outcome for that scenario.
Resolution — did the agent actually resolve the request rather than punt it to a human.
Policy — did it stay within its guardrails instead of over-blocking a perfectly normal request.

When you run against a baseline, each scenario is tagged so you can see what changed since then:

Regressed — passed in the baseline, fails now. These are highlighted at the top because they're the ones to fix before you ship.
Fixed — failed in the baseline, passes now. Proof your change worked.
New — a scenario the baseline never saw.

Make it a release gate

Run a simulation before every publish, using your last good run as the baseline. If the regression count is zero, you're clear to ship. If it isn't, the failing scenarios tell you exactly what to fix first.

When to run it

Run a simulation before you first publish an agent, and again whenever you change its instructions, tasks, guardrails, or the tools it can use. Comparing each run to the previous one turns testing into a habit that catches problems while they're still cheap to fix.