Running Simulations

This page explains how many simulations execute when you start a run, how to run from the UI versus the SDK, and how to cover multiple scenarios, drivers, or targets (your test matrix).

How Many Simulations Run?

Each time you run a simulation in Okareo, you choose one Target, one Driver, and one Scenario. The Scenario can contain multiple Scenario Rows. Okareo runs one simulation per Scenario Row (one full conversation per row). You can also set Repeats to run the same row more than once (e.g. for statistical confidence).

Formula:

Total simulations = Number of Scenario Rows × Repeats

Scenario rows   Repeats   Total simulations
1               1         1
5               1         5
3               2         6 (each row run twice)

So a single "Run" in the UI or a single run_simulation call in the SDK can execute many simulations in one batch. Each simulation produces one conversation (transcript + check results) that you can inspect in the app.

Running from the UI

  1. Go to Simulations and click + Create Multi-Turn Simulation (or use the quick-start with example Target, Driver, and Scenario).
  2. Select your Target, Driver, and Scenario. If the Scenario has 10 rows and Repeats is 1, you get 10 simulations.
  3. Choose Checks and any run options (e.g. max turns, first speaker).
  4. Click Run Simulation. Okareo queues and runs all simulations; you see progress and then a summary (e.g. score ring, pass/fail counts).
  5. Open the run to see the list of individual simulations (one per row × repeat) and drill into transcripts and per-turn checks.

See Simulation Introduction for the zero-setup flow and Prompt Target or Custom Endpoint for building your own Target, Driver, and Scenario.

Running from the SDK

Use the Okareo Python or TypeScript SDK to start a run with the same formula: one Target, one Driver, one Scenario (with N rows), and optional repeats. All N × Repeats simulations are executed in one call.

Python example (prompt target):

from okareo import Okareo
from okareo.model_under_test import Target, Driver, GenerationModel
from okareo_api_client.models import ScenarioSetCreate

okareo = Okareo(api_key="...")
target = Target(name="My Agent", target=GenerationModel(model_id="gpt-4o-mini", ...))
driver = Driver(name="Friendly User", prompt_template="...", ...)
scenario = okareo.create_scenario_set(ScenarioSetCreate(name="Test Cases", seed_data=[...]))

# One call runs (len(seed_data) × repeats) simulations
evaluation = okareo.run_simulation(
    driver=driver,
    target=target,
    scenario=scenario,
    name="Regression Run",
    repeats=2,  # each scenario row runs twice
    checks=["behavior_adherence", "result_completed"],
)
print(evaluation.app_link)  # open in app to see all simulations

Voice: The same formula applies to Voice Simulation: pass a scenario with multiple rows and optional repeats, and each row becomes one or more voice sessions, depending on repeats.

Covering Multiple Drivers, Targets, or Scenario Sets

A single run uses one Target, one Driver, and one Scenario. To cover a matrix (e.g. 2 drivers × 3 targets × 1 scenario set with 5 rows):

  • Option 1 — Multiple runs: Run a simulation for each combination (e.g. Driver A + Target 1, then Driver A + Target 2, …). You can do this manually in the UI or script it in the SDK with a loop (see the sketch after this list).
  • Option 2 — Scenario rows as the matrix: Encode the variation in the Scenario instead. Use scenario rows to pass different "modes" or "personas" into the Driver prompt (e.g. { "persona": "frustrated", "task": "refund" }), and use one Driver and one Target. Then total simulations = rows × repeats.
  • Option 3 — CI / automation: Run different simulations (different targets or drivers) in parallel or sequence in your pipeline. See Scheduling Simulations and GitHub Actions or CircleCI.
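Option 1 as a script might look like the following (a minimal sketch, assuming drivers, targets, and scenario were created as in the Python example above, and that Driver and Target keep the name they were created with; the list contents are illustrative):

# Sketch: cover a 2-driver × 3-target matrix against one 5-row Scenario.
# Each run_simulation call is one run; with 5 rows and repeats=1,
# each combination yields 5 simulations (30 in total here).
evaluations = []
for driver in drivers:          # e.g. [friendly_driver, frustrated_driver]
    for target in targets:      # e.g. [agent_v1, agent_v2, agent_v3]
        evaluation = okareo.run_simulation(
            driver=driver,
            target=target,
            scenario=scenario,
            name=f"{driver.name} x {target.name}",
            checks=["behavior_adherence", "result_completed"],
        )
        evaluations.append(evaluation)

for evaluation in evaluations:
    print(evaluation.app_link)  # open each run to compare side-by-side

Each combination appears as its own run in the Simulations list, which is what enables the side-by-side comparison described under Inspecting Results below.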

For most teams, scenario rows × repeats covers the main test matrix (many test cases, optional repeats). Use multiple runs or CI when you need to compare different agents (targets) or different caller types (drivers) systematically.

Templating Driver and Target from Scenario Rows

Driver and Target properties can be supplied per run via the scenario row. Each scenario row’s input does more than feed the Driver’s conversational prompt; it can also drive voices, behaviors, and call/session settings (e.g. phone numbers, endpoints). That way a single run with one Driver and one Target is fully templated: each row can specify different voices, different behaviors, or different numbers, and you get a full test matrix in one batch without multiple runs.

What you can template per row

  • Different voices: put voice, output_voice, or similar fields in the row input; they feed the Driver and/or Target voice config (e.g. the TTS voice per row).
  • Different behaviors: put persona, tone, objective, or task in the row input; the Driver prompt template references them ({scenario_input.persona}, {scenario_input.objective}) so each row behaves differently.
  • Different numbers / endpoints: put from_number, to_number, endpoint_id, etc. in the row input; the Target or Driver config uses them so each simulation calls from or to different numbers or endpoints.

Your Driver prompt template (and any Target or edge config that supports templating) can reference these fields with {scenario_input.<field>}. Each scenario row then supplies the values for that run. For example:

  • Voices: Put voice (e.g. "ash", "coral") in each row’s input; the Driver or voice Target uses {scenario_input.voice} so each simulation runs with a different voice.
  • Behaviors: Put persona, task, or productType in the row; the Driver prompt uses {scenario_input.persona} and {scenario_input.task} so each row exhibits different behavior (e.g. frustrated customer vs calm, refund vs exchange).
  • Numbers: Put from_number / to_number (or your platform’s equivalent) in the row so each simulation places or receives the call from different numbers, useful for testing inbound/outbound or multi-number setups.
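Putting these together, the scenario behind such a templated run might be created like this (a minimal sketch, assuming SeedData accepts a dict input as in the Okareo Python SDK; the field names and phone numbers are illustrative):

from okareo_api_client.models import ScenarioSetCreate, SeedData

# Sketch: each row carries the voice, behavior, and number for one simulation.
# The Driver prompt template references {scenario_input.persona} and
# {scenario_input.task}; voice and number fields feed the voice/call config.
seed_data = [
    SeedData(
        input_={"voice": "ash", "persona": "frustrated", "task": "refund",
                "from_number": "+15550100"},
        result="Refund issued",
    ),
    SeedData(
        input_={"voice": "coral", "persona": "calm", "task": "exchange",
                "from_number": "+15550101"},
        result="Exchange arranged",
    ),
]

scenario = okareo.create_scenario_set(
    ScenarioSetCreate(name="Templated Matrix", seed_data=seed_data)
)
# One run over this scenario covers both voices, both personas, and both
# numbers: rows × repeats simulations, no extra runs needed.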

With this, the entire simulation is templated: one Scenario with many rows, each row defining voices, behaviors, and optionally numbers (or other config). One run executes all rows × repeats, each with the Driver and Target properties implied by that row. See Voice Simulation for a full example with voices and scenario inputs, and Creating Drivers for prompt templates that use {scenario_input.*}.

Inspecting Results

After a run completes:

  • In the app: Open the run from the Simulations list. You see each individual simulation (one per scenario row × repeat) with transcript, checks, and metrics. Use this to compare behavior across rows or to debug a specific failure.
  • Reuse and compare: Run the same Scenario against a different Target or Driver in a separate run, then compare results side-by-side (e.g. before/after a prompt change).

Next Steps