Evaluation runtimes
An evaluation runtime is the strategy that drives a single test case from start to finish and produces a transcript, whether through an in-process simulator, an external phone or web call, or some other mechanism. Runtimes are pluggable: adding a new one means writing a single class and registering it.
The orchestrator owns concurrency, judging, persistence, and aggregate metrics. The runtime owns the conversation loop. This separation is what lets text and voice runtimes coexist without churning the orchestrator.
Built-in runtimes
| Kind | Used for | Available when |
|---|---|---|
| connexity | Native: in-process user simulator + agent ↔ user text loop | always |
| retell | Drives a Retell web call, then exposes the transcript for judging | agent is on the Retell platform with a configured Retell integration |
| custom_endpoint | Posts to a user-provided HTTP endpoint that honors the OpenAI-compatible agent contract | Non-Retell agents (Custom/Webhook, Vapi, ElevenLabs, legacy rows without platform); agent must be in endpoint mode |
The active runtime is stored in RunConfig.runtime (inside the eval_config.config JSONB column). If the value is absent, the runtime defaults to connexity.
Where things live
Schema-side:
CRUD + routes:
How an eval run uses the runtime
- crud.create_eval_config / update_eval_config calls _validate_runtime(...). The runtime's own validate_config runs, and test cases that use tool calls are rejected if the runtime is not Connexity.
- services.orchestrator.execute_run loads the snapshotted RunConfig, builds an AgentSnapshot and RunSnapshot once, then dispatches each test case through runtime.run_test_case(...) under a Semaphore(config.concurrency).
- After the runtime returns a transcript, the orchestrator calls judge.evaluate_transcript(...) to produce the verdict, computes per-case metrics, and persists TestCaseResult.
- Per-test-case failures land in TestCaseResult.error_message; the run continues.
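The dispatch pattern described above can be sketched with asyncio. This is a minimal illustration rather than the orchestrator's actual code: CaseOutcome, EchoRuntime, judge_transcript and the simplified run_test_case(case_id) signature are all hypothetical stand-ins. It shows only the two properties the list names, namely a semaphore bounding concurrency and a per-case failure being recorded without stopping the run.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseOutcome:
    """Hypothetical stand-in for TestCaseResult (only the fields used here)."""
    case_id: str
    verdict: Optional[str] = None
    error_message: Optional[str] = None


class EchoRuntime:
    """Hypothetical runtime: returns a two-turn transcript, raises for 'bad' cases."""

    async def run_test_case(self, case_id: str) -> List[str]:
        if case_id.startswith("bad"):
            raise RuntimeError(f"agent unreachable for {case_id}")
        return [f"user: hello from {case_id}", "agent: hi"]


def judge_transcript(transcript: List[str]) -> str:
    """Hypothetical judge: passes any transcript with at least two turns."""
    return "pass" if len(transcript) >= 2 else "fail"


async def execute_run(runtime, case_ids: List[str], concurrency: int) -> List[CaseOutcome]:
    semaphore = asyncio.Semaphore(concurrency)  # bounds in-flight test cases

    async def run_one(case_id: str) -> CaseOutcome:
        async with semaphore:
            try:
                transcript = await runtime.run_test_case(case_id)
            except Exception as exc:
                # A failing case is recorded; the other cases keep running.
                return CaseOutcome(case_id, error_message=str(exc))
        # The judge runs in the orchestrator, never inside the runtime.
        return CaseOutcome(case_id, verdict=judge_transcript(transcript))

    # gather preserves the input order of the cases.
    return await asyncio.gather(*(run_one(c) for c in case_ids))


def run_demo() -> List[CaseOutcome]:
    return asyncio.run(execute_run(EchoRuntime(), ["a", "bad-b", "c"], concurrency=2))
```

Calling run_demo() yields three outcomes in input order; the middle one carries an error message and no verdict.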
Snapshots
AgentSnapshot and RunSnapshot are frozen captures taken once per run. They hold everything a runtime needs to know about the agent and the run, so that RuntimeRunArgs stays a three-field struct:
Runtimes pull whatever they need from these structs (args.agent_snapshot.endpoint_url, args.run_snapshot.run_config, etc.) without the orchestrator passing 13 kwargs.
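As a rough sketch of the idea, the snapshots can be modeled as frozen dataclasses. Only agent_snapshot, run_snapshot, endpoint_url and run_config come from the text above; agent_id, run_id, the test_case field and the describe_target helper are assumptions made for illustration, and the real structs may differ.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class AgentSnapshot:
    agent_id: str                 # hypothetical field
    endpoint_url: Optional[str]   # referenced above as args.agent_snapshot.endpoint_url


@dataclass(frozen=True)
class RunSnapshot:
    run_id: str                     # hypothetical field
    run_config: Mapping[str, Any]   # referenced above as args.run_snapshot.run_config


@dataclass(frozen=True)
class RuntimeRunArgs:
    agent_snapshot: AgentSnapshot
    run_snapshot: RunSnapshot
    test_case: Mapping[str, Any]    # assumed third field; not confirmed by the text


def describe_target(args: RuntimeRunArgs) -> str:
    """A runtime reads what it needs from the snapshots instead of taking kwargs."""
    url = args.agent_snapshot.endpoint_url or "<in-process>"
    kind = args.run_snapshot.run_config.get("runtime", "connexity")
    return f"{kind} -> {url}"
```

Because the dataclasses are frozen, assigning to a field after construction raises FrozenInstanceError, which keeps the capture stable for the whole run.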
Adding a new runtime
Worked example: add a myvoice voice runtime.
1. Add the enum value
backend/app/models/enums.py:
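A minimal sketch of this step, assuming the runtime kinds are held in a string-valued Enum. The class name RuntimeKind is hypothetical; use whatever enum enums.py actually defines. The three existing values match the Kind column of the built-in runtimes table.

```python
from enum import Enum


class RuntimeKind(str, Enum):  # hypothetical name for the existing enum
    CONNEXITY = "connexity"
    RETELL = "retell"
    CUSTOM_ENDPOINT = "custom_endpoint"
    MYVOICE = "myvoice"  # the new value added for the worked example
```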
2. Add a config class and extend the discriminated union
backend/app/models/schemas.py:
Re-export MyVoiceRuntimeConfig from app.models.__init__ if it should be importable as from app.models import ….
3. Implement the runtime
backend/app/services/eval_runtimes/voice/myvoice.py:
Runtimes must be safe to instantiate without arguments — the registry creates one shared instance per process.
4. Register it
backend/app/services/eval_runtimes/registry.py:
Append the runtime to the iteration order tuple in runtimes_for_platform so the dropdown order is stable.
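The registration step can be pictured with the following sketch. It is not the contents of registry.py: the _REGISTRY and _ORDER names, the get_runtime helper, the "myvoice" platform string and the simplified supported_for_platform rules (which ignore integration and endpoint-mode checks) are assumptions. It shows one shared, argument-free instance per kind and a fixed order tuple with the new kind appended at the end.

```python
from typing import Dict, List, Optional


class ConnexityRuntime:
    kind = "connexity"

    def supported_for_platform(self, platform: Optional[str]) -> bool:
        return True  # listed as always available


class RetellRuntime:
    kind = "retell"

    def supported_for_platform(self, platform: Optional[str]) -> bool:
        return platform == "retell"


class CustomEndpointRuntime:
    kind = "custom_endpoint"

    def supported_for_platform(self, platform: Optional[str]) -> bool:
        return platform != "retell"  # non-Retell agents, including legacy rows (None)


class MyVoiceRuntime:
    kind = "myvoice"

    def supported_for_platform(self, platform: Optional[str]) -> bool:
        return platform == "myvoice"  # hypothetical platform name


# One shared instance per kind, created without constructor arguments.
_REGISTRY: Dict[str, object] = {
    cls.kind: cls()
    for cls in (ConnexityRuntime, RetellRuntime, CustomEndpointRuntime, MyVoiceRuntime)
}

# Fixed iteration order; a new kind is appended so existing positions do not move.
_ORDER = ("connexity", "retell", "custom_endpoint", "myvoice")


def get_runtime(kind: str):
    return _REGISTRY[kind]


def runtimes_for_platform(platform: Optional[str]) -> List[str]:
    return [k for k in _ORDER if _REGISTRY[k].supported_for_platform(platform)]
```

Under these assumed rules, runtimes_for_platform("retell") returns ["connexity", "retell"], and the result order never depends on dictionary or import order.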
5. Update tests
Add coverage in backend/app/tests/services/eval_runtimes/:
- test_<name>.py: runtime-local coverage for supported_for_platform, validate_config, test_connection, and the run_test_case happy path.
- Add a case to test_dispatch.py proving _execute_single_test_case routes to the new runtime when configured.
6. Regenerate the API client
The new runtime config kind shows up in the OpenAPI schema; the frontend SDK must be re-generated:
CI fails if the generated client is stale.
Contract for run_test_case
runtime.run_test_case(runtime_config, args, session) returns a TestCaseRunResult:
- TestCaseRunResult.transcript: list[ConversationTurn] is consumed for latency/turn metrics and is fed into the Connexity judge by the orchestrator. An empty transcript signals a no-op (e.g. transcript fetch failed); the orchestrator skips the judge and marks the case failed.
- TestCaseRunResult.agent_token_usage / platform_token_usage / agent_cost_usd / platform_cost_usd are merged with the judge's own usage/cost by the orchestrator. Runtimes that don't have meaningful values can leave them empty.
- TestCaseRunResult.runtime_metadata: dict[str, Any] | None is an opaque per-runtime escape hatch. Voice runtimes use it to attach platform call ids, recording URLs, etc. Text runtimes typically leave it None.
- Raise to mark the case errored; the exception message becomes TestCaseResult.error_message.
Runtimes must not call the judge themselves. The orchestrator always runs the Connexity judge on the returned transcript.
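The contract can be read as three outcomes on the orchestrator side: an exception, an empty transcript, or a transcript that goes to the judge. The sketch below illustrates that branching under stated assumptions: RunResultSketch is a cut-down stand-in for TestCaseRunResult, and settle_case, the status strings and the judge's (verdict, cost) return shape are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class RunResultSketch:
    """Cut-down stand-in for TestCaseRunResult."""
    transcript: List[str] = field(default_factory=list)
    agent_cost_usd: Optional[float] = None
    platform_cost_usd: Optional[float] = None
    runtime_metadata: Optional[Dict[str, Any]] = None


def settle_case(
    run_case: Callable[[], RunResultSketch],
    judge: Callable[[List[str]], Tuple[str, float]],
) -> Dict[str, Any]:
    try:
        result = run_case()
    except Exception as exc:
        # Raising marks the case errored; the message is kept for the result row.
        return {"status": "errored", "error_message": str(exc)}

    if not result.transcript:
        # Empty transcript: treated as a no-op, the judge is skipped.
        return {"status": "failed", "error_message": None}

    verdict, judge_cost = judge(result.transcript)
    # Costs the runtime left empty count as zero when merged with the judge's cost.
    total = (result.agent_cost_usd or 0.0) + (result.platform_cost_usd or 0.0) + judge_cost
    return {
        "status": verdict,
        "total_cost_usd": total,
        "runtime_metadata": result.runtime_metadata,  # passed through untouched
    }
```

Keeping the judge call in one place is what lets a voice runtime and a text runtime be scored by the same code path.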
Sharing the text loop
TextRuntimeBase owns the runtime-agnostic loop: user simulator turns, turn ordering, terminating tool calls, timeouts, cancellation, and result assembly. It does not know how to call a Connexity endpoint, a custom endpoint, or Retell.
Text runtimes provide the agent side:
- build_text_agent_config(...) resolves the per-runtime agent settings.
- do_agent_turn(...) executes one agent turn and appends the assistant/tool turns to the transcript.
ConnexityRuntime drives agent turns only through AgentSimulator (app.services.agent_simulator.AgentSimulator), for platform-mode agents. CustomEndpointRuntime drives agent turns only through HTTP POST to your endpoint, for endpoint-mode agents. They share TextRuntimeBase for the user simulator loop only, not for agent inference. Retell will implement its own do_agent_turn(...).
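The split between the shared loop and the per-runtime agent side is a template-method arrangement, sketched below. This is a simplified illustration, not the real TextRuntimeBase: the run_loop name, the scripted user lines standing in for the user simulator, the Turn tuple, the two subclass names and the injected post callable are all assumptions. Timeouts, cancellation and terminating tool calls are left out.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional, Tuple

Turn = Tuple[str, str]  # (role, text); stand-in for ConversationTurn


class TextRuntimeBase(ABC):
    """Owns turn ordering; subclasses supply only the agent side."""

    def run_loop(self, user_lines: List[str]) -> List[Turn]:
        config = self.build_text_agent_config()
        transcript: List[Turn] = []
        for line in user_lines:  # a scripted list stands in for the user simulator
            transcript.append(("user", line))
            self.do_agent_turn(config, transcript)
        return transcript

    @abstractmethod
    def build_text_agent_config(self) -> Dict[str, str]: ...

    @abstractmethod
    def do_agent_turn(self, config: Dict[str, str], transcript: List[Turn]) -> None: ...


class InProcessRuntime(TextRuntimeBase):
    """Plays the role of a simulator-backed runtime."""

    def build_text_agent_config(self) -> Dict[str, str]:
        return {"prefix": "sim"}

    def do_agent_turn(self, config: Dict[str, str], transcript: List[Turn]) -> None:
        transcript.append(("assistant", f"{config['prefix']}: {transcript[-1][1]}"))


def _fake_post(url: str, text: str) -> str:
    return f"echo from {url}: {text}"  # placeholder for a real HTTP POST


class EndpointRuntime(TextRuntimeBase):
    """Plays the role of an endpoint-backed runtime; still constructible with no arguments."""

    def __init__(self, post: Optional[Callable[[str, str], str]] = None):
        self._post = post or _fake_post  # injected so the sketch needs no network

    def build_text_agent_config(self) -> Dict[str, str]:
        return {"endpoint_url": "https://agent.example/chat"}

    def do_agent_turn(self, config: Dict[str, str], transcript: List[Turn]) -> None:
        reply = self._post(config["endpoint_url"], transcript[-1][1])
        transcript.append(("assistant", reply))
```

Both subclasses inherit identical turn ordering from run_loop and differ only in how a single agent reply is produced.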