Connexity

Test Case Schema Specification

Schema version: 1.0.0

Design Principles

  1. Test cases do NOT contain agent_system_prompt or agent_tools. Those are captured on the Run entity at eval execution time.
  2. user_context is a free-form dict — the simulator prompt receives it as a JSON dump. Any domain-specific fields work automatically.
  3. expected_outcomes is also free-form — keys are descriptive labels the judge interprets semantically, not code-parsed enums.
  4. expected_tool_calls defines which tools the agent should (or should not) call, and what parameters are expected. The platform handles mock dispatch and response injection separately from the test case definition.
  5. Simulation is always LLM-persona-driven: the LLM imitates a specific user via a system prompt built from persona + user_context.
  6. max_turns is nullable — when omitted or set to null, the conversation runs until the agent or simulator terminates it naturally (no cap).
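
Principle 6 can be illustrated with a small conversation loop. This is a minimal sketch, not the platform's runner: the function name run_conversation, the two callables, and the convention that returning None means "terminate" are all assumptions made for illustration, as is the choice to count one turn per agent reply.

```python
from typing import Callable, Optional

# Hypothetical callables: each returns the next message, or None to end the conversation.
Reply = Callable[[str], Optional[str]]


def run_conversation(
    agent_reply: Reply,
    simulator_reply: Reply,
    initial_message: str,
    max_turns: Optional[int] = None,
) -> list[tuple[str, str]]:
    """Alternate agent and simulated-user messages until someone stops or the cap is hit."""
    transcript = [("user", initial_message)]
    user_message = initial_message
    turns = 0  # assumption: one turn = one agent reply

    # max_turns=None means no cap, so only a natural termination ends the loop.
    while max_turns is None or turns < max_turns:
        agent_message = agent_reply(user_message)
        if agent_message is None:  # agent terminated the conversation
            break
        transcript.append(("agent", agent_message))
        turns += 1
        if turns == max_turns:  # cap reached; never true when max_turns is None
            break

        next_message = simulator_reply(agent_message)
        if next_message is None:  # simulator terminated the conversation
            break
        transcript.append(("user", next_message))
        user_message = next_message

    return transcript
```

With max_turns=2 and participants that never stop, the sketch returns after the second agent reply; with max_turns left as None, it ends only when one side returns None.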

Field Reference

Top-level Fields

| Field | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| name | str | | Yes | Human-readable short name (max 255 chars) |
| description | str \| null | null | No | What this test case exercises (for humans) |
| difficulty | "normal" \| "hard" | "normal" | No | Two-level difficulty for filtering and weighting |
| tags | list[str] | [] | No | Free-form tags for grouping/filtering. Pre-seeded: "normal", "red-team", "edge-case" |
| status | "draft" \| "active" \| "archived" | "active" | No | Lifecycle state. Only active test cases run by default |
| persona | Persona \| null | null | No | Who the simulated user is (see nested type below) |
| initial_message | str \| null | null | No | First message the simulated user sends to the agent |
| user_context | dict[str, Any] \| null | null | No | Free-form knowledge the user "has". JSON-dumped into simulator prompt. Domain-specific: add any fields needed |
| max_turns | int \| null | null | No | Max conversation turns. null = no cap, conversation runs until agent or simulator terminates naturally |
| expected_outcomes | dict[str, Any] \| null | null | No | Free-form success criteria. Keys = descriptive labels, values = expected state (bool, string, etc.). Judge interprets semantically |
| expected_tool_calls | list[ExpectedToolCall] \| null | null | No | Tool call expectations for judge evaluation (see nested type below) |
| evaluation_criteria_override | str \| null | null | No | Custom judge prompt section. Overrides default criteria for this test case |

Database-only Fields (auto-generated)

| Field | Type | Description |
| --- | --- | --- |
| id | uuid | Primary key, auto-generated |
| created_at | datetime | Server-set on creation |
| updated_at | datetime | Server-set on each update |

Nested Types

Persona

Controls who the simulated user is. Dumped into the LLM simulator's system prompt.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | str | Yes | Short persona archetype label (e.g. "polite-customer", "frustrated-user") |
| description | str | Yes | Detailed persona description |
| instructions | str | Yes | Behavioral directives for the LLM simulator |
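
To make the "persona + user_context" prompt assembly concrete, here is a hedged sketch of a prompt builder. The function name build_simulator_prompt and the exact prompt wording are assumptions for illustration; only the three persona fields and the JSON dump of user_context come from this specification.

```python
import json
from typing import Any, Optional


def build_simulator_prompt(
    persona: Optional[dict[str, str]],
    user_context: Optional[dict[str, Any]],
) -> str:
    """Assemble a simulator system prompt (the wording here is illustrative only)."""
    parts = ["You are role-playing a user who is talking to an AI agent."]

    if persona is not None:
        parts.append(f"Persona type: {persona['type']}")
        parts.append(f"Persona description: {persona['description']}")
        parts.append(f"Instructions: {persona['instructions']}")

    if user_context is not None:
        # Free-form dict: whatever keys the test case defines are passed through as JSON.
        parts.append("What you know (JSON):")
        parts.append(json.dumps(user_context, indent=2, sort_keys=True))

    return "\n".join(parts)
```

Because user_context is serialized as a whole, adding a new domain-specific key to a test case needs no change to a builder written this way.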

ExpectedToolCall

Defines which tools the agent should call during the test case.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| tool | str | Yes | Tool/function name the agent should invoke |
| expected_params | dict[str, Any] \| null | No | Key parameters the judge verifies. null = any params acceptable |
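
The field tables above map naturally onto typed records. The sketch below uses standard-library dataclasses purely as an illustration; the class name EvalTestCase is hypothetical, and the platform's actual model classes (and whatever library they use) are not shown in this document. Database-only fields are omitted.

```python
from dataclasses import dataclass, field
from typing import Any, Literal, Optional


@dataclass
class Persona:
    type: str          # short archetype label, e.g. "polite-customer"
    description: str   # detailed persona description
    instructions: str  # behavioral directives for the LLM simulator


@dataclass
class ExpectedToolCall:
    tool: str
    expected_params: Optional[dict[str, Any]] = None  # None = any params acceptable


@dataclass
class EvalTestCase:  # hypothetical name for the top-level test case record
    name: str
    description: Optional[str] = None
    difficulty: Literal["normal", "hard"] = "normal"
    tags: list[str] = field(default_factory=list)
    status: Literal["draft", "active", "archived"] = "active"
    persona: Optional[Persona] = None
    initial_message: Optional[str] = None
    user_context: Optional[dict[str, Any]] = None
    max_turns: Optional[int] = None  # None = no cap
    expected_outcomes: Optional[dict[str, Any]] = None
    expected_tool_calls: Optional[list[ExpectedToolCall]] = None
    evaluation_criteria_override: Optional[str] = None
```

Note that dataclasses do not enforce Literal annotations at runtime, so a record like this documents the allowed values without validating them.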

How Attributes Are Consumed

| Attribute | Consumer | Purpose |
| --- | --- | --- |
| id | Runner, DB, dashboard | Unique identification, logging, storage |
| name, description | Dashboard, docs | Human-readable display |
| difficulty | Selector, dashboard | Filtering, distribution weighting |
| tags | Selector, dashboard | Grouping, filtering |
| status | CRUD API, selector | Lifecycle: only active test cases run by default |
| persona | Simulator prompt builder | Dumped into LLM system prompt; controls behavior and tone |
| initial_message | Simulator, judge | First turn sent to agent; shown to judge as context |
| user_context | Simulator prompt builder | JSON-dumped into simulator prompt as domain knowledge |
| max_turns | Runner | Caps conversation length; null = unlimited |
| expected_outcomes | Judge | Free-form criteria the judge evaluates against |
| expected_tool_calls | Judge | Verifies correct tool usage and parameters |
| evaluation_criteria_override | Judge | Replaces default judge criteria for this test case. When set, the judge uses this text instead of the platform's default scoring rubric |
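
As an illustration of how a selector might use status, difficulty, and tags, the sketch below filters test cases held as plain dicts. The function select_test_cases and its parameters are hypothetical; in particular, requiring all of the requested tags (rather than any of them) is an assumption.

```python
from typing import Any, Iterable, Optional


def select_test_cases(
    cases: Iterable[dict[str, Any]],
    include_statuses: tuple[str, ...] = ("active",),  # only active cases by default
    difficulty: Optional[str] = None,
    required_tags: Iterable[str] = (),
) -> list[dict[str, Any]]:
    """Return the cases that pass the status, difficulty, and tag filters."""
    wanted_tags = set(required_tags)
    selected = []
    for case in cases:
        # Missing fields fall back to the schema defaults ("active", "normal", []).
        if case.get("status", "active") not in include_statuses:
            continue
        if difficulty is not None and case.get("difficulty", "normal") != difficulty:
            continue
        if not wanted_tags.issubset(case.get("tags") or []):
            continue
        selected.append(case)
    return selected
```

Called with no filters, this returns only active test cases, matching the default described in the table above.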

Validation Constraints

  • name is required and must be ≤ 255 characters.
  • difficulty must be one of: normal, hard.
  • status must be one of: draft, active, archived.
  • tags is a PostgreSQL TEXT[] column with a GIN index for efficient containment queries.
  • persona, user_context, expected_outcomes, expected_tool_calls are stored as JSONB columns.
  • All JSONB fields are nullable — omitting them is valid for draft or minimal test cases.
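
The first three constraints can be checked before a test case ever reaches the database. The following is a minimal sketch under stated assumptions: validate_test_case is a hypothetical helper, the error messages are invented, and the platform's own validation layer may behave differently.

```python
from typing import Any

ALLOWED_DIFFICULTIES = {"normal", "hard"}
ALLOWED_STATUSES = {"draft", "active", "archived"}
MAX_NAME_LENGTH = 255


def validate_test_case(data: dict[str, Any]) -> list[str]:
    """Return a list of human-readable errors; an empty list means the payload passed."""
    errors = []

    name = data.get("name")
    if not isinstance(name, str):
        errors.append("name is required and must be a string")
    elif len(name) > MAX_NAME_LENGTH:
        errors.append(f"name must be at most {MAX_NAME_LENGTH} characters")

    # Omitted fields take their schema defaults, so only explicit bad values fail.
    if data.get("difficulty", "normal") not in ALLOWED_DIFFICULTIES:
        errors.append("difficulty must be one of: normal, hard")
    if data.get("status", "active") not in ALLOWED_STATUSES:
        errors.append("status must be one of: draft, active, archived")

    return errors
```

A payload containing only a valid name passes, which is consistent with the nullable JSONB fields being optional for draft or minimal test cases.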

Example

{
  "name": "Refund Request — Valid",
  "description": "Customer requests refund within 30-day window",
  "difficulty": "normal",
  "tags": ["billing", "refund", "happy-path"],
  "status": "active",
  "persona": {
    "type": "polite-customer",
    "description": "Polite customer who purchased 5 days ago",
    "instructions": "Be cooperative but insistent on getting a full refund. Provide order number when asked."
  },
  "initial_message": "Hi, I'd like to request a refund for my recent order.",
  "user_context": {
    "order_id": "ORD-12345",
    "purchase_date": "2026-03-15",
    "amount": 49.99
  },
  "max_turns": 10,
  "expected_outcomes": {
    "refund_initiated": true,
    "customer_satisfied": true
  },
  "expected_tool_calls": [
    {
      "tool": "lookup_order",
      "expected_params": { "order_id": "ORD-12345" }
    }
  ]
}
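
For intuition about expected_tool_calls, the sketch below is a strict, deterministic stand-in for the check that the judge performs. It is an assumption-heavy illustration: the judge may well evaluate tool usage more leniently, and the shape of a recorded call ({"tool": ..., "params": ...}) and the function name unmatched_expected_calls are hypothetical.

```python
from typing import Any


def unmatched_expected_calls(
    expected: list[dict[str, Any]],
    actual: list[dict[str, Any]],
) -> list[str]:
    """Return the names of expected tools with no matching recorded call."""
    unmatched = []
    for exp in expected:
        wanted = exp.get("expected_params")  # None = any params acceptable

        def matches(call: dict[str, Any]) -> bool:
            if call["tool"] != exp["tool"]:
                return False
            if wanted is None:
                return True
            params = call.get("params") or {}
            # Only the listed key parameters are compared; extra params are ignored.
            return all(key in params and params[key] == value for key, value in wanted.items())

        if not any(matches(call) for call in actual):
            unmatched.append(exp["tool"])
    return unmatched
```

Applied to the example above, a recorded lookup_order call with order_id "ORD-12345" would satisfy the single expectation, while a call with a different order_id would leave it unmatched.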

See examples/test-cases/ for more examples covering normal, red-team, edge-case, tool-heavy, and multi-turn cases.
