Connexity

LLM Judge Evaluation Criteria

This document describes the evaluation metrics used by the LLM-as-a-judge module to score agent conversation transcripts. The canonical definitions live in backend/app/services/judge_metrics.py; this document provides a human-readable reference for the criteria, tiers, scoring rules, and configuration options.

Overview

The judge evaluates completed transcripts against a set of metrics. Each metric belongs to a tier and uses either a scored (0-5 integer) or binary (pass/fail) scale. Metrics carry configurable weights that determine their contribution to the overall score.

| Property | Description |
| --- | --- |
| Overall score | Weighted sum of normalized metric scores, 0-100. |
| Pass/fail | overall_score >= pass_threshold. |
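
As a rough illustration of the two properties above, the following sketch computes an overall score and a pass/fail verdict. The normalization used here (a 0-5 score divided by 5, a binary metric mapped to 1.0 or 0.0), the function names, and the input shapes are assumptions made for this example; the canonical logic lives in judge_metrics.py and may differ.

```python
from typing import Dict, Union

MAX_SCORED = 5  # scored metrics use a 0-5 integer scale


def normalize(value: Union[int, bool]) -> float:
    """Map a raw metric result onto 0.0-1.0 (assumed normalization)."""
    if isinstance(value, bool):  # binary metric: pass -> 1.0, fail -> 0.0
        return 1.0 if value else 0.0
    return value / MAX_SCORED


def overall_score(results: Dict[str, Union[int, bool]],
                  weights: Dict[str, float]) -> float:
    """Weighted sum of normalized scores on a 0-100 scale.

    Assumes the weights have already been renormalized to sum to 1.0.
    """
    return 100.0 * sum(weights[m] * normalize(v) for m, v in results.items())


def passed(score: float, pass_threshold: float = 75.0) -> bool:
    """Pass when the overall score meets or exceeds the threshold."""
    return score >= pass_threshold


# Hypothetical two-metric run with equal weights.
results = {"tool_routing": 4, "grounding_fidelity": 5}
weights = {"tool_routing": 0.5, "grounding_fidelity": 0.5}
score = overall_score(results, weights)
print(round(score, 1), passed(score))
```

Because the comparison is >=, a score exactly equal to pass_threshold counts as a pass.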

Per-Metric Failure Diagnostics

When a metric scores poorly, the judge generates:

  • failure_code: a free-form snake_case label describing the failure mode (e.g. wrong_tool_selected, hallucinated_result, missing_confirmation). These examples are suggestions only; the judge is not limited to a fixed set. The value is null when the metric is acceptable or better.
  • turns: a list of integer turn indices where the issue was observed. The list is empty if there is no issue.
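
One way to picture the two diagnostic fields is as a small record with light validation. The sketch below is illustrative only: the MetricDiagnostics class, its validate method, and the snake_case pattern are hypothetical and are not taken from the judge module.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed definition of snake_case: lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


@dataclass
class MetricDiagnostics:
    """Hypothetical container for the per-metric diagnostic fields."""
    failure_code: Optional[str] = None               # None mirrors JSON null
    turns: List[int] = field(default_factory=list)   # empty when no issue

    def validate(self) -> None:
        """Raise ValueError if either field has an unexpected shape."""
        if self.failure_code is not None and not SNAKE_CASE.match(self.failure_code):
            raise ValueError(f"failure_code is not snake_case: {self.failure_code!r}")
        if any(isinstance(t, bool) or not isinstance(t, int) for t in self.turns):
            raise ValueError("turns must be a list of integer indices")


# An acceptable metric carries no diagnostics.
clean = MetricDiagnostics()
# A poorly scoring metric names the failure mode and the turns involved.
flagged = MetricDiagnostics(failure_code="wrong_tool_selected", turns=[3, 7])
clean.validate()
flagged.validate()
```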

Metric Tiers

| Tier | Purpose | Metrics |
| --- | --- | --- |
| Execution | Did the agent use the right tools correctly? | tool_routing, parameter_extraction, result_interpretation, task_completion |
| Knowledge | Did the agent stay grounded and follow rules? | grounding_fidelity, instruction_compliance |
| Process | Did the agent manage the conversation well? | information_gathering, conversation_management |
| Delivery | Was the response natural and TTS-friendly? | response_delivery |

Default Metrics (8 scored)

The following metrics are included by default when no custom metric selection is provided. All use the 0-5 scored scale.

1. Tool Routing

  • ID: tool_routing
  • Tier: Execution
  • Default weight: 0.15
  • Measures: Correct tool names and call sequence.

| Score | Criteria |
| --- | --- |
| 5 | All expected tools called in correct sequence. No unnecessary calls. |
| 4 | All critical tools called. Minor sequence deviation or one redundant call. |
| 3 | One expected tool missed OR one wrong tool called, but core flow mostly intact. |
| 2 | Multiple tool errors. Flow significantly impacted but partially functional. |
| 1 | Most tool calls incorrect or missing. Only 1 of N expected tools called. |
| 0 | No tools called when required, or entirely wrong tool set used. |

2. Parameter Extraction

  • ID: parameter_extraction
  • Tier: Execution
  • Default weight: 0.15
  • Measures: Argument values correctly extracted from conversation for tools.

| Score | Criteria |
| --- | --- |
| 5 | All parameters correct. Values accurately extracted from user input. |
| 4 | All critical parameters correct. One minor parameter slightly off. |
| 3 | One critical parameter wrong or missing, affecting tool outcome. |
| 2 | Multiple parameter errors. Tool may have returned wrong results or failed. |
| 1 | Most parameters fabricated or missing. Values not grounded in conversation. |
| 0 | No parameters extracted from conversation. All values fabricated or empty. |

3. Result Interpretation

  • ID: result_interpretation
  • Tier: Execution
  • Default weight: 0.15
  • Measures: Tool output accurately reflected in agent response.

| Score | Criteria |
| --- | --- |
| 5 | Tool output accurately and completely reflected. Errors handled gracefully. |
| 4 | Tool output mostly accurate. Minor omission that doesn't mislead user. |
| 3 | One meaningful inaccuracy in conveying tool results. |
| 2 | Significant misrepresentation of tool output. |
| 1 | Tool output largely ignored or contradicted. |
| 0 | Tool output completely ignored. No connection to what the tool returned. |

4. Grounding Fidelity

  • ID: grounding_fidelity
  • Tier: Knowledge
  • Default weight: 0.125
  • Measures: Every agent claim traceable to context, tools, or business rules.

| Score | Criteria |
| --- | --- |
| 5 | Every specific claim grounded. Appropriate hedging for uncertain info. |
| 4 | All critical claims grounded. One minor unverifiable statement. |
| 3 | One meaningful ungrounded claim that could mislead user. |
| 2 | Multiple ungrounded claims. Mix of fabricated facts and invented policies. |
| 1 | Most claims ungrounded. Agent is largely confabulating. |
| 0 | Response is entirely fabricated with no connection to provided context. |

5. Instruction Compliance

  • ID: instruction_compliance
  • Tier: Knowledge
  • Default weight: 0.125
  • Measures: Agent follows explicit rules from system prompt and business rules.

| Score | Criteria |
| --- | --- |
| 5 | All instructions followed precisely. Stayed within role/scope. |
| 4 | All critical instructions followed. One minor deviation. |
| 3 | One meaningful instruction violated. Core functionality intact. |
| 2 | Multiple instructions violated. Partially outside defined boundaries. |
| 1 | Most instructions ignored. Largely operating outside its defined role. |
| 0 | Agent completely disregards system prompt and business rules. |

6. Information Gathering

  • ID: information_gathering
  • Tier: Process
  • Default weight: 0.10
  • Measures: Required info collected before action; previously stated info reused.

| Score | Criteria |
| --- | --- |
| 5 | All required info collected before action. No redundant questions. |
| 4 | All critical info collected. One redundant question or minor missed detail. |
| 3 | One required field missing before action, or forgot one previously stated detail. |
| 2 | Multiple gaps in info collection. Acted on incomplete data. |
| 1 | Most required info not collected. Acted with largely incomplete data. |
| 0 | No info gathering attempted. |

7. Conversation Management

  • ID: conversation_management
  • Tier: Process
  • Default weight: 0.10
  • Measures: Ambiguity handling, error recovery, and conversation closure.

| Score | Criteria |
| --- | --- |
| 5 | Ambiguity clarified. Errors acknowledged and corrected. Proper goodbye. |
| 4 | Good management overall. One minor missed opportunity. |
| 3 | One meaningful management failure. |
| 2 | Multiple management failures. Conversation disjointed. |
| 1 | Conversation poorly managed throughout. |
| 0 | No conversation management. Agent froze or produced incoherent sequence. |

8. Response Delivery

  • ID: response_delivery
  • Tier: Delivery
  • Default weight: 0.10
  • Measures: Concise, natural, TTS-friendly, non-repetitive responses.

| Score | Criteria |
| --- | --- |
| 5 | All responses concise. Natural phrasing. No TTS-hostile formatting. |
| 4 | Mostly natural and concise. One minor issue. |
| 3 | One meaningful delivery issue (e.g. 2+ questions in one turn). |
| 2 | Multiple delivery issues. Robotic or verbose. |
| 1 | Pervasive delivery problems. |
| 0 | Responses entirely unsuitable for voice delivery. |
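
To see how the eight default weights listed above are distributed, the sketch below totals them per tier and checks that they sum to 1.0. The metric IDs, tiers, and weights are copied from this document; the DEFAULT_METRICS list and the weight_by_tier helper are names invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (metric ID, tier, default weight) as listed in this document.
DEFAULT_METRICS: List[Tuple[str, str, float]] = [
    ("tool_routing", "Execution", 0.15),
    ("parameter_extraction", "Execution", 0.15),
    ("result_interpretation", "Execution", 0.15),
    ("grounding_fidelity", "Knowledge", 0.125),
    ("instruction_compliance", "Knowledge", 0.125),
    ("information_gathering", "Process", 0.10),
    ("conversation_management", "Process", 0.10),
    ("response_delivery", "Delivery", 0.10),
]


def weight_by_tier(metrics: List[Tuple[str, str, float]]) -> Dict[str, float]:
    """Sum default weights per tier, rounded to hide float noise."""
    totals: Dict[str, float] = defaultdict(float)
    for _metric_id, tier, weight in metrics:
        totals[tier] += weight
    return {tier: round(total, 3) for tier, total in totals.items()}


tier_totals = weight_by_tier(DEFAULT_METRICS)
total_weight = round(sum(weight for _, _, weight in DEFAULT_METRICS), 6)
print(tier_totals, total_weight)
```

With the defaults, the Execution tier carries the largest share of the overall score and the Delivery tier the smallest.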

Opt-in Metrics

Task Completion (binary)

  • ID: task_completion
  • Tier: Execution
  • Default weight: 0 (must supply explicit weight when selected)
  • Scale: Binary pass/fail
  • Measures: Whether the agent completed the primary task from expected_outcomes.

To include this metric, add it to JudgeConfig.metrics with an explicit weight:

{
  "metrics": [
    { "metric": "tool_routing", "weight": 1.0 },
    { "metric": "task_completion", "weight": 0.5 }
  ]
}

Configuration

JudgeConfig

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| metrics | list[MetricSelection] \| null | null (use defaults) | Selected metrics with optional weight overrides. |
| pass_threshold | float | 75.0 | Minimum overall score (0-100) to pass. |
| model | string \| null | null | Judge LLM model override. |
| provider | string \| null | null | Judge LLM provider override. |

Weight Resolution

  1. If metrics is null or empty, the 8 default scored metrics are used with their default_weight values.
  2. Weights are renormalized to sum to 1.0 after selection.
  3. task_completion requires an explicit weight when selected (its default weight is 0).
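
The three rules above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: the resolve_weights function, the DEFAULT_WEIGHTS table, and the choice to raise ValueError are assumptions, and each selection is assumed to be a dict shaped like the JSON entries shown earlier ("metric" plus an optional "weight").

```python
from typing import Dict, List, Optional

# Default weights of the 8 scored metrics, as listed in this document.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "tool_routing": 0.15,
    "parameter_extraction": 0.15,
    "result_interpretation": 0.15,
    "grounding_fidelity": 0.125,
    "instruction_compliance": 0.125,
    "information_gathering": 0.10,
    "conversation_management": 0.10,
    "response_delivery": 0.10,
}


def resolve_weights(metrics: Optional[List[dict]]) -> Dict[str, float]:
    """Return per-metric weights renormalized to sum to 1.0."""
    if not metrics:  # rule 1: null or empty selection -> defaults
        selected = dict(DEFAULT_WEIGHTS)
    else:
        selected = {}
        for entry in metrics:
            name, weight = entry["metric"], entry.get("weight")
            if weight is None:
                if name == "task_completion":  # rule 3: default weight is 0
                    raise ValueError("task_completion requires an explicit weight")
                weight = DEFAULT_WEIGHTS[name]
            selected[name] = weight
    total = sum(selected.values())
    if total <= 0:
        raise ValueError("selected weights must sum to a positive value")
    # rule 2: renormalize so the weights sum to 1.0
    return {name: weight / total for name, weight in selected.items()}


resolved = resolve_weights([
    {"metric": "tool_routing", "weight": 1.0},
    {"metric": "task_completion", "weight": 0.5},
])
print({name: round(weight, 3) for name, weight in resolved.items()})
```

Under these assumptions, weights of 1.0 and 0.5 resolve to two thirds and one third, so only the ratio between the supplied weights matters.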

Test Case-Level Override

Each test case can carry an evaluation_criteria_override (free text) that is appended to the judge user prompt as a "Test case-specific evaluation emphasis" section. This allows test case authors to add context without changing which metrics are evaluated.
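
A minimal sketch of how such an override might be appended is shown below. The section heading text comes from this document; the build_user_prompt function, the exact layout of the appended section, and the handling of blank overrides are assumptions.

```python
from typing import Optional

OVERRIDE_HEADING = "Test case-specific evaluation emphasis"


def build_user_prompt(base_prompt: str,
                      evaluation_criteria_override: Optional[str]) -> str:
    """Append the override as its own section; leave the prompt untouched otherwise."""
    if evaluation_criteria_override is None or not evaluation_criteria_override.strip():
        return base_prompt
    return (
        f"{base_prompt.rstrip()}\n\n"
        f"{OVERRIDE_HEADING}:\n"
        f"{evaluation_criteria_override.strip()}\n"
    )


# Hypothetical base prompt and override text.
prompt = build_user_prompt(
    "Evaluate the transcript against the selected metrics.",
    "Pay particular attention to whether the agent confirmed the booking date.",
)
print(prompt)
```

In this sketch the metric selection and weights are untouched; the override only adds free-text context for the judge.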

API

GET /config/available-metrics

Returns all registered metrics (including opt-in) for UI discovery. The example response shows a single entry in data for brevity.

{
  "data": [
    {
      "name": "tool_routing",
      "display_name": "Tool Routing",
      "description": "Correct tool names and call sequence.",
      "tier": "execution",
      "default_weight": 0.15,
      "score_type": "scored",
      "rubric": "...",
      "include_in_defaults": true
    }
  ],
  "count": 9
}
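
A client could consume this response along the following lines. The sketch uses only the Python standard library; the base URL passed to fetch_available_metrics is a placeholder to be supplied by the caller, the helper names are invented for the example, and the sample payload is trimmed to the two fields the code reads (the false value for task_completion is an assumption based on it being opt-in).

```python
import json
import urllib.request
from typing import Any, Dict, List, Tuple


def fetch_available_metrics(base_url: str) -> Dict[str, Any]:
    """GET <base_url>/config/available-metrics and decode the JSON body."""
    url = base_url.rstrip("/") + "/config/available-metrics"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def split_metrics(payload: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Separate default metric names from opt-in ones."""
    defaults: List[str] = []
    opt_in: List[str] = []
    for item in payload["data"]:
        target = defaults if item["include_in_defaults"] else opt_in
        target.append(item["name"])
    return defaults, opt_in


# Trimmed, hand-written payload so the example runs without a server.
sample = {
    "data": [
        {"name": "tool_routing", "include_in_defaults": True},
        {"name": "task_completion", "include_in_defaults": False},
    ],
    "count": 2,
}
defaults, opt_in = split_metrics(sample)
print(defaults, opt_in)
```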