# LLM Judges
LLM judges use a language model to evaluate agent responses against custom criteria defined in a prompt file.
## Configuration

Reference an LLM judge in your eval file:

```yaml
assert:
  - name: semantic_check
    type: llm_judge
    prompt: ./judges/correctness.md
```

## Prompt Files
The prompt file defines evaluation criteria and scoring guidelines. It can be a markdown text template or a TypeScript/JavaScript dynamic template.
### Markdown Template

Write evaluation instructions as markdown. Template variables are interpolated:
```markdown
# Evaluation Criteria

Evaluate the candidate's response to the following question:

**Question:** {{question}}
**Criteria:** {{criteria}}
**Reference Answer:** {{reference_answer}}
**Candidate Answer:** {{answer}}

## Scoring

Score the response from 0.0 to 1.0 based on:

1. Correctness — does the answer match the expected outcome?
2. Completeness — does it address all parts of the question?
3. Clarity — is the response clear and well-structured?
```

### Available Template Variables
| Variable | Source |
|---|---|
| `question` | First user message content |
| `criteria` | Test `criteria` field |
| `reference_answer` | Last expected message content |
| `answer` | Last candidate response content |
| `sidecar` | Test sidecar metadata |
| `rubrics` | Test rubrics (if defined) |
| `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) |
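To make the interpolation concrete, here is a minimal sketch of how a `{{variable}}` renderer could substitute these values into a markdown template. The `renderTemplate` helper is hypothetical, written only to illustrate the behavior; AgentV's real renderer may differ (escaping, error handling for missing variables, and so on).

```typescript
// Hypothetical {{variable}} interpolator, for illustration only.
function renderTemplate(
  template: string,
  vars: Record<string, string | undefined>,
): string {
  // Replace each {{name}} with its value; leave unknown variables untouched.
  return template.replace(
    /\{\{(\w+)\}\}/g,
    (match, name: string) => vars[name] ?? match,
  );
}

const prompt = renderTemplate(
  "**Question:** {{question}}\n**Candidate Answer:** {{answer}}",
  { question: "What is 2+2?", answer: "4" },
);
```

After rendering, `prompt` contains the template with `{{question}}` and `{{answer}}` replaced by the test's values.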
### TypeScript Template

For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/eval`:
```typescript
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  const rubric = ctx.config?.rubric as string | undefined;

  return `You are evaluating an AI assistant's response.

## Question
${ctx.question}

## Candidate Answer
${ctx.answer}

${ctx.referenceAnswer ? `## Reference Answer\n${ctx.referenceAnswer}` : ''}

${rubric ? `## Evaluation Criteria\n${rubric}` : ''}

Evaluate and provide a score from 0 to 1.`;
});
```

## How It Works
1. AgentV renders the prompt template with variables from the test
2. The rendered prompt is sent to the judge target (configured in `targets.yaml`)
3. The LLM returns a structured evaluation with score, hits, misses, and reasoning
4. Results are recorded in the output JSONL
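The steps above can be sketched as a small pipeline. Everything here is illustrative, not AgentV's actual API: the `JudgeResult` shape mirrors the score/hits/misses/reasoning fields mentioned above, and `callJudgeModel` stands in for whatever client talks to the judge target.

```typescript
// Hypothetical shape of the structured evaluation returned by the judge LLM.
interface JudgeResult {
  score: number;       // 0.0 to 1.0
  hits: string[];      // criteria the answer satisfied
  misses: string[];    // criteria the answer failed
  reasoning: string;   // the judge's explanation
}

// Illustrative judge flow: send the rendered prompt, parse the structured reply.
async function runJudge(
  renderedPrompt: string,
  callJudgeModel: (prompt: string) => Promise<string>,
): Promise<JudgeResult> {
  const raw = await callJudgeModel(renderedPrompt);
  return JSON.parse(raw) as JudgeResult;
}
```

In the real system, step 1 (template rendering) happens before this function and step 4 (JSONL recording) after it.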
## Script Configuration

When using TypeScript templates, configure them in YAML, with optional `config` data passed to the script:

```yaml
assert:
  - name: custom-eval
    type: llm_judge
    prompt:
      script: [bun, run, ../prompts/custom-evaluator.ts]
      config:
        rubric: "Your rubric here"
        strictMode: true
```

The `config` object is available as `ctx.config` inside the template function.
### Available Context Fields

TypeScript templates receive a context object with these fields:

| Field | Type | Description |
|---|---|---|
| `question` | `string` | First user message content |
| `answer` | `string` | Last entry in `output` |
| `referenceAnswer` | `string` | Last entry in `expected_output` |
| `criteria` | `string` | Test `criteria` field |
| `expectedOutput` | `Message[]` | Full resolved expected output |
| `output` | `Message[]` | Full provider output messages |
| `trace` | `TraceSummary` | Execution metrics summary |
| `config` | `object` | Custom config from YAML |
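As an example of consuming these fields, the sketch below builds a prompt that branches on a custom `strictMode` key from `config` (the key used in the YAML example above). The `JudgeContext` interface is a hypothetical local mirror of the table, written so the example is self-contained; in a real template the `ctx` object is supplied by `definePromptTemplate`.

```typescript
// Hypothetical mirror of the context fields above (subset, for illustration).
interface JudgeContext {
  question: string;
  answer: string;
  referenceAnswer?: string;
  criteria?: string;
  config?: Record<string, unknown>;
}

// Build a judge prompt that reads custom config passed from YAML.
function buildPrompt(ctx: JudgeContext): string {
  const strict = ctx.config?.strictMode === true;
  return [
    `## Question\n${ctx.question}`,
    `## Candidate Answer\n${ctx.answer}`,
    ctx.referenceAnswer ? `## Reference Answer\n${ctx.referenceAnswer}` : '',
    strict
      ? 'Apply the criteria strictly; any factual error scores 0.'
      : 'Score proportionally against the criteria.',
    'Return a score from 0 to 1.',
  ].filter(Boolean).join('\n\n');
}
```

The same pattern works for any key you place under `config` in the assertion's YAML.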
## Template Variable Derivation

Template variables are derived internally through three layers:
### 1. Authoring Layer

What users write in YAML or JSONL:

- `input` — accepts two syntaxes for the same data: a plain string or a full message array. `input: "What is 2+2?"` expands to `[{ role: "user", content: "What is 2+2?" }]`.
- `expected_output` — likewise accepts a string or a message array. `expected_output: "4"` expands to `[{ role: "assistant", content: "4" }]`.
### 2. Resolved Layer

After parsing, canonical message arrays replace the shorthand fields:

- `input: TestMessage[]` — canonical resolved input
- `expected_output: TestMessage[]` — canonical resolved expected output

At this layer, the string shorthands no longer exist; only the canonical message arrays remain.
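The shorthand expansion can be sketched in a few lines. These helpers are illustrative only, not AgentV's parser; `TestMessage` matches the type name used in the docs above.

```typescript
type TestMessage = { role: 'user' | 'assistant'; content: string };

// A string input is shorthand for a single user message.
function resolveInput(input: string | TestMessage[]): TestMessage[] {
  return typeof input === 'string'
    ? [{ role: 'user', content: input }]
    : input;
}

// A string expected_output is shorthand for a single assistant message.
function resolveExpectedOutput(
  expected: string | TestMessage[],
): TestMessage[] {
  return typeof expected === 'string'
    ? [{ role: 'assistant', content: expected }]
    : expected;
}
```

Arrays pass through unchanged, so already-canonical tests are unaffected.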
### 3. Template Variable Layer

Derived strings injected into evaluator prompts:

| Variable | Derivation |
|---|---|
| `question` | Content of the first `user` role entry in `input` |
| `criteria` | Passed through from the test field |
| `reference_answer` | Content of the last entry in `expected_output` |
| `answer` | Content of the last entry in `output` |
| `input` | Full resolved input array, JSON-serialized |
| `expected_output` | Full resolved expected array, JSON-serialized |
| `output` | Full provider output array, JSON-serialized |
| `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) |
Example flow:
```yaml
# User writes:
input: "What is 2+2?"
expected_output: "The answer is 4"

# Resolved:
input: [{ role: "user", content: "What is 2+2?" }]
expected_output: [{ role: "assistant", content: "The answer is 4" }]

# Derived template variables:
question: "What is 2+2?"
reference_answer: "The answer is 4"
answer: (extracted from provider output at runtime)
```
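The derivation table above can be expressed as a short function over the resolved message arrays. This is an illustrative sketch, not AgentV's internal code; the `Msg` type and `deriveVariables` name are invented for the example.

```typescript
type Msg = { role: string; content: string };

// Derive the string template variables from resolved message arrays.
function deriveVariables(input: Msg[], expected: Msg[], output: Msg[]) {
  return {
    // First user-role message becomes the question.
    question: input.find((m) => m.role === 'user')?.content ?? '',
    // Last expected message becomes the reference answer.
    reference_answer: expected.at(-1)?.content ?? '',
    // Last provider output message becomes the candidate answer.
    answer: output.at(-1)?.content ?? '',
    // Full arrays are also exposed, JSON-serialized.
    input: JSON.stringify(input),
    expected_output: JSON.stringify(expected),
    output: JSON.stringify(output),
  };
}
```

Running it on the example above yields `question: "What is 2+2?"`, `reference_answer: "The answer is 4"`, and `answer` taken from the provider's last output message.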