
Example Evaluations

This page collects complete eval file examples you can copy and adapt. Each demonstrates a different AgentV pattern.

A minimal eval with a single question and expected answer:

```yaml
description: Basic arithmetic evaluation
execution:
  target: default
tests:
  - id: simple-addition
    criteria: Correctly calculates 2+2
    input: What is 2 + 2?
    expected_output: "4"
```

Use multipart content to attach files alongside text prompts:

````yaml
description: Code review with guidelines
execution:
  target: azure_base
tests:
  - id: code-review-basic
    criteria: Assistant provides helpful code analysis with security considerations
    input:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content:
          - type: text
            value: |-
              Review this function for security issues:
              ```python
              def get_user(user_id):
                  query = f"SELECT * FROM users WHERE id = {user_id}"
                  return db.execute(query)
              ```
          - type: file
            value: /prompts/security-guidelines.md
    expected_output:
      - role: assistant
        content: |-
          This code has a critical SQL injection vulnerability. The user_id is directly
          interpolated into the query string without sanitization.
          Recommended fix:
          ```python
          def get_user(user_id):
              query = "SELECT * FROM users WHERE id = ?"
              return db.execute(query, (user_id,))
          ```
````

Combine a code judge and an LLM judge on the same test:

```yaml
description: JSON generation with validation
execution:
  target: default
tests:
  - id: json-generation-with-validation
    criteria: Generates valid JSON with required fields
    assert:
      - name: json_format_validator
        type: code_judge
        script: uv run validate_json.py
        cwd: ./evaluators
      - name: content_evaluator
        type: llm_judge
        prompt: ./judges/semantic_correctness.md
    input: |-
      Generate a JSON object for a user with name "Alice",
      email "alice@example.com", and role "admin".
    expected_output: |-
      {
        "name": "Alice",
        "email": "alice@example.com",
        "role": "admin"
      }
```
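The code judge script is ordinary code you supply. As a sketch, a `validate_json.py` along these lines could implement the format check above; note that the I/O contract (candidate output arriving on stdin, pass/fail signalled via exit code) is an assumption for illustration, not documented AgentV behavior:

```python
# Hypothetical code judge for the json-generation test above.
# Checks that the candidate answer is a JSON object containing
# the required fields; the harness contract is assumed.
import json

REQUIRED_FIELDS = {"name", "email", "role"}

def validate(raw: str) -> tuple[bool, str]:
    """Return (passed, reason) for a candidate answer."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return False, "top-level value must be an object"
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    return True, "ok"
```

A script entry point would then read the candidate answer (for example from stdin) and call `sys.exit(0 if passed else 1)` so the harness can interpret the result.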

Validate that an agent uses specific tools during execution:

```yaml
description: Tool usage validation
execution:
  target: mock_agent
tests:
  # Validate minimum tool usage (order doesn't matter)
  - id: research-depth
    criteria: Agent researches thoroughly
    input: Research REST vs GraphQL
    assert:
      - name: research-check
        type: tool_trajectory
        mode: any_order
        minimums:
          knowledgeSearch: 2
          documentRetrieve: 1
  # Validate exact tool sequence
  - id: auth-flow
    criteria: Agent follows auth sequence
    input: Authenticate user
    assert:
      - name: auth-sequence
        type: tool_trajectory
        mode: exact
        expected:
          - tool: checkCredentials
          - tool: generateToken
```
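Conceptually, the two modes above reduce to simple checks over the sequence of tool names in the agent's trace. A minimal Python sketch (the list-of-names trace representation is an assumption for illustration, not AgentV's internal format):

```python
from collections import Counter

def check_any_order(calls: list[str], minimums: dict[str, int]) -> bool:
    """any_order: each listed tool must be called at least its
    minimum number of times, in any position."""
    counts = Counter(calls)
    return all(counts[tool] >= n for tool, n in minimums.items())

def check_exact(calls: list[str], expected: list[str]) -> bool:
    """exact: the trace must be exactly the expected sequence,
    with no extra, missing, or reordered calls."""
    return calls == expected
```

For instance, a trace of `["knowledgeSearch", "documentRetrieve", "knowledgeSearch"]` satisfies the `research-check` minimums above, while `["generateToken", "checkCredentials"]` fails the `exact` auth sequence.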

Evaluate pre-existing trace files without running an agent:

```yaml
description: Static trace evaluation
execution:
  target: static_trace
tests:
  - id: validate-trace-file
    criteria: Trace contains required steps
    input: Analyze trace
    assert:
      - name: trace-check
        type: tool_trajectory
        mode: in_order
        expected:
          - tool: webSearch
          - tool: readFile
```
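In the same spirit, `in_order` plausibly requires the expected tools to appear in the given relative order while allowing other calls in between (an assumption about the mode's semantics). That is a subsequence check, which a single shared iterator expresses neatly:

```python
def check_in_order(calls: list[str], expected: list[str]) -> bool:
    """in_order (assumed semantics): expected tools appear in this
    relative order; unrelated calls may be interleaved. The shared
    iterator is consumed left to right, so order is enforced."""
    remaining = iter(calls)
    return all(tool in remaining for tool in expected)
```

A trace of `["webSearch", "fetchPage", "readFile"]` would pass the `trace-check` above, while `["readFile", "webSearch"]` would fail because the order is reversed.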

Test multi-turn interactions where intermediate messages set context:

````yaml
description: Multi-turn debugging session with clarifying questions
execution:
  target: default
tests:
  - id: debug-with-clarification
    criteria: |-
      Assistant conducts a multi-turn debugging session, asking clarification
      questions when needed, correctly diagnosing the bug, and proposing a clear
      fix with rationale.
    input:
      - role: system
        content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
      - role: user
        content: |-
          I'm getting an off-by-one error in this function, but I can't see why:
          ```python
          def get_items(items):
              result = []
              for i in range(len(items) - 1):
                  result.append(items[i])
              return result
          ```
          Sometimes the last element is missing. Can you help debug this?
      - role: assistant
        content: |-
          I can help debug this. Before I propose a fix, could you tell me:
          - What output you expect for an example input list
          - What output you actually get
      - role: user
        content: |-
          For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.
    expected_output:
      - role: assistant
        content: |-
          You have an off-by-one error in your loop bounds.
          You're iterating with `range(len(items) - 1)`, which stops before the last index.
          To include all items, you can either:
          - Use `range(len(items))`, or
          - Iterate directly over the list: `for item in items:`
          Here's a corrected version:
          ```python
          def get_items(items):
              result = []
              for item in items:
                  result.append(item)
              return result
          ```
````

Evaluate external batch runners that process all tests in one invocation:

```yaml
description: Batch CLI demo (AML screening)
execution:
  target: batch_cli
tests:
  - id: aml-001
    criteria: |-
      Batch runner returns JSON with decision=CLEAR.
    expected_output:
      - role: assistant
        content:
          decision: CLEAR
    input:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
          row:
            id: aml-001
            customer_name: Example Customer A
            origin_country: NZ
            destination_country: AU
            transaction_type: INTERNATIONAL_TRANSFER
            amount: 5000
            currency: USD
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-batch-cli-output.ts
        cwd: .
  - id: aml-002
    criteria: |-
      Batch runner returns JSON with decision=REVIEW.
    expected_output:
      - role: assistant
        content:
          decision: REVIEW
    input:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
          row:
            id: aml-002
            customer_name: Example Customer B
            origin_country: IR
            destination_country: AU
            transaction_type: INTERNATIONAL_TRANSFER
            amount: 2000
            currency: USD
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-batch-cli-output.ts
        cwd: .
```
Key points for this pattern:

- `execution.target: batch_cli` — configure a CLI provider with `provider_batching: true`
- The batch runner reads the eval YAML via the `--eval` flag and outputs JSONL keyed by `id`
- Put structured data in `user.content` as objects for the runner to extract
- Use `expected_output` with object fields for structured expected output
- Each test has its own evaluator to validate its portion of the output
File references in eval files (such as `type: file` attachments and judge prompts) are resolved as follows:

- Absolute paths (starting with `/`) are resolved from the repository root
  - Example: `/prompts/guidelines.md` resolves to `<repo_root>/prompts/guidelines.md`
- Relative paths (starting with `./` or `../`) are resolved from the eval file's directory
  - Example: `../../prompts/file.md` goes two directories up, then into `prompts/`
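The two rules above amount to roughly the following (a hypothetical helper for illustration, not AgentV's actual implementation):

```python
from pathlib import Path

def resolve_eval_path(raw: str, repo_root: Path, eval_dir: Path) -> Path:
    """Absolute paths (leading /) are anchored at the repo root;
    relative paths are anchored at the eval file's directory."""
    if raw.startswith("/"):
        return (repo_root / raw.lstrip("/")).resolve()
    return (eval_dir / raw).resolve()
```

For example, with a repo root of `/repo` and an eval file in `/repo/a/b/evals/`, the path `../../prompts/file.md` resolves to `/repo/a/prompts/file.md`.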
When writing `criteria`:

- Be specific about what success looks like
- Mention the key elements that must be present
- For classification tasks, specify the expected category
- For reasoning tasks, describe the expected thought process

When writing `expected_output`:

- Show the pattern, not a rigid template
- Allow for natural language variation
- Focus on semantic correctness over exact matching
- Evaluators handle the actual validation logic