
Example Evaluations

This page collects complete eval file examples you can copy and adapt. Each demonstrates a different AgentV pattern.

A minimal eval with a single question and expected answer:

```yaml
description: Basic arithmetic evaluation
execution:
  target: default
tests:
  - id: simple-addition
    criteria: Correctly calculates 2+2
    input: What is 2 + 2?
    expected_output: "4"
```

Use multipart content to attach files alongside text prompts:

````yaml
description: Code review with guidelines
execution:
  target: azure_base
tests:
  - id: code-review-basic
    criteria: Assistant provides helpful code analysis with security considerations
    input:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content:
          - type: text
            value: |-
              Review this function for security issues:
              ```python
              def get_user(user_id):
                  query = f"SELECT * FROM users WHERE id = {user_id}"
                  return db.execute(query)
              ```
          - type: file
            value: /prompts/security-guidelines.md
    expected_output:
      - role: assistant
        content: |-
          This code has a critical SQL injection vulnerability. The user_id is directly
          interpolated into the query string without sanitization.
          Recommended fix:
          ```python
          def get_user(user_id):
              query = "SELECT * FROM users WHERE id = ?"
              return db.execute(query, (user_id,))
          ```
````

Combine a code judge and an LLM judge on the same test:

```yaml
description: JSON generation with validation
execution:
  target: default
tests:
  - id: json-generation-with-validation
    criteria: Generates valid JSON with required fields
    assert:
      - name: json_format_validator
        type: code_judge
        script: uv run validate_json.py
        cwd: ./evaluators
      - name: content_evaluator
        type: llm_judge
        prompt: ./judges/semantic_correctness.md
    input: |-
      Generate a JSON object for a user with name "Alice",
      email "alice@example.com", and role "admin".
    expected_output: |-
      {
        "name": "Alice",
        "email": "alice@example.com",
        "role": "admin"
      }
```
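The code judge script is ordinary code you supply. As a sketch, a `validate_json.py` along these lines could implement the format check above; note that the I/O contract (candidate output arriving on stdin, pass/fail signalled via exit code) is an assumption for illustration, not documented AgentV behavior:

```python
# Hypothetical code judge for the json-generation test above.
# Checks that the candidate answer is a JSON object containing
# the required fields; the harness contract is assumed.
import json

REQUIRED_FIELDS = {"name", "email", "role"}

def validate(raw: str) -> tuple[bool, str]:
    """Return (passed, reason) for a candidate answer."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return False, "top-level value must be an object"
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    return True, "ok"
```

A script entry point would then read the candidate answer (for example from stdin) and call `sys.exit(0 if passed else 1)` so the harness can interpret the result.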

Validate that an agent uses specific tools during execution:

```yaml
description: Tool usage validation
execution:
  target: mock_agent
tests:
  # Validate minimum tool usage (order doesn't matter)
  - id: research-depth
    criteria: Agent researches thoroughly
    input: Research REST vs GraphQL
    assert:
      - name: research-check
        type: tool_trajectory
        mode: any_order
        minimums:
          knowledgeSearch: 2
          documentRetrieve: 1
  # Validate exact tool sequence
  - id: auth-flow
    criteria: Agent follows auth sequence
    input: Authenticate user
    assert:
      - name: auth-sequence
        type: tool_trajectory
        mode: exact
        expected:
          - tool: checkCredentials
          - tool: generateToken
```
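Conceptually, the two modes above reduce to simple checks over the sequence of tool names in the agent's trace. A minimal Python sketch (the list-of-names trace representation is an assumption for illustration, not AgentV's internal format):

```python
from collections import Counter

def check_any_order(calls: list[str], minimums: dict[str, int]) -> bool:
    """any_order: each listed tool must be called at least its
    minimum number of times, in any position."""
    counts = Counter(calls)
    return all(counts[tool] >= n for tool, n in minimums.items())

def check_exact(calls: list[str], expected: list[str]) -> bool:
    """exact: the trace must be exactly the expected sequence,
    with no extra, missing, or reordered calls."""
    return calls == expected
```

For instance, a trace of `["knowledgeSearch", "documentRetrieve", "knowledgeSearch"]` satisfies the `research-check` minimums above, while `["generateToken", "checkCredentials"]` fails the `exact` auth sequence.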

Evaluate pre-existing trace files without running an agent:

```yaml
description: Static trace evaluation
execution:
  target: static_trace
tests:
  - id: validate-trace-file
    criteria: Trace contains required steps
    input: Analyze trace
    assert:
      - name: trace-check
        type: tool_trajectory
        mode: in_order
        expected:
          - tool: webSearch
          - tool: readFile
```
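In the same spirit, `in_order` plausibly requires the expected tools to appear in the given relative order while allowing other calls in between (an assumption about the mode's semantics). That is a subsequence check, which a single shared iterator expresses neatly:

```python
def check_in_order(calls: list[str], expected: list[str]) -> bool:
    """in_order (assumed semantics): expected tools appear in this
    relative order; unrelated calls may be interleaved. The shared
    iterator is consumed left to right, so order is enforced."""
    remaining = iter(calls)
    return all(tool in remaining for tool in expected)
```

A trace of `["webSearch", "fetchPage", "readFile"]` would pass the `trace-check` above, while `["readFile", "webSearch"]` would fail because the order is reversed.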

Test multi-turn interactions where intermediate messages set context:

````yaml
description: Multi-turn debugging session with clarifying questions
execution:
  target: default
tests:
  - id: debug-with-clarification
    criteria: |-
      Assistant conducts a multi-turn debugging session, asking clarification
      questions when needed, correctly diagnosing the bug, and proposing a clear
      fix with rationale.
    input:
      - role: system
        content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
      - role: user
        content: |-
          I'm getting an off-by-one error in this function, but I can't see why:
          ```python
          def get_items(items):
              result = []
              for i in range(len(items) - 1):
                  result.append(items[i])
              return result
          ```
          Sometimes the last element is missing. Can you help debug this?
      - role: assistant
        content: |-
          I can help debug this. Before I propose a fix, could you tell me:
          - What output you expect for an example input list
          - What output you actually get
      - role: user
        content: |-
          For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.
    expected_output:
      - role: assistant
        content: |-
          You have an off-by-one error in your loop bounds.
          You're iterating with `range(len(items) - 1)`, which stops before the last index.
          To include all items, you can either:
          - Use `range(len(items))`, or
          - Iterate directly over the list: `for item in items:`
          Here's a corrected version:
          ```python
          def get_items(items):
              result = []
              for item in items:
                  result.append(item)
              return result
          ```
````

Evaluate external batch runners that process all tests in one invocation:

```yaml
description: Batch CLI demo (AML screening)
execution:
  target: batch_cli
tests:
  - id: aml-001
    criteria: |-
      Batch runner returns JSON with decision=CLEAR.
    expected_output:
      - role: assistant
        content:
          decision: CLEAR
    input:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
          row:
            id: aml-001
            customer_name: Example Customer A
            origin_country: NZ
            destination_country: AU
            transaction_type: INTERNATIONAL_TRANSFER
            amount: 5000
            currency: USD
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-batch-cli-output.ts
        cwd: .
  - id: aml-002
    criteria: |-
      Batch runner returns JSON with decision=REVIEW.
    expected_output:
      - role: assistant
        content:
          decision: REVIEW
    input:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
          row:
            id: aml-002
            customer_name: Example Customer B
            origin_country: IR
            destination_country: AU
            transaction_type: INTERNATIONAL_TRANSFER
            amount: 2000
            currency: USD
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-batch-cli-output.ts
        cwd: .
```
Key points for this pattern:

- `execution.target: batch_cli` — configure a CLI provider with `provider_batching: true`
- The batch runner reads the eval YAML via the `--eval` flag and outputs JSONL keyed by `id`
- Put structured data in `user.content` as objects for the runner to extract
- Use `expected_output` with object fields for structured expected output
- Each test has its own evaluator to validate its portion of the output
File references in eval files (such as `type: file` attachments and judge prompts) are resolved as follows:

- Absolute paths (starting with `/`) are resolved from the repository root
  - Example: `/prompts/guidelines.md` resolves to `<repo_root>/prompts/guidelines.md`
- Relative paths (starting with `./` or `../`) are resolved from the eval file's directory
  - Example: `../../prompts/file.md` goes two directories up, then into `prompts/`
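The two rules above amount to roughly the following (a hypothetical helper for illustration, not AgentV's actual implementation):

```python
from pathlib import Path

def resolve_eval_path(raw: str, repo_root: Path, eval_dir: Path) -> Path:
    """Absolute paths (leading /) are anchored at the repo root;
    relative paths are anchored at the eval file's directory."""
    if raw.startswith("/"):
        return (repo_root / raw.lstrip("/")).resolve()
    return (eval_dir / raw).resolve()
```

For example, with a repo root of `/repo` and an eval file in `/repo/a/b/evals/`, the path `../../prompts/file.md` resolves to `/repo/a/prompts/file.md`.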
When writing `criteria`:

- Be specific about what success looks like
- Mention the key elements that must be present
- For classification tasks, specify the expected category
- For reasoning tasks, describe the expected thought process

When writing `expected_output`:

- Show the pattern, not a rigid template
- Allow for natural language variation
- Focus on semantic correctness over exact matching
- Evaluators handle the actual validation logic