Tests

Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.

```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array). Alias: `input` |
| `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_output` |
| `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to lifecycle scripts |
| `rubrics` | No | Structured evaluation criteria |
| `assert` | No | Per-test evaluators |
| `sidecar` | No | Additional metadata passed to evaluators |

The simplest form of `input` is a string, which expands to a single user message:

```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:

```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
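Both forms can be thought of as normalizing to a message array. A minimal sketch of that expansion rule (illustrative only, not the framework's actual implementation):

```python
def normalize_messages(value):
    """Expand a string into a single user message; pass message arrays through.

    Sketch of the documented expansion rule: a bare string becomes one
    user-role message, while a list is assumed to already be messages.
    """
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return value

messages = normalize_messages("What is 15 + 27?")
# -> [{"role": "user", "content": "What is 15 + 27?"}]
```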

The `expected_output` field is an optional reference response used by evaluators for comparison. A string expands to a single assistant message:

```yaml
expected_output: "42"
```

For structured or multi-message expected output, use a message array:

```yaml
expected_output:
  - role: assistant
    content: "42"
```

Override the default target or evaluators for specific tests:

```yaml
tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
    assert:
      - name: depth_check
        type: llm_judge
        prompt: ./judges/depth.md
```

Per-case `assert` evaluators are merged with root-level `assert` evaluators: test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:

```yaml
assert:
  - name: latency_check
    type: latency
    threshold: 5000
tests:
  - id: normal-case
    criteria: Returns correct answer
    input: What is 2+2?
    # Gets latency_check from root-level assert
  - id: special-case
    criteria: Handles edge case
    input: Handle this edge case
    execution:
      skip_defaults: true
    assert:
      - name: custom_eval
        type: llm_judge
    # Does NOT get latency_check
```
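The merge order can be sketched in a few lines. This is an illustration of the documented rule, not the runner's actual code; the dict shapes mirror the YAML schema:

```python
def effective_asserts(test, root_asserts):
    """Compute the evaluator list for one test.

    Documented rule: test-specific evaluators first, then root-level
    defaults appended, unless execution.skip_defaults is true.
    """
    if test.get("execution", {}).get("skip_defaults"):
        return list(test.get("assert", []))
    return list(test.get("assert", [])) + list(root_asserts)

root = [{"name": "latency_check", "type": "latency", "threshold": 5000}]
normal = {"id": "normal-case"}
special = {"id": "special-case",
           "execution": {"skip_defaults": True},
           "assert": [{"name": "custom_eval", "type": "llm_judge"}]}

effective_asserts(normal, root)   # root defaults apply
effective_asserts(special, root)  # only custom_eval runs
```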

Override the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:

```yaml
workspace:
  before_all:
    script: ["bun", "run", "default-setup.ts"]
tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      before_all:
        script: ["bun", "run", "custom-setup.ts"]
  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level before_all
```
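One way to picture the override rule is a shallow field-by-field merge, with test-level keys winning. This is an assumption for illustration (the runner's actual merge depth may differ):

```python
def effective_workspace(suite_ws, test_ws):
    """Sketch: test-level workspace fields replace suite-level fields.

    Assumes a shallow merge over top-level keys such as before_all;
    not the framework's actual implementation.
    """
    merged = dict(suite_ws or {})
    merged.update(test_ws or {})
    return merged

suite = {"before_all": {"script": ["bun", "run", "default-setup.ts"]}}
case1 = {"before_all": {"script": ["bun", "run", "custom-setup.ts"]}}

effective_workspace(suite, case1)  # custom-setup.ts wins for case-1
effective_workspace(suite, None)   # case-2 inherits the suite default
```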

See Workspace Lifecycle Hooks for the full workspace config reference.

Pass arbitrary key-value pairs to lifecycle scripts via the metadata field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:

```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
workspace:
  before_all:
    script: ["python", "checkout_repo.py"]
```

The `metadata` field is included in the stdin JSON passed to lifecycle scripts as `case_metadata`.
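A lifecycle script can read it back like this. Only the `case_metadata` key is documented above; the payload's other keys, and the helper name, are illustrative:

```python
import json

def read_case_metadata(stdin_text):
    """Parse the runner's stdin JSON and return the per-case metadata.

    Returns an empty dict when no case_metadata is present, so suite-level
    hooks without a current case still work.
    """
    payload = json.loads(stdin_text)
    return payload.get("case_metadata", {})

# Example payload shape (illustrative):
meta = read_case_metadata(
    '{"case_metadata": {"repo": "sympy/sympy", "base_commit": "abc123def"}}'
)
# meta["repo"] -> "sympy/sympy"
```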

The `assert` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.

These evaluators run without an LLM call and produce binary (0 or 1) scores:

| Type | Field | Description |
| --- | --- | --- |
| `contains` | `value` | Checks if output contains the substring |
| `regex` | `value` | Checks if output matches the regex pattern |
| `is_json` | — | Checks if output is valid JSON |
| `equals` | `value` | Checks if output exactly equals the value (trimmed) |

```yaml
tests:
  - id: json-api
    criteria: Returns valid JSON with status field
    input: Return the system status as JSON
    assert:
      - type: is_json
      - type: contains
        value: '"status"'
```

Assertion evaluators auto-generate a name when one is not provided (e.g., `contains-DENIED`, `is_json`).
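The semantics of the four deterministic types can be sketched directly from the table. This mirrors the documented behavior, not the framework's evaluator code:

```python
import json
import re

def score_assertion(kind, output, value=None):
    """Binary (0 or 1) scoring for the deterministic assertion types.

    contains: substring check; regex: pattern search; is_json: parseable
    JSON; equals: exact match after trimming whitespace.
    """
    if kind == "contains":
        return int(value in output)
    if kind == "regex":
        return int(re.search(value, output) is not None)
    if kind == "is_json":
        try:
            json.loads(output)
            return 1
        except ValueError:
            return 0
    if kind == "equals":
        return int(output.strip() == value)
    raise ValueError(f"unknown assertion type: {kind}")

score_assertion("is_json", '{"status": "ok"}')            # 1
score_assertion("contains", '{"status": "ok"}', '"status"')  # 1
```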

Use `type: rubrics` with a `criteria` array to define structured LLM-judged evaluation criteria inline:

```yaml
tests:
  - id: denied-party
    criteria: Must identify denied party
    input:
      - role: user
        content: Screen "Acme Corp" against denied parties list
    expected_output:
      - role: assistant
        content: "DENIED"
    assert:
      - type: contains
        value: "DENIED"
        required: true
      - type: rubrics
        criteria:
          - id: accuracy
            outcome: Correctly identifies the denied party
            weight: 5.0
          - id: reasoning
            outcome: Provides clear reasoning for the decision
            weight: 3.0
```

Any evaluator in `assert` can be marked as required. When a required evaluator fails, the overall test verdict is fail regardless of the aggregate score.

| Value | Behavior |
| --- | --- |
| `required: true` | Must score >= 0.8 (default threshold) to pass |
| `required: 0.6` | Must score >= 0.6 to pass (custom threshold between 0 and 1) |

```yaml
assert:
  - type: contains
    value: "DENIED"
    required: true  # must pass (>= 0.8)
  - type: rubrics
    required: 0.6   # must score at least 0.6
    criteria:
      - id: quality
        outcome: Response is well-structured
        weight: 1.0
```

Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to fail.
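The gating step can be sketched as a pass over the scored results. The result-dict shape is an assumption for illustration; only the thresholds follow the documented rule:

```python
def apply_required_gates(results, default_threshold=0.8):
    """Return "fail" when any required evaluator misses its threshold.

    required: true uses the default 0.8 threshold; a float between
    0 and 1 sets a custom threshold. Evaluators without a required
    flag never gate the verdict.
    """
    for r in results:
        req = r.get("required")
        if req is None or req is False:
            continue
        threshold = default_threshold if req is True else float(req)
        if r["score"] < threshold:
            return "fail"
    return "pass"

apply_required_gates([{"score": 1.0, "required": True},
                      {"score": 0.7, "required": 0.6}])  # "pass"
apply_required_gates([{"score": 0.5, "required": 0.6}])  # "fail"
```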

`assert` can be defined at both suite and test levels:

  • Per-test `assert` evaluators run first.
  • Suite-level `assert` evaluators are appended automatically.
  • Set `execution.skip_defaults: true` on a test to skip suite-level defaults.

Pass additional context to evaluators via the `sidecar` field:

```yaml
tests:
  - id: code-gen
    criteria: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```