Tests

Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.

```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array). Alias: `input` |
| `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_output` |
| `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to lifecycle scripts |
| `rubrics` | No | Structured evaluation criteria |
| `assert` | No | Per-test evaluators |
| `sidecar` | No | Additional metadata passed to evaluators |

The simplest form of `input` is a string, which expands to a single user message:

```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:

```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
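Both forms can be thought of as normalizing to a message array. A minimal sketch of that expansion rule (illustrative only, not the framework's actual implementation):

```python
def normalize_messages(value):
    """Expand a string into a single user message; pass message arrays through.

    Sketch of the documented expansion rule: a bare string becomes one
    user-role message, while a list is assumed to already be messages.
    """
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return value

messages = normalize_messages("What is 15 + 27?")
# -> [{"role": "user", "content": "What is 15 + 27?"}]
```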

The `expected_output` field is an optional reference response used by evaluators for comparison. A string expands to a single assistant message:

```yaml
expected_output: "42"
```

For structured or multi-message expected output, use a message array:

```yaml
expected_output:
  - role: assistant
    content: "42"
```

Override the default target or evaluators for specific tests:

```yaml
tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
    assert:
      - name: depth_check
        type: llm_judge
        prompt: ./judges/depth.md
```

Per-case `assert` evaluators are merged with root-level `assert` evaluators: test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:

```yaml
assert:
  - name: latency_check
    type: latency
    threshold: 5000
tests:
  - id: normal-case
    criteria: Returns correct answer
    input: What is 2+2?
    # Gets latency_check from root-level assert
  - id: special-case
    criteria: Handles edge case
    input: Handle this edge case
    execution:
      skip_defaults: true
    assert:
      - name: custom_eval
        type: llm_judge
    # Does NOT get latency_check
```
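The merge order can be sketched in a few lines. This is an illustration of the documented rule, not the runner's actual code; the dict shapes mirror the YAML schema:

```python
def effective_asserts(test, root_asserts):
    """Compute the evaluator list for one test.

    Documented rule: test-specific evaluators first, then root-level
    defaults appended, unless execution.skip_defaults is true.
    """
    if test.get("execution", {}).get("skip_defaults"):
        return list(test.get("assert", []))
    return list(test.get("assert", [])) + list(root_asserts)

root = [{"name": "latency_check", "type": "latency", "threshold": 5000}]
normal = {"id": "normal-case"}
special = {"id": "special-case",
           "execution": {"skip_defaults": True},
           "assert": [{"name": "custom_eval", "type": "llm_judge"}]}

effective_asserts(normal, root)   # root defaults apply
effective_asserts(special, root)  # only custom_eval runs
```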

Override the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:

```yaml
workspace:
  before_all:
    script: ["bun", "run", "default-setup.ts"]
tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      before_all:
        script: ["bun", "run", "custom-setup.ts"]
  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level before_all
```
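One way to picture the override rule is a shallow field-by-field merge, with test-level keys winning. This is an assumption for illustration (the runner's actual merge depth may differ):

```python
def effective_workspace(suite_ws, test_ws):
    """Sketch: test-level workspace fields replace suite-level fields.

    Assumes a shallow merge over top-level keys such as before_all;
    not the framework's actual implementation.
    """
    merged = dict(suite_ws or {})
    merged.update(test_ws or {})
    return merged

suite = {"before_all": {"script": ["bun", "run", "default-setup.ts"]}}
case1 = {"before_all": {"script": ["bun", "run", "custom-setup.ts"]}}

effective_workspace(suite, case1)  # custom-setup.ts wins for case-1
effective_workspace(suite, None)   # case-2 inherits the suite default
```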

See Workspace Lifecycle Hooks for the full workspace config reference.

Pass arbitrary key-value pairs to lifecycle scripts via the metadata field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:

```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
workspace:
  before_all:
    script: ["python", "checkout_repo.py"]
```

The `metadata` field is included in the stdin JSON passed to lifecycle scripts as `case_metadata`.
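A lifecycle script can read it back like this. Only the `case_metadata` key is documented above; the payload's other keys, and the helper name, are illustrative:

```python
import json

def read_case_metadata(stdin_text):
    """Parse the runner's stdin JSON and return the per-case metadata.

    Returns an empty dict when no case_metadata is present, so suite-level
    hooks without a current case still work.
    """
    payload = json.loads(stdin_text)
    return payload.get("case_metadata", {})

# Example payload shape (illustrative):
meta = read_case_metadata(
    '{"case_metadata": {"repo": "sympy/sympy", "base_commit": "abc123def"}}'
)
# meta["repo"] -> "sympy/sympy"
```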

The `assert` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.

These evaluators run without an LLM call and produce binary (0 or 1) scores:

| Type | Field | Description |
| --- | --- | --- |
| `contains` | `value` | Checks if output contains the substring |
| `regex` | `value` | Checks if output matches the regex pattern |
| `is_json` | — | Checks if output is valid JSON |
| `equals` | `value` | Checks if output exactly equals the value (trimmed) |

```yaml
tests:
  - id: json-api
    criteria: Returns valid JSON with status field
    input: Return the system status as JSON
    assert:
      - type: is_json
      - type: contains
        value: '"status"'
```

Assertion evaluators auto-generate a name when one is not provided (e.g., `contains-DENIED`, `is_json`).
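The semantics of the four deterministic types can be sketched directly from the table. This mirrors the documented behavior, not the framework's evaluator code:

```python
import json
import re

def score_assertion(kind, output, value=None):
    """Binary (0 or 1) scoring for the deterministic assertion types.

    contains: substring check; regex: pattern search; is_json: parseable
    JSON; equals: exact match after trimming whitespace.
    """
    if kind == "contains":
        return int(value in output)
    if kind == "regex":
        return int(re.search(value, output) is not None)
    if kind == "is_json":
        try:
            json.loads(output)
            return 1
        except ValueError:
            return 0
    if kind == "equals":
        return int(output.strip() == value)
    raise ValueError(f"unknown assertion type: {kind}")

score_assertion("is_json", '{"status": "ok"}')            # 1
score_assertion("contains", '{"status": "ok"}', '"status"')  # 1
```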

Use `type: rubrics` with a `criteria` array to define structured LLM-judged evaluation criteria inline:

```yaml
tests:
  - id: denied-party
    criteria: Must identify denied party
    input:
      - role: user
        content: Screen "Acme Corp" against denied parties list
    expected_output:
      - role: assistant
        content: "DENIED"
    assert:
      - type: contains
        value: "DENIED"
        required: true
      - type: rubrics
        criteria:
          - id: accuracy
            outcome: Correctly identifies the denied party
            weight: 5.0
          - id: reasoning
            outcome: Provides clear reasoning for the decision
            weight: 3.0
```

Any evaluator in `assert` can be marked as required. When a required evaluator fails, the overall test verdict is fail regardless of the aggregate score.

| Value | Behavior |
| --- | --- |
| `required: true` | Must score >= 0.8 (default threshold) to pass |
| `required: 0.6` | Must score >= 0.6 to pass (custom threshold between 0 and 1) |

```yaml
assert:
  - type: contains
    value: "DENIED"
    required: true  # must pass (>= 0.8)
  - type: rubrics
    required: 0.6   # must score at least 0.6
    criteria:
      - id: quality
        outcome: Response is well-structured
        weight: 1.0
```

Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to fail.
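The gating step can be sketched as a pass over the scored results. The result-dict shape is an assumption for illustration; only the thresholds follow the documented rule:

```python
def apply_required_gates(results, default_threshold=0.8):
    """Return "fail" when any required evaluator misses its threshold.

    required: true uses the default 0.8 threshold; a float between
    0 and 1 sets a custom threshold. Evaluators without a required
    flag never gate the verdict.
    """
    for r in results:
        req = r.get("required")
        if req is None or req is False:
            continue
        threshold = default_threshold if req is True else float(req)
        if r["score"] < threshold:
            return "fail"
    return "pass"

apply_required_gates([{"score": 1.0, "required": True},
                      {"score": 0.7, "required": 0.6}])  # "pass"
apply_required_gates([{"score": 0.5, "required": 0.6}])  # "fail"
```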

`assert` can be defined at both suite and test levels:

  • Per-test `assert` evaluators run first.
  • Suite-level `assert` evaluators are appended automatically.
  • Set `execution.skip_defaults: true` on a test to skip suite-level defaults.

Pass additional context to evaluators via the `sidecar` field:

```yaml
tests:
  - id: code-gen
    criteria: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```