# Tests
Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.
## Basic Structure

```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
## Fields

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array) |
| `expected_output` | No | Expected response for comparison (string, object, or message array) |
| `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to lifecycle scripts |
| `rubrics` | No | Structured evaluation criteria |
| `assert` | No | Per-test evaluators |
| `sidecar` | No | Additional metadata passed to evaluators |
## Input

The simplest form is a string, which expands to a single user message:

```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:

```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
## Expected Output

An optional reference response for comparison by evaluators. A string expands to a single assistant message:

```yaml
expected_output: "42"
```

For structured or multi-message expected output, use a message array:

```yaml
expected_output:
  - role: assistant
    content: "42"
```
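The string shorthand described above can be modeled as a small normalization step. A minimal sketch in Python; the function name and dict-based message shape are illustrative, not part of the file format:

```python
def normalize(value, default_role):
    """Expand the string shorthand into a message array.

    A bare string becomes a single message with the given role
    ("user" for input, "assistant" for expected_output); a list is
    assumed to already be a message array and passes through unchanged.
    """
    if isinstance(value, str):
        return [{"role": default_role, "content": value}]
    return value

# A bare input string expands to one user message:
print(normalize("What is 15 + 27?", "user"))
# A bare expected_output string expands to one assistant message:
print(normalize("42", "assistant"))
```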
## Per-Case Execution Overrides

Override the default target or evaluators for specific tests:

```yaml
tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
    assert:
      - name: depth_check
        type: llm_judge
        prompt: ./judges/depth.md
```

Per-case `assert` evaluators are merged with root-level `assert` evaluators: test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
```yaml
assert:
  - name: latency_check
    type: latency
    threshold: 5000

tests:
  - id: normal-case
    criteria: Returns correct answer
    input: What is 2+2?
    # Gets latency_check from root-level assert

  - id: special-case
    criteria: Handles edge case
    input: Handle this edge case
    execution:
      skip_defaults: true
    assert:
      - name: custom_eval
        type: llm_judge
        # Does NOT get latency_check
```
## Per-Case Workspace Config

Override the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:

```yaml
workspace:
  before_all:
    script: ["bun", "run", "default-setup.ts"]

tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      before_all:
        script: ["bun", "run", "custom-setup.ts"]

  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level before_all
```

See Workspace Lifecycle Hooks for the full workspace config reference.
## Per-Case Metadata

Pass arbitrary key-value pairs to lifecycle scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:

```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
    workspace:
      before_all:
        script: ["python", "checkout_repo.py"]
```

The `metadata` field is included in the stdin JSON passed to lifecycle scripts as `case_metadata`.
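A lifecycle script can then read `case_metadata` from the stdin payload. A minimal sketch of a `checkout_repo.py`-style script, assuming only that the payload is a JSON object with a `case_metadata` key; the other payload keys and what you do with the values are up to your script:

```python
def parse_case_metadata(payload: dict):
    """Pull per-test repo info out of the harness payload.

    `case_metadata` mirrors the test's `metadata` block verbatim.
    """
    meta = payload.get("case_metadata", {})
    return meta.get("repo"), meta.get("base_commit")

# In a real lifecycle script the payload comes from stdin:
#   payload = json.load(sys.stdin)
sample = {"case_metadata": {"repo": "sympy/sympy", "base_commit": "abc123def"}}
repo, commit = parse_case_metadata(sample)
print(f"would check out {repo}@{commit}")
```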
## Per-Test Assertions

The `assert` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
## Deterministic Assertions

These evaluators run without an LLM call and produce binary (0 or 1) scores:

| Type | Field | Description |
|---|---|---|
| `contains` | `value` | Checks if output contains the substring |
| `regex` | `value` | Checks if output matches the regex pattern |
| `is_json` | — | Checks if output is valid JSON |
| `equals` | `value` | Checks if output exactly equals the value (trimmed) |

```yaml
tests:
  - id: json-api
    criteria: Returns valid JSON with status field
    input: Return the system status as JSON
    assert:
      - type: is_json
      - type: contains
        value: '"status"'
```

Assertion evaluators auto-generate a name when one is not provided (e.g., `contains-DENIED`, `is_json`).
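The semantics in the table above can be sketched as a single dispatch function. This is an illustrative re-implementation, not the evaluator's actual code:

```python
import json
import re

def run_assertion(kind, value, output):
    """Evaluate one deterministic assertion; returns 1 (pass) or 0 (fail)."""
    if kind == "contains":
        return int(value in output)          # substring check
    if kind == "regex":
        return int(re.search(value, output) is not None)
    if kind == "is_json":
        try:
            json.loads(output)
            return 1
        except ValueError:
            return 0
    if kind == "equals":
        return int(output.strip() == value)  # exact match after trimming
    raise ValueError(f"unknown assertion type: {kind}")

print(run_assertion("is_json", None, '{"status": "ok"}'))         # 1
print(run_assertion("contains", '"status"', '{"status": "ok"}'))  # 1
```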
## Rubric Assertions

Use `type: rubrics` with a `criteria` array to define structured LLM-judged evaluation criteria inline:

```yaml
tests:
  - id: denied-party
    criteria: Must identify denied party
    input:
      - role: user
        content: Screen "Acme Corp" against denied parties list
    expected_output:
      - role: assistant
        content: "DENIED"
    assert:
      - type: contains
        value: "DENIED"
        required: true
      - type: rubrics
        criteria:
          - id: accuracy
            outcome: Correctly identifies the denied party
            weight: 5.0
          - id: reasoning
            outcome: Provides clear reasoning for the decision
            weight: 3.0
```
## Required Gates

Any evaluator in `assert` can be marked as `required`. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.

| Value | Behavior |
|---|---|
| `required: true` | Must score >= 0.8 (default threshold) to pass |
| `required: 0.6` | Must score >= 0.6 to pass (custom threshold between 0 and 1) |

```yaml
assert:
  - type: contains
    value: "DENIED"
    required: true   # must pass (>= 0.8)
  - type: rubrics
    required: 0.6    # must score at least 0.6
    criteria:
      - id: quality
        outcome: Response is well-structured
        weight: 1.0
```

Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to `fail`.
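The gating rule can be sketched as follows. This is an illustrative model of the documented behavior, assuming scores are normalized to 0..1 and the non-gate verdict is passed in separately:

```python
def verdict(results, aggregate_pass=True):
    """Apply required gates after all evaluators have run.

    results: list of (score, required) pairs, where required is
    True (default 0.8 threshold), a float threshold, or None (no gate).
    """
    for score, required in results:
        if required in (None, False):
            continue
        threshold = 0.8 if required is True else float(required)
        if score < threshold:
            return "fail"  # a failed required gate forces the verdict
    return "pass" if aggregate_pass else "fail"

print(verdict([(1.0, True), (0.7, 0.6)]))  # pass: both gates met
print(verdict([(1.0, True), (0.5, 0.6)]))  # fail: 0.5 < 0.6 gate
```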
## Assert Merge Behavior

`assert` can be defined at both suite and test levels:

- Per-test `assert` evaluators run first.
- Suite-level `assert` evaluators are appended automatically.
- Set `execution.skip_defaults: true` on a test to skip suite-level defaults.
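The merge order above amounts to a simple list concatenation. A minimal sketch; the function and dict shapes are illustrative:

```python
def merge_asserts(test_asserts, suite_asserts, skip_defaults=False):
    """Per-test evaluators first; suite-level defaults appended unless skipped."""
    if skip_defaults:
        return list(test_asserts)
    return list(test_asserts) + list(suite_asserts)

suite = [{"name": "latency_check"}]
test = [{"name": "custom_eval"}]
print(merge_asserts(test, suite))                      # custom_eval, then latency_check
print(merge_asserts(test, suite, skip_defaults=True))  # custom_eval only
```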
## Sidecar Metadata

Pass additional context to evaluators via the `sidecar` field:

```yaml
tests:
  - id: code-gen
    criteria: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```