# Code Judges
Code judges are scripts that evaluate agent responses deterministically. Write them in any language — Python, TypeScript, Node, or any executable.
## Contract

Code judges communicate via stdin/stdout JSON:
Input (stdin):

```json
{
  "question": "What is 15 + 27?",
  "criteria": "Correctly calculates 15 + 27 = 42",
  "answer": "The answer is 42.",
  "reference_answer": "42",
  "sidecar": {}
}
```

Output (stdout):

```json
{
  "score": 1.0,
  "hits": ["Answer contains correct value (42)"],
  "misses": [],
  "reasoning": "Passed 1 check(s)"
}
```

| Output Field | Type | Description |
|---|---|---|
| `score` | number | 0.0 to 1.0 |
| `hits` | string[] | Criteria that passed |
| `misses` | string[] | Criteria that failed |
| `reasoning` | string | Explanation of the score |
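A quick way to sanity-check a judge against this contract is to validate its stdout. The following is a minimal Python sketch; the helper name is ours for illustration and not part of any SDK:

```python
import json


def validate_judge_output(raw: str) -> dict:
    """Parse a judge's stdout and check it against the output contract."""
    result = json.loads(raw)
    score = result["score"]
    assert isinstance(score, (int, float)) and 0.0 <= score <= 1.0, "score must be 0.0-1.0"
    assert all(isinstance(h, str) for h in result.get("hits", [])), "hits must be strings"
    assert all(isinstance(m, str) for m in result.get("misses", [])), "misses must be strings"
    return result


out = validate_judge_output(
    '{"score": 1.0, "hits": ["Answer contains correct value (42)"],'
    ' "misses": [], "reasoning": "Passed 1 check(s)"}'
)
print(out["score"])  # 1.0
```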
## Python Example

```python
import json, sys

data = json.load(sys.stdin)
answer = data.get("answer", "")

hits = []
misses = []

if "42" in answer:
    hits.append("Answer contains correct value (42)")
else:
    misses.append("Answer does not contain expected value (42)")

score = 1.0 if hits else 0.0

print(json.dumps({
    "score": score,
    "hits": hits,
    "misses": misses,
    "reasoning": f"Passed {len(hits)} check(s)"
}))
```

## TypeScript Example
```typescript
import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const answer: string = data.answer ?? "";

const hits: string[] = [];
const misses: string[] = [];

if (answer.includes("42")) {
  hits.push("Answer contains correct value (42)");
} else {
  misses.push("Answer does not contain expected value (42)");
}

console.log(JSON.stringify({
  score: hits.length > 0 ? 1.0 : 0.0,
  hits,
  misses,
  reasoning: `Passed ${hits.length} check(s)`,
}));
```

## Referencing in Eval Files
```yaml
assert:
  - name: my_validator
    type: code_judge
    script: ./validators/check_answer.py
```

## @agentv/eval SDK

The `@agentv/eval` package provides a declarative API with automatic stdin/stdout handling. Use `defineCodeJudge` to skip boilerplate:
```typescript
#!/usr/bin/env bun
import { defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(({ answer, criteria }) => {
  const hits: string[] = [];
  const misses: string[] = [];

  if (answer.includes(criteria)) {
    hits.push('Answer matches expected outcome');
  } else {
    misses.push('Answer does not match expected outcome');
  }

  const total = hits.length + misses.length;
  return {
    score: total === 0 ? 0 : hits.length / total,
    hits,
    misses,
    reasoning: `Passed ${hits.length}/${total} checks`,
  };
});
```

SDK exports: `defineCodeJudge`, `Message`, `ToolCall`, `TraceSummary`, `CodeJudgeInput`, `CodeJudgeResult`
## Target Access

Code judges can call an LLM through a target proxy for metrics that require multiple LLM calls (contextual precision, semantic similarity, etc.).
### Configuration

Add a `target` block to the evaluator config:

```yaml
assert:
  - name: contextual-precision
    type: code_judge
    script: bun scripts/contextual-precision.ts
    target:
      max_calls: 10 # Default: 50
```

Use `createTargetClient` from the SDK:
```typescript
#!/usr/bin/env bun
import { createTargetClient, defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(async ({ question, answer }) => {
  const target = createTargetClient();
  if (!target) return { score: 0, misses: ['Target not configured'] };

  const response = await target.invoke({
    question: `Is this relevant to: ${question}? Response: ${answer}`,
    systemPrompt: 'Respond with JSON: { "relevant": true/false }'
  });

  const result = JSON.parse(response.rawText ?? '{}');
  return { score: result.relevant ? 1.0 : 0.0 };
});
```

Use `target.invokeBatch(requests)` for multiple calls in parallel.
Environment variables (set automatically when target is configured):

| Variable | Description |
|---|---|
| `AGENTV_TARGET_PROXY_URL` | Local proxy URL |
| `AGENTV_TARGET_PROXY_TOKEN` | Bearer token for authentication |
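Judges written without the SDK can read these variables directly. Below is a hedged Python sketch of the availability check, mirroring the `if (!target)` guard in the SDK example above; the proxy's actual HTTP request format is handled by the SDK and is not specified here:

```python
import os


def target_available() -> bool:
    """True when the runner has exported both target proxy variables."""
    return bool(os.environ.get("AGENTV_TARGET_PROXY_URL")) and bool(
        os.environ.get("AGENTV_TARGET_PROXY_TOKEN")
    )


if not target_available():
    # Without a configured target, a judge should degrade gracefully,
    # e.g. emit {"score": 0, "misses": ["Target not configured"]}
    pass
```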
## Advanced Input Fields

Beyond the basic `question`, `criteria`, `answer`, and `reference_answer` fields, code judges receive additional context:

| Field | Type | Description |
|---|---|---|
| `guideline_files` | string[] | Paths to guideline files referenced in the eval |
| `input_files` | string[] | Paths to input files referenced in the eval |
| `input` | Message[] | Full resolved input message array |
| `expected_output` | Message[] | Expected agent behavior including tool calls |
| `output` | Message[] | Actual agent execution trace with tool calls |
| `trace` | TraceSummary | Lightweight execution metrics |
| `file_changes` | string \| null | Unified diff of workspace file changes (when `workspace_template` is configured) |
| `workspace_path` | string \| null | Absolute path to the workspace directory (when `workspace_template` is configured) |
### trace structure

```json
{
  "event_count": 5,
  "tool_names": ["fetch", "search"],
  "tool_calls_by_name": { "search": 2, "fetch": 1 },
  "error_count": 0,
  "llm_call_count": 2,
  "token_usage": { "input": 1000, "output": 500 },
  "cost_usd": 0.0015,
  "duration_ms": 3500,
  "start_time": "2026-02-13T10:00:00.000Z",
  "end_time": "2026-02-13T10:00:03.500Z"
}
```

| Field | Type | Description |
|---|---|---|
| `event_count` | number | Total tool invocations |
| `tool_names` | string[] | Unique tool names used |
| `tool_calls_by_name` | Record<string, number> | Count per tool |
| `error_count` | number | Failed tool calls |
| `llm_call_count` | number | Number of LLM calls (assistant messages) |
| `token_usage` | {input, output} | Token consumption |
| `cost_usd` | number | Estimated cost |
| `duration_ms` | number | Total execution duration |
| `start_time` | string | ISO timestamp of first event |
| `end_time` | string | ISO timestamp of last event |
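Because these fields are plain data, a judge can grade efficiency from `trace` alone. Here is a minimal Python sketch using only the documented fields; the tool-call budget of 10 is a hypothetical value chosen for illustration, not a product default:

```python
import json


def grade_trace(trace: dict, max_tool_calls: int = 10) -> dict:
    """Score from trace metrics alone: no tool errors, and within a call budget."""
    hits, misses = [], []

    if trace.get("error_count", 0) == 0:
        hits.append("No failed tool calls")
    else:
        misses.append(f"{trace['error_count']} tool call(s) failed")

    # Hypothetical budget for illustration
    if trace.get("event_count", 0) <= max_tool_calls:
        hits.append("Within tool-call budget")
    else:
        misses.append("Exceeded tool-call budget")

    total = len(hits) + len(misses)
    return {
        "score": len(hits) / total if total else 0.0,
        "hits": hits,
        "misses": misses,
        "reasoning": f"Passed {len(hits)}/{total} check(s)",
    }


# Using the sample trace values from above:
print(json.dumps(grade_trace({"event_count": 5, "error_count": 0})))
```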
Use `expected_output` for retrieval context in RAG evals (tool calls with outputs) and `output` for the actual agent execution trace from live runs.
## Workspace Access

When `workspace_template` is configured on a target, code judges receive the workspace path in two ways:

- JSON payload: `workspace_path` field in the stdin input
- Environment variable: `AGENTV_WORKSPACE_PATH`
This enables functional grading: running commands like `npm test`, `pytest`, or `cargo test` directly in the agent's workspace.
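The two delivery mechanisms above can be combined in a small helper. A minimal Python sketch (the helper name is ours for illustration):

```python
import os
from typing import Optional


def resolve_workspace(payload: dict) -> Optional[str]:
    """Prefer the workspace_path field from stdin; fall back to the env var."""
    return payload.get("workspace_path") or os.environ.get("AGENTV_WORKSPACE_PATH")
```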
### Example: Deploy-and-Test Pattern

```typescript
#!/usr/bin/env bun
import { readFileSync } from "fs";
import { execFileSync } from "child_process";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const cwd = input.workspace_path;

const hits: string[] = [];
const misses: string[] = [];

// Stage 1: Install dependencies
try {
  execFileSync("npm", ["install"], { cwd, stdio: "pipe" });
  hits.push("npm install passed");
} catch {
  misses.push("npm install failed");
}

// Stage 2: Typecheck
try {
  execFileSync("npx", ["tsc", "--noEmit"], { cwd, stdio: "pipe" });
  hits.push("typecheck passed");
} catch {
  misses.push("typecheck failed");
}

// Stage 3: Run tests
try {
  execFileSync("npm", ["test"], { cwd, stdio: "pipe" });
  hits.push("tests passed");
} catch {
  misses.push("tests failed");
}

const total = hits.length + misses.length;
console.log(JSON.stringify({
  score: total > 0 ? hits.length / total : 0,
  hits,
  misses,
}));
```

Configure the target with a workspace template:

```yaml
targets:
  - name: my_agent
    provider: cli
    command_template: "my-agent --task {INPUT_FILE} --output {OUTPUT_FILE}"
    workspace_template: ./workspace-template
```

Then reference the judge from the eval file:

```yaml
# dataset.eval.yaml
tests:
  - id: implement-feature
    criteria: Agent implements the feature correctly
    input: "Implement the TODO functions in src/index.ts"
    assert:
      - name: functional-check
        type: code_judge
        script: bun scripts/functional-check.ts
```

See `examples/features/functional-grading/` for a complete working example.
## Testing Locally

Test a code judge by piping JSON to stdin:

```sh
echo '{"question":"What is 2+2?","criteria":"4","answer":"4","reference_answer":"4","sidecar":{}}' | python validators/check_answer.py
```
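For repeatable checks, the same pattern can be wrapped in a small harness. The sketch below writes the Python example from this page to a temporary file, pipes it a payload, and asserts on the contract fields; the harness itself is ours, not part of any SDK:

```python
import json
import subprocess
import sys
import tempfile

# The Python judge from the example earlier on this page.
JUDGE = """
import json, sys
data = json.load(sys.stdin)
answer = data.get("answer", "")
hits, misses = [], []
if "42" in answer:
    hits.append("Answer contains correct value (42)")
else:
    misses.append("Answer does not contain expected value (42)")
print(json.dumps({"score": 1.0 if hits else 0.0, "hits": hits,
                  "misses": misses, "reasoning": f"Passed {len(hits)} check(s)"}))
"""


def run_judge(script_path: str, payload: dict) -> dict:
    """Run a judge script with the payload on stdin and parse its stdout."""
    proc = subprocess.run(
        [sys.executable, script_path],
        input=json.dumps(payload),
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(JUDGE)
    judge_path = f.name

result = run_judge(judge_path, {
    "question": "What is 15 + 27?",
    "criteria": "Correctly calculates 15 + 27 = 42",
    "answer": "The answer is 42.",
    "reference_answer": "42",
    "sidecar": {},
})
assert result["score"] == 1.0
print(result["reasoning"])  # Passed 1 check(s)
```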