# Code Judges
Code judges are scripts that evaluate agent responses deterministically. Write them in any language — Python, TypeScript, Node, or any executable.
## Contract

Code judges communicate via stdin/stdout JSON:
Input (stdin):

```json
{
  "question": "What is 15 + 27?",
  "criteria": "Correctly calculates 15 + 27 = 42",
  "answer": "The answer is 42.",
  "reference_answer": "42",
  "sidecar": {}
}
```

Output (stdout):

```json
{
  "score": 1.0,
  "hits": ["Answer contains correct value (42)"],
  "misses": [],
  "reasoning": "Passed 1 check(s)"
}
```

| Output Field | Type | Description |
|---|---|---|
| `score` | number | 0.0 to 1.0 |
| `hits` | string[] | Criteria that passed |
| `misses` | string[] | Criteria that failed |
| `reasoning` | string | Explanation of the score |
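A quick way to sanity-check a judge against this contract is to validate its stdout. The following is a minimal Python sketch; the helper name is ours for illustration and not part of any SDK:

```python
import json


def validate_judge_output(raw: str) -> dict:
    """Parse a judge's stdout and check it against the output contract."""
    result = json.loads(raw)
    score = result["score"]
    assert isinstance(score, (int, float)) and 0.0 <= score <= 1.0, "score must be 0.0-1.0"
    assert all(isinstance(h, str) for h in result.get("hits", [])), "hits must be strings"
    assert all(isinstance(m, str) for m in result.get("misses", [])), "misses must be strings"
    return result


out = validate_judge_output(
    '{"score": 1.0, "hits": ["Answer contains correct value (42)"],'
    ' "misses": [], "reasoning": "Passed 1 check(s)"}'
)
print(out["score"])  # 1.0
```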
## Python Example

```python
import json, sys

data = json.load(sys.stdin)
answer = data.get("answer", "")

hits = []
misses = []

if "42" in answer:
    hits.append("Answer contains correct value (42)")
else:
    misses.append("Answer does not contain expected value (42)")

score = 1.0 if hits else 0.0

print(json.dumps({
    "score": score,
    "hits": hits,
    "misses": misses,
    "reasoning": f"Passed {len(hits)} check(s)"
}))
```

## TypeScript Example
```typescript
import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const answer: string = data.answer ?? "";

const hits: string[] = [];
const misses: string[] = [];

if (answer.includes("42")) {
  hits.push("Answer contains correct value (42)");
} else {
  misses.push("Answer does not contain expected value (42)");
}

console.log(JSON.stringify({
  score: hits.length > 0 ? 1.0 : 0.0,
  hits,
  misses,
  reasoning: `Passed ${hits.length} check(s)`,
}));
```

## Referencing in Eval Files
```yaml
assert:
  - name: my_validator
    type: code_judge
    script: ./validators/check_answer.py
```

## @agentv/eval SDK

The `@agentv/eval` package provides a declarative API with automatic stdin/stdout handling. Use `defineCodeJudge` to skip boilerplate:
```typescript
#!/usr/bin/env bun
import { defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(({ answer, criteria }) => {
  const hits: string[] = [];
  const misses: string[] = [];

  if (answer.includes(criteria)) {
    hits.push('Answer matches expected outcome');
  } else {
    misses.push('Answer does not match expected outcome');
  }

  const total = hits.length + misses.length;
  return {
    score: total === 0 ? 0 : hits.length / total,
    hits,
    misses,
    reasoning: `Passed ${hits.length}/${total} checks`,
  };
});
```

SDK exports: `defineCodeJudge`, `Message`, `ToolCall`, `TraceSummary`, `CodeJudgeInput`, `CodeJudgeResult`
## Target Access

Code judges can call an LLM through a target proxy for metrics that require multiple LLM calls (contextual precision, semantic similarity, etc.).
### Configuration

Add a `target` block to the evaluator config:

```yaml
assert:
  - name: contextual-precision
    type: code_judge
    script: bun scripts/contextual-precision.ts
    target:
      max_calls: 10 # Default: 50
```

Use `createTargetClient` from the SDK:
```typescript
#!/usr/bin/env bun
import { createTargetClient, defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(async ({ question, answer }) => {
  const target = createTargetClient();
  if (!target) return { score: 0, misses: ['Target not configured'] };

  const response = await target.invoke({
    question: `Is this relevant to: ${question}? Response: ${answer}`,
    systemPrompt: 'Respond with JSON: { "relevant": true/false }'
  });

  const result = JSON.parse(response.rawText ?? '{}');
  return { score: result.relevant ? 1.0 : 0.0 };
});
```

Use `target.invokeBatch(requests)` for multiple calls in parallel.
Environment variables (set automatically when target is configured):

| Variable | Description |
|---|---|
| `AGENTV_TARGET_PROXY_URL` | Local proxy URL |
| `AGENTV_TARGET_PROXY_TOKEN` | Bearer token for authentication |
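Judges written without the SDK can read these variables directly. Below is a hedged Python sketch of the availability check, mirroring the `if (!target)` guard in the SDK example above; the proxy's actual HTTP request format is handled by the SDK and is not specified here:

```python
import os


def target_available() -> bool:
    """True when the runner has exported both target proxy variables."""
    return bool(os.environ.get("AGENTV_TARGET_PROXY_URL")) and bool(
        os.environ.get("AGENTV_TARGET_PROXY_TOKEN")
    )


if not target_available():
    # Without a configured target, a judge should degrade gracefully,
    # e.g. emit {"score": 0, "misses": ["Target not configured"]}
    pass
```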
## Advanced Input Fields

Beyond the basic `question`, `criteria`, `answer`, and `reference_answer` fields, code judges receive additional context:

| Field | Type | Description |
|---|---|---|
| `guideline_files` | string[] | Paths to guideline files referenced in the eval |
| `input_files` | string[] | Paths to input files referenced in the eval |
| `input` | Message[] | Full resolved input message array |
| `expected_output` | Message[] | Expected agent behavior including tool calls |
| `output` | Message[] | Actual agent execution trace with tool calls |
| `trace` | TraceSummary | Lightweight execution metrics |
| `file_changes` | string \| null | Unified diff of workspace file changes (when `workspace_template` is configured) |
| `workspace_path` | string \| null | Absolute path to the workspace directory (when `workspace_template` is configured) |
### trace structure

```json
{
  "event_count": 5,
  "tool_names": ["fetch", "search"],
  "tool_calls_by_name": { "search": 2, "fetch": 1 },
  "error_count": 0,
  "llm_call_count": 2,
  "token_usage": { "input": 1000, "output": 500 },
  "cost_usd": 0.0015,
  "duration_ms": 3500,
  "start_time": "2026-02-13T10:00:00.000Z",
  "end_time": "2026-02-13T10:00:03.500Z"
}
```

| Field | Type | Description |
|---|---|---|
| `event_count` | number | Total tool invocations |
| `tool_names` | string[] | Unique tool names used |
| `tool_calls_by_name` | Record<string, number> | Count per tool |
| `error_count` | number | Failed tool calls |
| `llm_call_count` | number | Number of LLM calls (assistant messages) |
| `token_usage` | {input, output} | Token consumption |
| `cost_usd` | number | Estimated cost |
| `duration_ms` | number | Total execution duration |
| `start_time` | string | ISO timestamp of first event |
| `end_time` | string | ISO timestamp of last event |
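Because these fields are plain data, a judge can grade efficiency from `trace` alone. Here is a minimal Python sketch using only the documented fields; the tool-call budget of 10 is a hypothetical value chosen for illustration, not a product default:

```python
import json


def grade_trace(trace: dict, max_tool_calls: int = 10) -> dict:
    """Score from trace metrics alone: no tool errors, and within a call budget."""
    hits, misses = [], []

    if trace.get("error_count", 0) == 0:
        hits.append("No failed tool calls")
    else:
        misses.append(f"{trace['error_count']} tool call(s) failed")

    # Hypothetical budget for illustration
    if trace.get("event_count", 0) <= max_tool_calls:
        hits.append("Within tool-call budget")
    else:
        misses.append("Exceeded tool-call budget")

    total = len(hits) + len(misses)
    return {
        "score": len(hits) / total if total else 0.0,
        "hits": hits,
        "misses": misses,
        "reasoning": f"Passed {len(hits)}/{total} check(s)",
    }


# Using the sample trace values from above:
print(json.dumps(grade_trace({"event_count": 5, "error_count": 0})))
```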
Use `expected_output` for retrieval context in RAG evals (tool calls with outputs) and `output` for the actual agent execution trace from live runs.
## Workspace Access

When `workspace_template` is configured on a target, code judges receive the workspace path in two ways:

- JSON payload: `workspace_path` field in the stdin input
- Environment variable: `AGENTV_WORKSPACE_PATH`
This enables functional grading: running commands like `npm test`, `pytest`, or `cargo test` directly in the agent's workspace.
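The two delivery mechanisms above can be combined in a small helper. A minimal Python sketch (the helper name is ours for illustration):

```python
import os
from typing import Optional


def resolve_workspace(payload: dict) -> Optional[str]:
    """Prefer the workspace_path field from stdin; fall back to the env var."""
    return payload.get("workspace_path") or os.environ.get("AGENTV_WORKSPACE_PATH")
```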
### Example: Deploy-and-Test Pattern

```typescript
#!/usr/bin/env bun
import { readFileSync } from "fs";
import { execFileSync } from "child_process";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const cwd = input.workspace_path;

const hits: string[] = [];
const misses: string[] = [];

// Stage 1: Install dependencies
try {
  execFileSync("npm", ["install"], { cwd, stdio: "pipe" });
  hits.push("npm install passed");
} catch {
  misses.push("npm install failed");
}

// Stage 2: Typecheck
try {
  execFileSync("npx", ["tsc", "--noEmit"], { cwd, stdio: "pipe" });
  hits.push("typecheck passed");
} catch {
  misses.push("typecheck failed");
}

// Stage 3: Run tests
try {
  execFileSync("npm", ["test"], { cwd, stdio: "pipe" });
  hits.push("tests passed");
} catch {
  misses.push("tests failed");
}

const total = hits.length + misses.length;
console.log(JSON.stringify({
  score: total > 0 ? hits.length / total : 0,
  hits,
  misses,
}));
```

Configure the target with a workspace template:

```yaml
targets:
  - name: my_agent
    provider: cli
    command_template: "my-agent --task {INPUT_FILE} --output {OUTPUT_FILE}"
    workspace_template: ./workspace-template
```

Then reference the judge from the eval file:

```yaml
# dataset.eval.yaml
tests:
  - id: implement-feature
    criteria: Agent implements the feature correctly
    input: "Implement the TODO functions in src/index.ts"
    assert:
      - name: functional-check
        type: code_judge
        script: bun scripts/functional-check.ts
```

See `examples/features/functional-grading/` for a complete working example.
## Testing Locally

Test a code judge by piping JSON to stdin:

```sh
echo '{"question":"What is 2+2?","criteria":"4","answer":"4","reference_answer":"4","sidecar":{}}' | python validators/check_answer.py
```
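For repeatable checks, the same pattern can be wrapped in a small harness. The sketch below writes the Python example from this page to a temporary file, pipes it a payload, and asserts on the contract fields; the harness itself is ours, not part of any SDK:

```python
import json
import subprocess
import sys
import tempfile

# The Python judge from the example earlier on this page.
JUDGE = """
import json, sys
data = json.load(sys.stdin)
answer = data.get("answer", "")
hits, misses = [], []
if "42" in answer:
    hits.append("Answer contains correct value (42)")
else:
    misses.append("Answer does not contain expected value (42)")
print(json.dumps({"score": 1.0 if hits else 0.0, "hits": hits,
                  "misses": misses, "reasoning": f"Passed {len(hits)} check(s)"}))
"""


def run_judge(script_path: str, payload: dict) -> dict:
    """Run a judge script with the payload on stdin and parse its stdout."""
    proc = subprocess.run(
        [sys.executable, script_path],
        input=json.dumps(payload),
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(JUDGE)
    judge_path = f.name

result = run_judge(judge_path, {
    "question": "What is 15 + 27?",
    "criteria": "Correctly calculates 15 + 27 = 42",
    "answer": "The answer is 42.",
    "reference_answer": "42",
    "sidecar": {},
})
assert result["score"] == 1.0
print(result["reasoning"])  # Passed 1 check(s)
```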