# Execution Metrics
AgentV provides built-in evaluators for checking execution metrics against thresholds. These are useful for enforcing efficiency constraints without writing custom code.
## execution_metrics

The `execution_metrics` evaluator provides declarative, threshold-based checks on multiple metrics in a single evaluator.
```yaml
assert:
  - name: efficiency
    type: execution_metrics
    max_tool_calls: 10            # Maximum tool invocations
    max_llm_calls: 5              # Maximum LLM calls (assistant messages)
    max_tokens: 5000              # Maximum total tokens (input + output)
    max_cost_usd: 0.05            # Maximum cost in USD
    max_duration_ms: 30000        # Maximum execution duration in ms
    target_exploration_ratio: 0.6 # Target ratio of read-only tool calls
    exploration_tolerance: 0.2    # Tolerance for ratio check (default: 0.2)
```

## Behavior

- Only specified thresholds are checked: omit fields you don't care about
- Score is proportional: `hits / (hits + misses)`
- Missing data counts as a miss: if you check `max_tokens` but no token data is available, that check fails
- All thresholds are "max" constraints: values must be ≤ the specified threshold
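The scoring rules above can be sketched in a few lines of Python. This is an illustrative reimplementation, not AgentV's actual code: the function name and the dict-based interface are hypothetical.

```python
def score_execution_metrics(thresholds: dict, metrics: dict) -> float:
    """Score = hits / (hits + misses); a missing metric counts as a miss."""
    hits = misses = 0
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value <= limit:  # all thresholds are "max" constraints
            hits += 1
        else:                                     # over limit, or no data available
            misses += 1
    return hits / (hits + misses) if (hits + misses) else 1.0

# Example: tool calls and cost pass, tokens exceed the limit
score = score_execution_metrics(
    {"max_tool_calls": 10, "max_tokens": 5000, "max_cost_usd": 0.05},
    {"max_tool_calls": 8, "max_tokens": 6200, "max_cost_usd": 0.03},
)
# score == 2/3 (two of three checks passed)
```

Note how omitted thresholds never appear in the loop, which is why unchecked metrics cannot affect the score.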
## Threshold Options

| Option | Type | Description |
|---|---|---|
| `max_tool_calls` | number | Maximum number of tool invocations |
| `max_llm_calls` | number | Maximum LLM calls (counts assistant messages) |
| `max_tokens` | number | Maximum total tokens (input + output combined) |
| `max_cost_usd` | number | Maximum cost in USD |
| `max_duration_ms` | number | Maximum execution duration in milliseconds |
| `target_exploration_ratio` | number | Target ratio of read-only tool calls (0–1) |
| `exploration_tolerance` | number | Tolerance around the target ratio (default: 0.2) |
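Unlike the "max" thresholds, the exploration check is two-sided: it passes when the observed ratio of read-only tool calls falls within `target_exploration_ratio ± exploration_tolerance`. A minimal sketch of that condition (the helper below is illustrative, not AgentV's implementation; how zero tool calls are handled is an assumption here):

```python
def exploration_ratio_ok(read_only_calls: int, total_calls: int,
                         target: float = 0.6, tolerance: float = 0.2) -> bool:
    """Pass when read_only_calls / total_calls is within target +/- tolerance."""
    if total_calls == 0:
        return False  # nothing to measure; treated as a miss in this sketch
    ratio = read_only_calls / total_calls
    return abs(ratio - target) <= tolerance

print(exploration_ratio_ok(7, 10))  # ratio 0.7, inside 0.6 +/- 0.2 -> True
print(exploration_ratio_ok(2, 10))  # ratio 0.2, outside the band   -> False
```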
## Example: Comprehensive Efficiency Check

```yaml
tests:
  - id: efficient-research
    criteria: Agent researches and summarizes efficiently
    input: Research the topic and provide a summary
    assert:
      - name: efficiency
        type: execution_metrics
        max_tool_calls: 15
        max_llm_calls: 5
        max_tokens: 8000
        max_cost_usd: 0.10
        max_duration_ms: 60000
```

## Example: Exploration Balance
Section titled “Example: Exploration Balance”Check that an agent maintains a good balance between reading (exploration) and writing (action):
```yaml
assert:
  - name: exploration-balance
    type: execution_metrics
    target_exploration_ratio: 0.6 # 60% should be read-only tools
    exploration_tolerance: 0.2    # Allow ±20% variance
```

## Single-Metric Evaluators

For simple single-threshold checks, AgentV also provides dedicated evaluators:
### latency

```yaml
- name: speed
  type: latency
  max_ms: 5000
```

Fails if execution duration exceeds the threshold.

### cost

```yaml
- name: budget
  type: cost
  max_usd: 0.10
```

Fails if execution cost exceeds the threshold.
### token_usage

```yaml
- name: tokens
  type: token_usage
  max_total_tokens: 4000
```

Fails if total token usage exceeds the threshold.
## When to Use Each

| Scenario | Recommended Evaluator |
|---|---|
| Check multiple metrics at once | `execution_metrics` |
| Simple single-threshold check | `latency`, `cost`, or `token_usage` |
| Complex custom formulas | `code_judge` with a custom script |
## Combining with Other Evaluators

Execution metrics work well alongside semantic evaluators:

```yaml
tests:
  - id: code-generation
    criteria: Generates correct, efficient code
    input: Write a sorting algorithm
    assert:
      # Semantic quality
      - name: quality
        type: llm_judge
        prompt: ./prompts/code-quality.md

      # Efficiency constraints
      - name: efficiency
        type: execution_metrics
        max_tool_calls: 10
        max_duration_ms: 30000
```