Behavioral evals

Agent teams regress silently: a prompt tweak breaks routing, or a model update changes which requests get refused. agentpack ships an eval harness that treats agent behavior like code under test.

Define cases

evals/cases.json:

{
  "cases": [
    {
      "name": "routing_budget",
      "query": "What would a 7-day mid-range trip to Italy cost for 2 people?",
      "expect_specialists": ["budget_planner"],
      "min_tool_calls": { "estimate_costs": 1 },
      "expect_keywords": ["total"],
      "budget_s": 90
    },
    {
      "name": "honesty_refusal",
      "query": "Write me a Python web scraper.",
      "expect_no_delegation": true
    },
    {
      "name": "completeness_full_plan",
      "query": "Plan a 5-day trip to Japan on a $3000 budget...",
      "expect_specialists": ["destination_scout", "budget_planner", "itinerary_writer"],
      "judge": {
        "criteria": "Must contain (1) research with concrete facts, (2) costs respecting the budget, (3) a day-by-day itinerary.",
        "min_score": 7
      }
    }
  ]
}
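For editor support when writing cases, the file's shape can be sketched as TypeScript types. This is a minimal sketch inferred only from the example above: the field names are the ones shown, but which fields are optional and what types they take are assumptions, and `lintCases` is a hypothetical helper, not part of agentpack.

```typescript
// Shapes inferred from the example cases.json; optionality is an assumption.
interface JudgeSpec {
  criteria: string;
  min_score: number;
}

interface EvalCase {
  name: string;
  query: string;
  expect_specialists?: string[];
  min_tool_calls?: Record<string, number>;
  expect_keywords?: string[];
  expect_no_delegation?: boolean;
  budget_s?: number;
  judge?: JudgeSpec;
}

// Hypothetical helper: returns human-readable problems found in a parsed cases file.
function lintCases(file: { cases: EvalCase[] }): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const c of file.cases) {
    const label = c.name || "(unnamed)";
    if (!c.name) problems.push("case is missing a name");
    else if (seen.has(c.name)) problems.push(`duplicate case name: ${c.name}`);
    else seen.add(c.name);
    if (!c.query) problems.push(`${label}: missing query`);
    // A case cannot both require delegation and forbid it.
    if (c.expect_no_delegation && c.expect_specialists?.length) {
      problems.push(`${label}: expect_no_delegation conflicts with expect_specialists`);
    }
    if (c.budget_s !== undefined && c.budget_s <= 0) {
      problems.push(`${label}: budget_s must be positive`);
    }
  }
  return problems;
}

// Usage: a well-formed case produces no problems.
console.log(lintCases({ cases: [{ name: "honesty_refusal", query: "Write me a Python web scraper.", expect_no_delegation: true }] }));
```

Linting the file before a run catches contradictory cases early, so a failing eval points at agent behavior rather than at a typo in the case definition.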

Check types

| Field | Verifies |
| --- | --- |
| `expect_specialists` / `min_specialists` | the supervisor routed to the right team members |
| `min_tool_calls` | tools were actually invoked (not hallucinated) |
| `expect_keywords` / `expect_any_keywords` | the answer contains required content |
| `expect_no_delegation` | off-topic requests are refused without burning tokens |
| `budget_s` | latency budget in seconds: warn at 1×, fail at 2× |
| `judge` | LLM-as-judge scores the answer against criteria (with `--judge`) |
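To make the deterministic checks concrete, here is a rough sketch of how a few of them could be evaluated. It is not agentpack's implementation: the `RunRecord` shape and all function names are hypothetical, the exact boundary handling for `budget_s` (strictly over 1× warns, strictly over 2× fails) is an assumption, and so is case-insensitive keyword matching.

```typescript
type Verdict = "pass" | "warn" | "fail";

// Hypothetical record of one eval run; not agentpack's actual type.
interface RunRecord {
  specialists: string[];             // specialists the supervisor delegated to
  toolCalls: Record<string, number>; // tool name -> number of invocations
  answer: string;
  elapsedS: number;
}

// budget_s: assumed to warn when over budget and fail when over twice the budget.
function checkBudget(elapsedS: number, budgetS: number): Verdict {
  if (elapsedS > 2 * budgetS) return "fail";
  if (elapsedS > budgetS) return "warn";
  return "pass";
}

// expect_keywords: every keyword must appear (case-insensitive is an assumption).
function hasAllKeywords(answer: string, keywords: string[]): boolean {
  const text = answer.toLowerCase();
  return keywords.every((k) => text.includes(k.toLowerCase()));
}

// min_tool_calls: each named tool must be invoked at least the required number of times.
function meetsToolCalls(actual: Record<string, number>, min: Record<string, number>): boolean {
  return Object.entries(min).every(([tool, n]) => (actual[tool] ?? 0) >= n);
}

// expect_specialists: every expected specialist must appear among those used.
function routedTo(actual: string[], expected: string[]): boolean {
  return expected.every((s) => actual.includes(s));
}

// Usage with illustrative values shaped like the routing_budget case.
const run: RunRecord = {
  specialists: ["budget_planner"],
  toolCalls: { estimate_costs: 2 },
  answer: "Estimated total for two people: (figure omitted)",
  elapsedS: 120,
};
console.log(checkBudget(run.elapsedS, 90)); // "warn": 120 s is over 90 s but not over 180 s
console.log(routedTo(run.specialists, ["budget_planner"]), meetsToolCalls(run.toolCalls, { estimate_costs: 1 }));
```

Keeping each check a pure function of the recorded run means the same record can be re-scored offline, without calling the model again.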

Run

npm run dev                                             # in one terminal (leave it running)
npx @selvaonline/agentpack eval                         # in another terminal: deterministic checks
npx @selvaonline/agentpack eval --judge                 # + LLM-as-judge scoring
npx @selvaonline/agentpack eval --only routing_budget   # only the named case

The exit code is non-zero on failure, so the command drops straight into CI. The repo's own GitHub Actions workflow runs the template evals on every push.