# How Vesanor Works
This page explains the core concepts behind Vesanor. Read this before diving into the technical docs — it gives you the mental model that makes everything else click.
## Contracts
A contract is a set of rules for a tool call. It answers: "When my agent receives this prompt, what tools should it call, with what arguments, in what order?"
Contracts are YAML files:

```yaml
tool: multi_tool_call
expect_tools:
  - check_service_health
  - pull_service_logs
  - create_incident_ticket
tool_order: strict
expected_tool_calls:
  - name: create_incident_ticket
    argument_invariants:
      - path: $.severity
        equals: "P1"
```
This contract says: the model must call three tools in order, and the ticket must be severity P1. If the model skips a tool, reorders them, or assigns the wrong severity — the contract fails.
Contracts are deterministic and structural. They check tool names, argument values, types, and ordering. No fuzzy matching, no LLM-as-judge. Either it passes or it doesn't.
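The structural nature of these checks can be sketched in a few lines of Python. This is an illustrative model of the idea, not Vesanor's implementation, and it simplifies the JSONPath `$.severity` to a top-level field lookup:

```python
def check_contract(tool_calls, expect_tools, invariants):
    """Structural contract check: tool names, strict order, argument values."""
    names = [c["name"] for c in tool_calls]
    # tool_order: strict -- the observed sequence must match exactly
    if names != expect_tools:
        return False, "wrong_tool_or_order"
    # argument invariants, e.g. severity must equal "P1"
    for call in tool_calls:
        for inv in invariants.get(call["name"], []):
            if call["arguments"].get(inv["field"]) != inv["equals"]:
                return False, "invariant_violation"
    return True, "pass"

calls = [
    {"name": "check_service_health", "arguments": {}},
    {"name": "pull_service_logs", "arguments": {}},
    {"name": "create_incident_ticket", "arguments": {"severity": "P1"}},
]
ok, reason = check_contract(
    calls,
    ["check_service_health", "pull_service_logs", "create_incident_ticket"],
    {"create_incident_ticket": [{"field": "severity", "equals": "P1"}]},
)
print(ok, reason)  # True pass
```

Every branch is a plain structural comparison, which is what makes the result binary and repeatable.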
### Two sources of contracts
- Inferred — auto-generated from observed tool calls. These check structure: fields exist, types are correct, enum values are valid. They represent what the model did.
- Customer — written or edited by you. These encode what the model should do. Your edits always take precedence over auto-generated assertions.
## Packs
A pack is a directory of related contracts with their test fixtures:

```text
packs/my-pack/
  pack.yaml                                   # metadata
  contracts/
    incident_response.yaml                    # contract rules
  golden/
    incident_response.success.json            # expected request/response
  negative/
    incident_response.not_invoked.json        # failure case
  recordings/
    incident_response.success.recording.json  # captured response
```
- Contracts define the rules
- Golden fixtures are request/response pairs the model is tested against
- Recordings are captured responses from real API calls, replayed offline in CI
## Golden fixtures and recordings
A golden fixture is a JSON file containing a request (messages + tools) and the expected response (tool calls). It's your test case.
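A minimal fixture might look like this. The field names here are illustrative, not the exact fixture schema:

```json
{
  "request": {
    "messages": [
      { "role": "user", "content": "The checkout service is down." }
    ],
    "tools": ["check_service_health", "pull_service_logs", "create_incident_ticket"]
  },
  "expected_response": {
    "tool_calls": [
      { "name": "check_service_health", "arguments": { "service": "checkout" } }
    ]
  }
}
```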
A recording is a captured response from a real LLM API call. When you run with `--capture-recordings`, Vesanor saves the provider's actual response. Later, `--provider recorded` replays those responses offline — no API calls, no cost, deterministic results.
The relationship:

```text
  Golden fixture (your test case)
+ Recording (captured real response)
+ Contract (your rules)
= Deterministic CI test
```
## Fingerprints
Every test result gets a fingerprint — an 8-character hash that uniquely identifies the outcome. Same input + same output = same fingerprint.
Why this matters:
- Regression detection — if a fingerprint changes, the model's behavior changed
- Deduplication — identical failures across runs produce the same fingerprint, so you count unique issues instead of total noise
- Trend tracking — the dashboard shows fingerprint history so you can see when issues appear and disappear
Fingerprints are computed on redacted data (after SecurityGate removes any secrets), so they're safe to display and compare.
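Conceptually, a fingerprint is a stable hash over the canonicalized, redacted input and output. The actual hashing scheme is internal to Vesanor; this sketch only shows why canonicalization makes the result deterministic:

```python
import hashlib
import json

def fingerprint(request: dict, response: dict) -> str:
    """Stable 8-char hash: same input + same output => same fingerprint."""
    canonical = json.dumps(
        {"request": request, "response": response},
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]

a = fingerprint({"prompt": "hi"}, {"tool": "search"})
b = fingerprint({"prompt": "hi"}, {"tool": "search"})
assert a == b  # identical runs collapse to one fingerprint
```

Because the hash is computed after redaction and canonicalization, two runs that differ only in formatting or secret values still compare equal.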
## The observe-promote-enforce loop
This is the core workflow for building and maintaining contracts:
### 1. Observe
Capture what your model actually does. Two ways:

- SDK: Wrap your client with `observe()` — captures tool calls passively from your running application
- CLI: Run `vesanor observe` with observation specs — calls the provider and captures responses
The server auto-generates draft contracts from observed calls. These drafts check structure (fields exist, types are correct) but not specific values.
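Draft inference can be pictured as deriving a shape from an observed call: record which argument fields appeared and their JSON types, without pinning values. A simplified sketch, with an output format invented for illustration:

```python
def infer_draft(observed_call: dict) -> dict:
    """Derive a structural draft contract from one observed tool call."""
    type_names = {str: "string", int: "integer", float: "number", bool: "boolean"}
    fields = {
        key: type_names.get(type(value), "object")
        for key, value in observed_call["arguments"].items()
    }
    # Drafts check structure only: field presence and type, not values.
    return {"tool": observed_call["name"], "argument_types": fields}

draft = infer_draft({"name": "create_incident_ticket",
                     "arguments": {"severity": "P1", "retries": 3}})
print(draft)
# {'tool': 'create_incident_ticket', 'argument_types': {'severity': 'string', 'retries': 'integer'}}
```

Note what the draft cannot know: that `severity` should always be `"P1"`. That judgment is yours, which is what the promote step is for.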
### 2. Promote
Review the auto-generated contracts. Then make editorial decisions:

- Add `equals` checks for values that should be constant
- Add ordering constraints if tool sequence matters
- Add negative test cases for failure scenarios
- Set `pass_threshold` if partial success is acceptable
Promotion turns a draft into a truth contract — your enforced quality bar.
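For example, a promoted version of the contract shown earlier might add the value and ordering constraints from review. The `pass_threshold` value here is illustrative; check the contract YAML reference for its exact scale:

```yaml
tool: multi_tool_call
expect_tools:
  - check_service_health
  - pull_service_logs
  - create_incident_ticket
tool_order: strict          # added during review: sequence matters
expected_tool_calls:
  - name: create_incident_ticket
    argument_invariants:
      - path: $.severity
        equals: "P1"        # added during review: value must be constant
pass_threshold: 1.0         # no partial credit for this workflow
```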
### 3. Enforce
Run truth contracts in CI and at runtime:
- CI: `npx vesanor --provider recorded` runs contracts against recordings. Fails the pipeline on violations.
- Runtime: `replay()` wraps your client and enforces contracts on every live call. Blocks illegal tool calls before they execute.
When your model changes behavior, the loop restarts: observe the new behavior, review, decide whether to update contracts or fix the model.
## Two-lane CI
Vesanor separates deterministic safety from live evidence:
### Lane A — Hard gate (merge-blocking)
- Runs against recorded fixtures only
- Deterministic — same input, same output, every time
- Offline — no API calls, no API keys, no cost
- Fast — milliseconds per contract
- Blocks merges on failure
Lane A answers: "Did anything break in our contract definitions or fixtures?"
### Lane B — Evidence lane (advisory)
- Runs against live LLM providers
- Non-deterministic — model responses vary
- Never blocks merges
- Results feed the dashboard for trending and comparison
Lane B answers: "Does gpt-4o-mini still call our tools correctly? How does Anthropic compare?"
This separation means your merge gate never depends on third-party API availability. Recorded tests are always fast, free, and reliable.
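As a sketch, the two lanes might look like this in GitHub Actions. The job names and the live-lane provider flag are illustrative; only `npx vesanor --provider recorded` comes from the CLI usage shown above:

```yaml
jobs:
  contracts-recorded:         # Lane A: hard gate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx vesanor --provider recorded  # offline, deterministic, blocks merge

  contracts-live:             # Lane B: evidence, never blocks
    runs-on: ubuntu-latest
    continue-on-error: true   # advisory: failures don't gate the merge
    steps:
      - uses: actions/checkout@v4
      - run: npx vesanor --provider openai    # illustrative flag; hits the live API
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```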
## Baselines and drift detection
A baseline (reference) is your known-good standard. It's built automatically from successful runs.
### The lifecycle
```text
CANDIDATE → ACTIVE → STALE → RETIRED
```
- CANDIDATE — a new configuration starts collecting runs. After 7 successful runs with the same configuration in a 10-day window, it auto-promotes.
- ACTIVE — the trusted reference. Every new run is compared against it. Regressions are flagged and can block merges.
- STALE — drift detected (the model's behavior shifted since the baseline was established). Comparisons become advisory.
- RETIRED — superseded by a newer baseline.
### Baseline key
Each unique combination of six fields produces a distinct baseline:
| Field | What changes it |
|---|---|
| Contract hash | You update your contracts |
| Provider | You switch from OpenAI to Anthropic |
| Model ID | You upgrade from gpt-4o-mini to gpt-4o |
| Runner version | Vesanor ships an update |
| Environment hash | Your environment changes |
| Normalization profile | You change response normalization settings |
Change any of these and a new CANDIDATE starts. The old baseline stays in its current state.
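The baseline key behaves like a composite identity over those six fields: change any one and you get a different key. A minimal sketch of that behavior, with hash length and field values invented for illustration:

```python
import hashlib

def baseline_key(contract_hash, provider, model_id,
                 runner_version, env_hash, norm_profile) -> str:
    """Composite identity: any field change starts a new CANDIDATE baseline."""
    parts = [contract_hash, provider, model_id,
             runner_version, env_hash, norm_profile]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]

k1 = baseline_key("abc123", "openai", "gpt-4o-mini", "1.4.0", "env9", "default")
k2 = baseline_key("abc123", "openai", "gpt-4o",      "1.4.0", "env9", "default")
assert k1 != k2  # upgrading the model starts a new baseline
```

This is why a model upgrade never silently inherits the old model's trust: the new key begins life as a CANDIDATE and must earn promotion.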
## Failure classification
When a contract fails, Vesanor classifies why:
| Classification | What happened |
|---|---|
| `tool_not_invoked` | Model returned text instead of calling a tool |
| `wrong_tool` | Model called a different tool than expected |
| `schema_violation` | Arguments don't match the expected schema |
| `malformed_arguments` | Tool arguments aren't valid JSON |
| `path_not_found` | An assertion path doesn't exist in the response |
| `unexpected_error` | Provider returned an error |
Each failure also gets a fingerprint. If the same failure recurs across runs, the fingerprint stays the same — making it easy to track recurring vs. new issues.
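A classifier like this can be sketched as a series of checks on the normalized response. The labels match the table; the decision logic and the `raw_arguments` field name are simplified illustrations, and `path_not_found` is omitted for brevity:

```python
import json

def classify_failure(response: dict, expected_tool: str) -> str:
    """Map a failed result to one of the classification labels above."""
    if response.get("error"):
        return "unexpected_error"        # provider returned an error
    calls = response.get("tool_calls", [])
    if not calls:
        return "tool_not_invoked"        # model answered with text only
    call = calls[0]
    if call["name"] != expected_tool:
        return "wrong_tool"
    try:
        json.loads(call["raw_arguments"])
    except (ValueError, TypeError):
        return "malformed_arguments"     # arguments aren't valid JSON
    return "schema_violation"            # parsed, but didn't match the schema

print(classify_failure({"tool_calls": []}, "create_incident_ticket"))
# tool_not_invoked
```

The ordering matters: cheaper, more specific causes are ruled out first, so each failure lands in exactly one bucket.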
## Runtime workflow governance with `replay()`
Contracts in CI catch regressions before they ship. `replay()` catches violations at runtime, before the tool executes.
The enforcement pipeline runs on every LLM call:

1. Narrow — remove tools the model shouldn't see in the current workflow phase
2. Pre-check — enforce session limits (cost, call count, loops)
3. Validate — check invariants, preconditions, forbidden tools, phase transitions
4. Gate — block or strip illegal tool calls from the response
5. Finalize — update session state, advance phase, record evidence
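The narrow and gate steps both reduce to filtering against a per-phase allow-list. A minimal sketch of that idea, with phase names and tool assignments invented for illustration:

```python
# Tools each workflow phase is allowed to use (illustrative phases)
PHASE_TOOLS = {
    "diagnose": {"check_service_health", "pull_service_logs"},
    "remediate": {"create_incident_ticket"},
}

def narrow(tools: list[str], phase: str) -> list[str]:
    """Narrow: hide tools the model shouldn't see in this phase."""
    return [t for t in tools if t in PHASE_TOOLS[phase]]

def gate(tool_calls: list[dict], phase: str) -> list[dict]:
    """Gate: strip illegal tool calls from the response before execution."""
    return [c for c in tool_calls if c["name"] in PHASE_TOOLS[phase]]

visible = narrow(["check_service_health", "create_incident_ticket"], "diagnose")
print(visible)  # ['check_service_health']
```

Narrowing is preventive (the model never sees the tool); gating is the backstop if the model attempts a call anyway.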
### Protection levels
| Level | What it does | When to use |
|---|---|---|
| Monitor | Observes and captures, no blocking | Initial rollout, building confidence |
| Protect | Enforces contracts locally, blocks violations | Production agents without server dependency |
| Govern | Adds durable server-backed state and stronger evidence on wrapped paths | Shared sessions, approvals, stricter operational controls |
You choose the level by how you configure `replay()`. Monitor is `mode: "shadow"`. Protect is `mode: "enforce"` without an API key. Govern adds `apiKey` + `tools` for server-backed coordination and receipts.
## Provider abstraction
Write contracts once, run them against any provider. Vesanor normalizes the differences:

- Tool definitions — OpenAI wraps in `{ type: "function", function: {...} }`, Anthropic uses `{ input_schema }`. The adapter translates.
- Tool call responses — OpenAI returns `function.arguments` as a JSON string, Anthropic returns `input` as an object. Both normalize to `{ id, name, arguments }`.
- System messages — OpenAI uses `role: "system"`, Anthropic expects a separate `system` field. The adapter handles it.
Your assertion `$.tool_calls[0].name` works identically regardless of provider.
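The tool-call normalization in particular is easy to picture: both provider shapes collapse to the same `{ id, name, arguments }` record. A sketch of the translation:

```python
import json

def normalize_openai(tc: dict) -> dict:
    # OpenAI: arguments arrive as a JSON string under function.arguments
    return {
        "id": tc["id"],
        "name": tc["function"]["name"],
        "arguments": json.loads(tc["function"]["arguments"]),
    }

def normalize_anthropic(block: dict) -> dict:
    # Anthropic: input is already a parsed object
    return {"id": block["id"], "name": block["name"], "arguments": block["input"]}

openai_call = {"id": "c1", "type": "function",
               "function": {"name": "pull_service_logs",
                            "arguments": "{\"service\": \"checkout\"}"}}
anthropic_call = {"id": "c1", "type": "tool_use", "name": "pull_service_logs",
                  "input": {"service": "checkout"}}
assert normalize_openai(openai_call) == normalize_anthropic(anthropic_call)
```

Once both sides land on the same record, a single JSONPath assertion can run against either provider's output.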
## The dashboard
The dashboard at app.vesanor.com ties everything together:
| Page | What it shows |
|---|---|
| Today | Health snapshot — determinism score, attention items, activity feed |
| Contracts | Auto-generated and customer contracts with coverage analysis and a review workspace |
| Guard | Validation coverage — how many tool calls are being checked and what's failing |
| Shadow | Side-by-side provider comparisons with verdicts (safe to switch? needs review?) |
| Changes | Semantic diffs — what changed across contracts, runs, and baselines |
| References | Baseline lifecycle — trust posture, promotion status, drift alerts |
| Runs | Full run history with filtering, search, and drilldown |
## Next steps
Now that you understand the concepts:
- Quickstart — get your first test running
- SDK Integration — wrap your client with `observe()` and `validate()`
- Writing Tests — full contract YAML reference
- Replay Quickstart — workflow governance in 5 minutes