How Vesanor Works

This page explains the core concepts behind Vesanor. Read this before diving into the technical docs — it gives you the mental model that makes everything else click.


Contracts

A contract is a set of rules for an agent's tool calls. It answers: "When my agent receives this prompt, which tools should it call, with what arguments, in what order?"

Contracts are YAML files:

tool: multi_tool_call

expect_tools:
  - check_service_health
  - pull_service_logs
  - create_incident_ticket

tool_order: strict

expected_tool_calls:
  - name: create_incident_ticket
    argument_invariants:
      - path: $.severity
        equals: "P1"

This contract says: the model must call three tools in order, and the ticket must be severity P1. If the model skips a tool, reorders them, or assigns the wrong severity — the contract fails.

Contracts are deterministic and structural. They check tool names, argument values, types, and ordering. No fuzzy matching, no LLM-as-judge. Either it passes or it doesn't.
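
For intuition, here is what a structural check like the one above boils down to, as a TypeScript sketch (the shapes and the function are illustrative, not Vesanor's evaluator):

type ToolCall = { name: string; arguments: Record<string, unknown> };

// Illustrative only: checks tool names, strict ordering, and one argument invariant.
function satisfiesContract(calls: ToolCall[]): boolean {
  const expected = ["check_service_health", "pull_service_logs", "create_incident_ticket"];
  const orderedNamesMatch =
    calls.length === expected.length &&
    calls.every((call, i) => call.name === expected[i]);
  const ticket = calls.find((call) => call.name === "create_incident_ticket");
  return orderedNamesMatch && ticket?.arguments.severity === "P1";
}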

Two sources of contracts

  • Inferred — auto-generated from observed tool calls. These check structure: fields exist, types are correct, enum values are valid. They represent what the model did.
  • Customer — written or edited by you. These encode what the model should do. Your edits always take precedence over auto-generated assertions.

Packs

A pack is a directory of related contracts with their test fixtures:

packs/my-pack/
  pack.yaml                                     # metadata
  contracts/
    incident_response.yaml                      # contract rules
  golden/
    incident_response.success.json              # expected request/response
  negative/
    incident_response.not_invoked.json          # failure case
  recordings/
    incident_response.success.recording.json    # captured response

  • Contracts define the rules
  • Golden fixtures are request/response pairs the model is tested against
  • Recordings are captured responses from real API calls, replayed offline in CI

Golden fixtures and recordings

A golden fixture is a JSON file containing a request (messages + tools) and the expected response (tool calls). It's your test case.
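
A sketch of the shape, shown here as a TypeScript object (the top-level field names are assumptions; only messages, tools, and tool calls come from the description above):

// Illustrative fixture shape, not Vesanor's exact schema.
const goldenFixture = {
  // the request: what the model receives
  request: {
    messages: [{ role: "user", content: "The payments service is returning 500s" }],
    tools: ["check_service_health", "pull_service_logs", "create_incident_ticket"],
  },
  // the expected response: which tool gets called, with which arguments
  expected: {
    tool_calls: [{ name: "check_service_health", arguments: { service: "payments" } }],
  },
};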

A recording is a captured response from a real LLM API call. When you run with --capture-recordings, Vesanor saves the provider's actual response. Later, --provider recorded replays those responses offline — no API calls, no cost, deterministic results.

The relationship:

Golden fixture (your test case)
+ Recording (captured real response)
+ Contract (your rules)
= Deterministic CI test

Fingerprints

Every test result gets a fingerprint — an 8-character hash that uniquely identifies the outcome. Same input + same output = same fingerprint.

Why this matters:

  • Regression detection — if a fingerprint changes, the model's behavior changed
  • Deduplication — identical failures across runs produce the same fingerprint, so you count unique issues instead of total noise
  • Trend tracking — the dashboard shows fingerprint history so you can see when issues appear and disappear

Fingerprints are computed on redacted data (after SecurityGate removes any secrets), so they're safe to display and compare.
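
As a rough TypeScript sketch (the actual hash algorithm and canonicalization aren't specified here, so treat this as illustrative):

import { createHash } from "node:crypto";

// Illustrative: hash the redacted input and output together, keep 8 hex characters.
// Same input + same output = same fingerprint, run after run.
function fingerprint(redactedInput: unknown, redactedOutput: unknown): string {
  const canonical = JSON.stringify({ input: redactedInput, output: redactedOutput });
  return createHash("sha256").update(canonical).digest("hex").slice(0, 8);
}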


The observe-promote-enforce loop

This is the core workflow for building and maintaining contracts:

1. Observe

Capture what your model actually does. Two ways:

  • SDK: Wrap your client with observe() — captures tool calls passively from your running application
  • CLI: Run vesanor observe with observation specs — calls the provider and captures responses

The server auto-generates draft contracts from observed calls. These drafts check structure (fields exist, types are correct) but not specific values.
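
A hedged sketch of the SDK path (the package import and the observe() signature are assumptions; only the wrap-the-client idea comes from above):

import OpenAI from "openai";
import { observe } from "vesanor"; // package name assumed

// Wrap the client once; tool calls are captured passively as your app runs.
const client = observe(new OpenAI());

// Call sites stay exactly the same.
const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "The payments service is returning 500s" }],
});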

2. Promote

Review the auto-generated contracts. Then make editorial decisions:

  • Add equals checks for values that should be constant
  • Add ordering constraints if tool sequence matters
  • Add negative test cases for failure scenarios
  • Set pass_threshold if partial success is acceptable

Promotion turns a draft into a truth contract — your enforced quality bar.

3. Enforce

Run truth contracts in CI and at runtime:

  • CI: npx vesanor --provider recorded runs contracts against recordings. Fails the pipeline on violations.
  • Runtime: replay() wraps your client and governs contracts on every live call. Blocks illegal tool calls before they execute.

When your model changes behavior, the loop restarts: observe the new behavior, review, decide whether to update contracts or fix the model.


Two-lane CI

Vesanor separates deterministic safety from live evidence:

Lane A — Hard gate (merge-blocking)

  • Runs against recorded fixtures only
  • Deterministic — same input, same output, every time
  • Offline — no API calls, no API keys, no cost
  • Fast — milliseconds per contract
  • Blocks merges on failure

Lane A answers: "Did anything break in our contract definitions or fixtures?"

Lane B — Evidence lane (advisory)

  • Runs against live LLM providers
  • Non-deterministic — model responses vary
  • Never blocks merges
  • Results feed the dashboard for trending and comparison

Lane B answers: "Does gpt-4o-mini still call our tools correctly? How does Anthropic compare?"

This separation means your merge gate never depends on third-party API availability. Recorded tests are always fast, free, and reliable.


Baselines and drift detection

A baseline (reference) is your known-good standard. It's built automatically from successful runs.

The lifecycle

CANDIDATE → ACTIVE → STALE → RETIRED
  1. CANDIDATE — a new configuration starts collecting runs. After 7 successful runs with the same configuration in a 10-day window, it auto-promotes.
  2. ACTIVE — the trusted reference. Every new run is compared against it. Regressions are flagged and can block merges.
  3. STALE — drift detected (the model's behavior shifted since the baseline was established). Comparisons become advisory.
  4. RETIRED — superseded by a newer baseline.
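
The promotion rule is mechanical enough to sketch in TypeScript (illustrative code, not Vesanor's implementation):

// Promote when any 7 successful runs of the same configuration
// fall inside a 10-day window.
function shouldPromote(successfulRunTimes: Date[]): boolean {
  const TEN_DAYS_MS = 10 * 24 * 60 * 60 * 1000;
  const sorted = [...successfulRunTimes].sort((a, b) => a.getTime() - b.getTime());
  for (let i = 0; i + 6 < sorted.length; i++) {
    if (sorted[i + 6].getTime() - sorted[i].getTime() <= TEN_DAYS_MS) return true;
  }
  return false;
}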

Baseline key

Each unique combination of six fields produces a distinct baseline:

Field                   What changes it
Contract hash           You update your contracts
Provider                You switch from OpenAI to Anthropic
Model ID                You upgrade from gpt-4o-mini to gpt-4o
Runner version          Vesanor ships an update
Environment hash        Your environment changes
Normalization profile   You change response normalization settings

Change any of these and a new CANDIDATE starts. The old baseline stays in its current state.
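
Conceptually, the key is a digest over those six fields. A sketch (the field names and encoding are illustrative):

import { createHash } from "node:crypto";

// Illustrative: changing any field changes the key, which starts a new CANDIDATE.
function baselineKey(config: {
  contractHash: string;
  provider: string;
  modelId: string;
  runnerVersion: string;
  environmentHash: string;
  normalizationProfile: string;
}): string {
  return createHash("sha256").update(JSON.stringify(config)).digest("hex");
}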


Failure classification

When a contract fails, Vesanor classifies why:

Classification          What happened
tool_not_invoked        Model returned text instead of calling a tool
wrong_tool              Model called a different tool than expected
schema_violation        Arguments don't match the expected schema
malformed_arguments     Tool arguments aren't valid JSON
path_not_found          An assertion path doesn't exist in the response
unexpected_error        Provider returned an error

Each failure also gets a fingerprint. If the same failure recurs across runs, the fingerprint stays the same — making it easy to track recurring vs. new issues.


Runtime workflow governance with replay()

Contracts in CI catch regressions before they ship. replay() catches violations at runtime before the tool executes.

The enforcement pipeline runs on every LLM call:

  1. Narrow — remove tools the model shouldn't see in the current workflow phase
  2. Pre-check — enforce session limits (cost, call count, loops)
  3. Validate — check invariants, preconditions, forbidden tools, phase transitions
  4. Gate — block or strip illegal tool calls from the response
  5. Finalize — update session state, advance phase, record evidence

Protection levels

  • Monitor — observes and captures, no blocking. Use during initial rollout, while building confidence.
  • Protect — enforces contracts locally and blocks violations. Use for production agents without a server dependency.
  • Govern — adds durable server-backed state and stronger evidence on wrapped paths. Use for shared sessions, approvals, and stricter operational controls.

You choose the level by how you configure replay(). Monitor is mode: "shadow". Protect is mode: "enforce" without an API key. Govern adds apiKey + tools for server-backed coordination and receipts.
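
Putting the three levels side by side (a hedged sketch; the package import and option shapes beyond mode, apiKey, and tools are assumptions):

import OpenAI from "openai";
import { replay } from "vesanor"; // package name assumed

// Monitor: observe and capture, never block.
const monitored = replay(new OpenAI(), { mode: "shadow" });

// Protect: enforce contracts locally, block violations. No API key needed.
const protectedClient = replay(new OpenAI(), { mode: "enforce" });

// Govern: server-backed state, approvals, and receipts.
const governed = replay(new OpenAI(), {
  mode: "enforce",
  apiKey: process.env.VESANOR_API_KEY,
  tools: ["check_service_health", "pull_service_logs", "create_incident_ticket"],
});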


Provider abstraction

Write contracts once, run them against any provider. Vesanor normalizes the differences:

  • Tool definitions — OpenAI wraps in { type: "function", function: {...} }, Anthropic uses { input_schema }. The adapter translates.
  • Tool call responses — OpenAI returns function.arguments as a JSON string, Anthropic returns input as an object. Both normalize to { id, name, arguments }.
  • System messages — OpenAI uses role: "system", Anthropic expects a separate system field. The adapter handles it.

Your assertion $.tool_calls[0].name works identically regardless of provider.
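
A sketch of the normalization described above (illustrative, not the adapter's actual code):

type NormalizedToolCall = { id: string; name: string; arguments: Record<string, unknown> };

// OpenAI: function.arguments arrives as a JSON string, so parse it.
function fromOpenAI(call: { id: string; function: { name: string; arguments: string } }): NormalizedToolCall {
  return { id: call.id, name: call.function.name, arguments: JSON.parse(call.function.arguments) };
}

// Anthropic: input is already an object, so pass it through.
function fromAnthropic(block: { id: string; name: string; input: Record<string, unknown> }): NormalizedToolCall {
  return { id: block.id, name: block.name, arguments: block.input };
}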


The dashboard

The dashboard at app.vesanor.com ties everything together:

Page          What it shows
Today         Health snapshot — determinism score, attention items, activity feed
Contracts     Auto-generated and customer contracts with coverage analysis and a review workspace
Guard         Validation coverage — how many tool calls are being checked and what's failing
Shadow        Side-by-side provider comparisons with verdicts (safe to switch? needs review?)
Changes       Semantic diffs — what changed across contracts, runs, and baselines
References    Baseline lifecycle — trust posture, promotion status, drift alerts
Runs          Full run history with filtering, search, and drilldown

Next steps

Now that you understand the concepts, you're ready to dive into the technical docs.