# Shadow Mode
Shadow mode runs the full enforcement pipeline without blocking anything. Your agent behaves normally while `replay()` records what it would have done — which tools would have been removed, which calls would have been blocked, and why.
## Why shadow mode
You've written contracts. You think they're right. But enabling enforcement on a live agent is risky — what if a legitimate tool call gets blocked?
Shadow mode answers: "What would enforcement do to my real traffic?" without any risk.
## How to use it

```ts
const session = replay(client, {
  contractsDir: "./contracts",
  agent: "my-agent",
  mode: "shadow", // Compute but don't apply
  apiKey: process.env.VESANOR_API_KEY, // Send captures to dashboard
});

// Agent runs normally — nothing blocked, nothing modified
const response = await session.client.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  tools,
});
// Response is returned unmodified — even if contracts would block it
```
## What shadow captures

Every call produces a `shadow_delta` — a record of what enforcement would have done:

### `would_have_narrowed`

Tools that would have been removed before the LLM saw them:

```json
{
  "would_have_narrowed": [
    { "tool": "issue_refund", "reason": "wrong_phase" },
    { "tool": "delete_record", "reason": "forbidden_in_state" },
    { "tool": "admin_reset", "reason": "no_contract" }
  ]
}
```
### `would_have_blocked`

Tool calls that would have been blocked after the LLM responded:

```json
{
  "would_have_blocked": [
    {
      "tool_name": "issue_refund",
      "reason": "precondition_not_met",
      "detail": "Required prior tool: check_eligibility"
    }
  ]
}
```
### Phase context

Where the session is in the phase machine:

```json
{
  "current_phase": "customer_identified",
  "legal_next_phases": ["eligibility_checked", "escalated"]
}
```
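Put together, the capture fields above suggest a delta shape like the following. This is a sketch inferred from the JSON examples on this page, not a published type; field names match the examples, and the `summarize` helper is hypothetical.

```typescript
// Sketch of the shadow_delta shape, inferred from the examples above.
// Field names match the JSON; the types are assumptions, not a published API.
interface ShadowDelta {
  would_have_narrowed: { tool: string; reason: string }[];
  would_have_blocked: { tool_name: string; reason: string; detail?: string }[];
  current_phase: string;
  legal_next_phases: string[];
}

// Hypothetical helper: summarize a delta for quick logging.
function summarize(delta: ShadowDelta): string {
  return (
    `${delta.would_have_blocked.length} blocked, ` +
    `${delta.would_have_narrowed.length} narrowed, ` +
    `phase=${delta.current_phase}`
  );
}

const example: ShadowDelta = {
  would_have_narrowed: [{ tool: "issue_refund", reason: "wrong_phase" }],
  would_have_blocked: [],
  current_phase: "customer_identified",
  legal_next_phases: ["eligibility_checked", "escalated"],
};
console.log(summarize(example)); // "0 blocked, 1 narrowed, phase=customer_identified"
```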
## The safe rollout path

### Step 1: Shadow mode

Deploy with `mode: "shadow"`. Monitor the dashboard for false positives.

```ts
const session = replay(client, {
  mode: "shadow",
  apiKey: process.env.VESANOR_API_KEY,
});
```
Look for:
- Tool calls that shadow says it would block — are they actually bad?
- Tools that would be narrowed — should they be available in that phase?
- Session limit projections — are limits too tight?
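A minimal triage sketch, assuming the delta shape from the JSON examples above: collect deltas from your shadow runs and count would-be blocks by reason, so the most frequent reasons can be reviewed first. The `blocksByReason` helper and the sample data are hypothetical.

```typescript
// Hypothetical triage helper: count would-be blocks by reason across
// collected shadow deltas. The delta shape is assumed from the examples
// on this page, not taken from a published type.
type Blocked = { tool_name: string; reason: string; detail?: string };
type Delta = { would_have_blocked: Blocked[] };

function blocksByReason(deltas: Delta[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const d of deltas) {
    for (const b of d.would_have_blocked) {
      counts.set(b.reason, (counts.get(b.reason) ?? 0) + 1);
    }
  }
  return counts;
}

const deltas: Delta[] = [
  { would_have_blocked: [{ tool_name: "issue_refund", reason: "precondition_not_met" }] },
  { would_have_blocked: [{ tool_name: "issue_refund", reason: "precondition_not_met" }] },
  { would_have_blocked: [] },
];
console.log(blocksByReason(deltas)); // one reason, precondition_not_met, seen twice
```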
### Step 2: Fix contracts

Adjust contracts based on shadow data:

- Too many false blocks? Relax preconditions or add phases.
- Missing blocks? Add `forbids_after` or tighten argument invariants.
- Phase too restrictive? Allow more tools in that phase.
### Step 3: Enable enforcement

Switch to `mode: "enforce"` when shadow shows zero false positives:

```ts
const session = replay(client, {
  mode: "enforce",
  gate: "reject_all",
});
```
### Step 4: Add server backing (optional)

For production with audit needs, add an API key and tool wrappers:

```ts
const session = replay(client, {
  mode: "enforce",
  apiKey: process.env.VESANOR_API_KEY,
  tools: { issue_refund: myRefundFunction },
});
```
## Counterfactual capture
Shadow mode captures the counterfactual — what was prevented and why. This is valuable for:
- Debugging — "Why would this call have been blocked?"
- Compliance — GDPR requires "meaningful information about the logic involved" in automated decisions. Counterfactual capture satisfies this directly.
- Tuning — Compare shadow results across model versions to see which model triggers more enforcement.
## Important limitation
After shadow mode allows a call that enforce mode would block, the model is on a different execution path. It received feedback it wouldn't have received in enforce mode. All subsequent shadow projections are approximations, not exact counterfactuals.
Shadow mode tells you what enforcement would do on your real traffic. It does not guarantee what enforcement will do — because enforcement changes the model's behavior.
You can access the last shadow delta programmatically:

```ts
const delta = session.getLastShadowDelta();
if (delta) {
  console.log("Would have blocked:", delta.would_have_blocked.length);
  console.log("Would have narrowed:", delta.would_have_narrowed.length);
  console.log("Current phase:", delta.current_phase);
  console.log("Legal next phases:", delta.legal_next_phases);
}
```
## Shadow coverage tracking
Shadow mode can only validate tool calls the shadow LLM actually makes. If a tool is never attempted, it's invisible to shadow analysis. Shadow coverage tracking measures these blind spots.
### What it tracks
After each shadow run, a coverage record is emitted:
- Tools available — tools in the request's tool set
- Tools observed — tools the shadow LLM actually called
- Tool pairs — which tools were called together in the same session
Over time, these accumulate into a coverage ledger per agent and model pair.
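Assuming a coverage record carries the available and observed tool names for one run, the ledger aggregation could look like the sketch below. The `CoverageRecord` shape and the `coverageLedger` helper are illustrative, not the actual implementation.

```typescript
// Sketch: fold per-run coverage records into a per-tool ledger and compute
// coverage percentages (observed runs / available runs). The record shape
// is an assumption based on the fields this page describes.
type CoverageRecord = { available: string[]; observed: string[] };
type LedgerRow = { available: number; observed: number; pct: number };

function coverageLedger(records: CoverageRecord[]): Map<string, LedgerRow> {
  const ledger = new Map<string, LedgerRow>();
  for (const rec of records) {
    for (const tool of rec.available) {
      const row = ledger.get(tool) ?? { available: 0, observed: 0, pct: 0 };
      row.available += 1;
      if (rec.observed.includes(tool)) row.observed += 1;
      ledger.set(tool, row);
    }
  }
  for (const row of ledger.values()) {
    row.pct = row.available === 0 ? 0 : (100 * row.observed) / row.available;
  }
  return ledger;
}

const runs: CoverageRecord[] = [
  { available: ["get_market_data", "cancel_order"], observed: ["get_market_data"] },
  { available: ["get_market_data", "cancel_order"], observed: ["get_market_data"] },
];
// get_market_data: observed in 2 of 2 runs (100%); cancel_order: 0 of 2 (0%)
```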
### Coverage report

```text
Shadow Coverage — agent: trading-agent, models: gpt-4o -> gpt-4o-mini
Period: 2026-03-01 to 2026-03-24 (142 runs)

Tool                 Available   Observed   Coverage
────────────────────────────────────────────────────
get_market_data           142        138      97.2%
approve_risk_check        142         89      62.7%
submit_live_order         142         11       7.7%   <- LOW
cancel_order              142          0       0.0%   <- ZERO
```
### Classification
| Coverage | Classification | Meaning |
|---|---|---|
| 0% | zero | Never observed. Complete blind spot. |
| 1-25% | low | Rarely observed. Shadow testing is thin. |
| 26-75% | partial | Sometimes observed. May miss edge cases. |
| 76-100% | good | Frequently observed. Reasonable confidence. |
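The bands in the table can be expressed as a small function. This is a sketch of the mapping only; rounding to the nearest whole percent before banding is an assumption.

```typescript
// Sketch: map a coverage percentage to the classification bands in the
// table above. Band boundaries follow the table; rounding to the nearest
// whole percent first is an assumption, not documented behavior.
type Classification = "zero" | "low" | "partial" | "good";

function classify(pct: number): Classification {
  const rounded = Math.round(pct);
  if (rounded === 0) return "zero";
  if (rounded <= 25) return "low";
  if (rounded <= 75) return "partial";
  return "good";
}

console.log(classify(0));    // "zero"
console.log(classify(7.7));  // "low"
console.log(classify(62.7)); // "partial"
console.log(classify(97.2)); // "good"
```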
### Important limitations
- Coverage measures breadth, not depth — a tool called once with trivial args counts as "observed"
- Coverage is per-session, not cross-session — it tracks what shadow runs have tested, not what's possible
- 100% coverage doesn't mean all argument combinations have been tested
### Accessing coverage data

Coverage data is available through:

- Dashboard — Shadow page includes a coverage table
- CLI — `vesanor doctor` includes shadow coverage in its health report
- API — `GET /api/dashboard/shadow/coverage?agent=<name>`
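For the API route, here is a sketch of building the documented endpoint URL. The base URL is a placeholder, and the bearer-token header in the commented fetch is an assumption about authentication.

```typescript
// Sketch: build the documented coverage endpoint URL for a given agent.
// "https://example.com" is a placeholder; substitute your Vesanor host.
function coverageUrl(baseUrl: string, agent: string): string {
  const url = new URL("/api/dashboard/shadow/coverage", baseUrl);
  url.searchParams.set("agent", agent);
  return url.toString();
}

console.log(coverageUrl("https://example.com", "trading-agent"));
// "https://example.com/api/dashboard/shadow/coverage?agent=trading-agent"

// A request against it might look like this (auth scheme is an assumption):
// const res = await fetch(coverageUrl(base, "trading-agent"), {
//   headers: { Authorization: `Bearer ${process.env.VESANOR_API_KEY}` },
// });
```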
## Checkpoint behavior in shadow mode
Checkpoints do not trigger in shadow mode. Pausing for human approval defeats the purpose of observational testing. Instead, shadow mode logs a diagnostic: "checkpoint would have triggered".
## Shadow vs log-only

| Mode | Enforcement computed? | Captures sent? | Blocks calls? |
|---|---|---|---|
| `shadow` | Yes — full pipeline | Yes | No |
| `log-only` | No — just captures | Yes | No |
Use `shadow` when you want to evaluate contracts. Use `log-only` when you only want capture/observability with no enforcement computation.
## log-only mode

`log-only` is the lightest mode. No enforcement pipeline runs — no narrowing, no validation, no shadow deltas. Calls pass through directly to the provider and captures are recorded for observability.

```ts
const session = replay(client, {
  contractsDir: "./contracts",
  agent: "my-agent",
  mode: "log-only",
  apiKey: process.env.VESANOR_API_KEY,
});
```
Use for:

- Pure observability with zero overhead
- Recovery sessions after a kill (capture what the recovery agent does)
- Migrating from `observe()` — `log-only` is equivalent to `observe()` with the `replay()` API
## Next steps
- Protection Levels — understand all three levels
- Govern Mode — enable server-backed enforcement
- Phases & Transitions — design the contracts shadow will evaluate