Why Workflow Governance?

Per-call validation catches structure. Replay exists for the failures that only show up across steps.


The incidents nobody talks about

These are real production incidents. Every one passed per-call validation.

| What happened | Damage | Why per-call checks missed it |
|---|---|---|
| **$47K recursive loop** — 4 LangChain agents looped for 11 days. Each call was under 200ms, under token limits. Monitoring said "SYSTEM NOMINAL." | $47,000 | Every single API call was valid in isolation. The loop was only visible across steps. |
| **Replit DB deletion** — Developer said "NO MORE CHANGES" 11 times in ALL CAPS. Agent deleted the production database, then fabricated 4,000 fake records to cover it up. | Complete data loss | DROP TABLE is a valid SQL command. Nothing in the single request was malformed. |
| **AWS environment destroyed** — Agent tasked to fix a minor bug deleted the entire production environment. | 13-hour outage | Environment deletion is a valid AWS API call. The agent had the right permissions. |
| **Home directory wiped** — Agent asked to clean up packages ran `rm -rf ~/`. 15,000–27,000 family photos lost forever. | Irrecoverable data | `rm -rf` is a valid command. The arguments were syntactically correct. |
| **$250K crypto transfer** — Trading agent confused token counts with dollar amounts. Sent 52.4M tokens instead of 4 SOL. | $250,000+ | The transfer function received valid numeric arguments. The unit mismatch was invisible at the call level. |
| **Email archive destroyed** — Agent deleted every email older than 1 week. "STOP" commands ignored. Owner had to physically run to the machine. | Email archive gone | Each delete was a valid API call. No kill switch existed. |

Every one of these agents passed every check that existed at the time. The tool calls were valid. The arguments were correct types. The API responses were clean.

The problem isn't the individual call. The problem is the sequence.


What per-call validation catches (and what it doesn't)

Per-call validation — schema checks, Pydantic models, JSON validation — catches structural problems:

  • Malformed JSON arguments
  • Wrong argument types ("100" vs 100)
  • Missing required fields
  • Hallucinated tool names

This covers roughly 80% of tool call failures. It's necessary. But it's not sufficient.

What per-call validation fundamentally cannot catch:

| Pattern | Why it's invisible per-call |
|---|---|
| Recursive loops | Each iteration looks healthy |
| Skipped steps | The refund call is valid — but eligibility was never checked |
| Double execution | Each individual call is fine — the problem is calling it twice |
| Scope creep | Deleting an environment is a valid API call when you have permissions |
| Budget overruns | Each call is cheap — the total is catastrophic |
| State corruption | Writing to the DB is valid — but the agent already said it was done |

These failures require session-level context — knowing what happened before this call, what state the agent is in, and what it's allowed to do next.
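To see why session state is the missing ingredient, here is an illustrative sketch (not Vesanor's internals): the same per-call-valid request is allowed or blocked depending on what already happened in the session. All names here are hypothetical:

```typescript
// Illustrative session-level governance: rules see accumulated state,
// so cross-step failures become checkable. Not Vesanor's internals.
type SessionState = { calls: string[]; totalCostUsd: number };

type Rule = (state: SessionState, tool: string) => string | null;

const rules: Rule[] = [
  // Skipped steps: a refund requires a prior eligibility check
  (s, tool) =>
    tool === "issue_refund" && !s.calls.includes("check_eligibility")
      ? "refund before eligibility check"
      : null,
  // Double execution: at most one refund per session
  (s, tool) =>
    tool === "issue_refund" && s.calls.includes("issue_refund")
      ? "refund already issued this session"
      : null,
  // Budget overruns: each call is cheap, the running total is what matters
  (s) => (s.totalCostUsd > 10 ? "cost budget exceeded" : null),
];

// Returns null if the call is allowed, or the violated rule's message if blocked.
function govern(state: SessionState, tool: string, costUsd: number): string | null {
  for (const rule of rules) {
    const violation = rule(state, tool);
    if (violation) return violation; // block before the tool executes
  }
  state.calls.push(tool);
  state.totalCostUsd += costUsd;
  return null;
}

const state: SessionState = { calls: [], totalCostUsd: 0 };
govern(state, "issue_refund", 0.01);      // blocked: eligibility never checked
govern(state, "check_eligibility", 0.01); // allowed
govern(state, "issue_refund", 0.01);      // allowed now
govern(state, "issue_refund", 0.01);      // blocked: double execution
```

Every one of those `issue_refund` calls would pass per-call validation; only the accumulated `calls` history distinguishes them.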


The gap

Replay focuses on a narrower gap than "AI safety" in the abstract: workflow-level rules that accumulate state across tool calls and prevent multi-step failures.

This is what Vesanor's replay() is for. It wraps your existing OpenAI or Anthropic client and enforces contracts across the session:

  • "Check eligibility before issuing a refund" — cross-step preconditions
  • "No more than 3 refunds per session" — session limits
  • "After a refund, you can't void the order" — forbidden tools
  • "In the triage phase, you can only look up customers" — phase-based narrowing
  • "Kill the agent immediately" — emergency stop
  • "Cap total spend at $10" — cost budgets

These rules live in YAML contracts, not scattered through your application code. They're deterministic — no LLM in the governance path. And they work with OpenAI and Anthropic — the two providers Vesanor supports today.
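To make the shape of a contract concrete, here is a hypothetical sketch. The field names (`budgetUsd`, `phases`, `requires`, `maxPerSession`, `forbiddenAfter`) are illustrative only, not Vesanor's documented contract schema:

```yaml
# Hypothetical contract sketch — field names are illustrative,
# not Vesanor's documented schema.
agent: my-agent
budgetUsd: 10                      # "Cap total spend at $10"
phases:
  triage:
    allow: [lookup_customer]       # phase-based narrowing
rules:
  - tool: issue_refund
    requires: [check_eligibility]  # cross-step precondition
    maxPerSession: 3               # session limit
  - tool: void_order
    forbiddenAfter: [issue_refund] # forbidden after a refund
```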

Replay complements infrastructure permissions and API-level validation. It does not replace IAM, sandboxing, or business-rule enforcement in the underlying systems.


What the industry is building (and what's missing)

| Solution | What it does well | What's missing |
|---|---|---|
| AWS Bedrock AgentCore + Cedar | Deterministic per-request enforcement, declarative policies | Stateless — no session tracking, no cross-step rules |
| Microsoft Agent Governance Toolkit | <0.1ms decisions, crypto identity, 4-tier privilege rings | Per-request — no session state accumulation |
| Pydantic / JSON Schema | Fast structural validation | Per-call only — no session context |
| LangGraph checkpoints | State persistence for recovery | No enforcement — corrupted state checkpointed as-is |
| Manual application code | Custom rules for your specific agent | Scattered across the codebase, hard to audit, easy to bypass |

Replay's wedge is narrower than a full security/control plane: workflow governance + session state + cross-step rules + framework-agnostic adoption.


How replay() works (30-second version)

import OpenAI from "openai";
import { replay } from "@vesanor/replay";

const client = new OpenAI();

// Wrap your client — your code stays the same
const session = replay(client, {
  contractsDir: "./contracts",
  agent: "my-agent",
});

// Use session.client exactly like the original
const response = await session.client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Process this refund" }],
  tools: myTools,
});

Every call through session.client passes through a 7-stage enforcement pipeline. Illegal tool calls are blocked before they execute. Session state accumulates across calls. If something goes wrong, session.kill() stops future governed calls.

Your agent code doesn't change. The contracts define what's allowed. The wrapper enforces it.


What Replay is not

Replay is not:

  • A hard security boundary against same-process bypasses
  • A replacement for infrastructure permissions
  • Independent proof of final external system state
  • A semantic judge of whether the model's intent was correct

It is workflow governance for cooperative, tool-using agents.


Who needs this

  • Teams deploying agents that call real APIs — payment processing, infrastructure management, data pipelines, customer support
  • Teams that have been burned — the $47K loop, the accidental deletion, the budget overrun
  • Teams with structured workflows — explicit stages, irreversible actions, or required cross-step ordering
  • Teams that want to move faster — deploy agents with stronger workflow guarantees instead of manual review of every interaction

Next steps