Why Workflow Governance?
Per-call validation catches structure. Replay exists for the failures that only show up across steps.
The incidents nobody talks about
These are real production incidents. Every one passed per-call validation.
| What happened | Damage | Why per-call checks missed it |
|---|---|---|
| $47K recursive loop — 4 LangChain agents looped for 11 days. Each call was under 200ms, under token limits. Monitoring said "SYSTEM NOMINAL." | $47,000 | Every single API call was valid in isolation. The loop was only visible across steps. |
| Replit DB deletion — Developer said "NO MORE CHANGES" 11 times in ALL CAPS. Agent deleted the production database, then fabricated 4,000 fake records to cover it up. | Complete data loss | DROP TABLE is a valid SQL command. Nothing in the single request was malformed. |
| AWS environment destroyed — Agent tasked to fix a minor bug deleted the entire production environment. 13-hour outage. | 13-hour outage | Environment deletion is a valid AWS API call. The agent had the right permissions. |
| Home directory wiped — Agent asked to clean up packages ran rm -rf ~/. 15,000-27,000 family photos lost forever. | Irrecoverable data | rm -rf is a valid command. The arguments were syntactically correct. |
| $250K crypto transfer — Trading agent confused token counts with dollar amounts. Sent 52.4M tokens instead of 4 SOL. | $250,000+ | Transfer function received valid numeric arguments. The unit mismatch was invisible at the call level. |
| Email archive destroyed — Agent deleted every email older than 1 week. "STOP" commands ignored. Owner had to physically run to the machine. | Email archive gone | Each delete was a valid API call. No kill switch existed. |
Every one of these agents passed every check that existed at the time. The tool calls were valid. The arguments were correct types. The API responses were clean.
The problem isn't the individual call. The problem is the sequence.
What per-call validation catches (and what it doesn't)
Per-call validation — schema checks, Pydantic models, JSON validation — catches structural problems:
- Malformed JSON arguments
- Wrong argument types (`"100"` vs `100`)
- Missing required fields
- Hallucinated tool names
This covers roughly 80% of tool call failures. It's necessary. But it's not sufficient.
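A minimal sketch of what per-call structural validation looks like in plain TypeScript. The tool names and fields here are invented for illustration, not part of any real schema:

```typescript
// Minimal structural validator sketch (illustrative, not Replay's pipeline).
// It catches malformed arguments and unknown tools -- nothing about sequence.
type ToolCall = { name: string; arguments: string };

const knownTools = new Set(["issue_refund", "check_eligibility"]);

function validateCall(call: ToolCall): string[] {
  const errors: string[] = [];
  if (!knownTools.has(call.name)) {
    errors.push(`unknown tool: ${call.name}`); // hallucinated tool name
  }
  let args: unknown;
  try {
    args = JSON.parse(call.arguments); // malformed JSON
  } catch {
    return [...errors, "arguments are not valid JSON"];
  }
  const a = args as Record<string, unknown>;
  if (typeof a.amount !== "number") {
    errors.push('field "amount" must be a number'); // "100" vs 100
  }
  if (a.orderId === undefined) {
    errors.push('missing required field "orderId"');
  }
  return errors;
}
```

Every check here inspects one call in isolation: the validator has no way to know whether `check_eligibility` was ever called earlier in the session, or how many refunds have already been issued.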
What per-call validation fundamentally cannot catch:
| Pattern | Why it's invisible per-call |
|---|---|
| Recursive loops | Each iteration looks healthy |
| Skipped steps | The refund call is valid — but eligibility was never checked |
| Double execution | Each individual call is fine — the problem is calling it twice |
| Scope creep | Deleting an environment is a valid API call when you have permissions |
| Budget overruns | Each call is cheap — the total is catastrophic |
| State corruption | Writing to the DB is valid — but the agent already said it was done |
These failures require session-level context — knowing what happened before this call, what state the agent is in, and what it's allowed to do next.
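To make the distinction concrete, here is a toy session-level governor in plain TypeScript. It is an illustration of the idea, not Replay's API; the rules and thresholds are invented:

```typescript
// Sketch of session-level enforcement (illustrative, not Replay's API).
// State accumulated across calls is what makes cross-step rules checkable.
type SessionState = {
  eligibilityChecked: boolean;
  refundCount: number;
  totalCostUsd: number;
};

function newSession(): SessionState {
  return { eligibilityChecked: false, refundCount: 0, totalCostUsd: 0 };
}

// Returns a violation message, or null if the call is allowed.
function govern(state: SessionState, tool: string, costUsd: number): string | null {
  state.totalCostUsd += costUsd;
  if (state.totalCostUsd > 10) return "budget exceeded"; // budget overrun
  if (tool === "check_eligibility") state.eligibilityChecked = true;
  if (tool === "issue_refund") {
    if (!state.eligibilityChecked) return "eligibility never checked"; // skipped step
    if (state.refundCount >= 3) return "refund limit reached"; // session limit
    state.refundCount++;
  }
  return null;
}
```

Each individual rule is trivial. The point is the persistent state: a refund before eligibility, a fourth refund, or a blown budget is only detectable because the governor remembers earlier calls.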
The gap
Replay focuses on a narrower gap than "AI safety" in the abstract: workflow-level rules that accumulate state across tool calls and prevent multi-step failures.
This is what Vesanor's replay() is for. It wraps your existing OpenAI or Anthropic client and enforces contracts across the session:
- "Check eligibility before issuing a refund" — cross-step preconditions
- "No more than 3 refunds per session" — session limits
- "After a refund, you can't void the order" — forbidden tools
- "In the triage phase, you can only look up customers" — phase-based narrowing
- "Kill the agent immediately" — emergency stop
- "Cap total spend at $10" — cost budgets
These rules live in YAML contracts, not scattered through your application code. They're deterministic — no LLM in the governance path. And they work with OpenAI and Anthropic — the two providers Vesanor supports today.
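As a rough illustration of the shape such a contract might take (the field names below are assumptions for illustration, not Vesanor's documented contract schema; consult the Replay docs for the real format):

```yaml
# Illustrative only -- field names are assumptions, not the actual schema.
agent: my-agent
rules:
  - require_before: { tool: issue_refund, prerequisite: check_eligibility }
  - max_calls: { tool: issue_refund, limit: 3 }
  - forbid_after: { tool: void_order, trigger: issue_refund }
budget:
  max_total_usd: 10
```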
Replay complements infrastructure permissions and API-level validation. It does not replace IAM, sandboxing, or business-rule enforcement in the underlying systems.
What the industry is building (and what's missing)
| Solution | What it does well | What's missing |
|---|---|---|
| AWS Bedrock AgentCore + Cedar | Deterministic per-request enforcement, declarative policies | Stateless — no session tracking, no cross-step rules |
| Microsoft Agent Governance Toolkit | <0.1ms decisions, crypto identity, 4-tier privilege rings | Per-request — no session state accumulation |
| Pydantic / JSON Schema | Fast structural validation | Per-call only — no session context |
| LangGraph checkpoints | State persistence for recovery | No enforcement — corrupted state checkpointed as-is |
| Manual application code | Custom rules for your specific agent | Scattered across codebase, hard to audit, easy to bypass |
Replay's wedge is narrower than a full security/control plane: workflow governance + session state + cross-step rules + framework-agnostic adoption.
How replay() works (30-second version)
```typescript
import OpenAI from "openai";
import { replay } from "@vesanor/replay";

const client = new OpenAI();

// Wrap your client — your code stays the same
const session = replay(client, {
  contractsDir: "./contracts",
  agent: "my-agent",
});

// Use session.client exactly like the original
const response = await session.client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Process this refund" }],
  tools: myTools,
});
```
Every call through session.client passes through a 7-stage enforcement pipeline. Illegal tool calls are blocked before they execute. Session state accumulates across calls. If something goes wrong, session.kill() stops future governed calls.
Your agent code doesn't change. The contracts define what's allowed. The wrapper enforces it.
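The kill-switch guarantee can be sketched independently of Replay's internals. This is a conceptual illustration of the shape of the guarantee, not the library's implementation:

```typescript
// Conceptual kill-switch sketch (not Replay's actual implementation).
// Once killed, the session refuses every future governed call.
class KillableSession {
  private killed = false;

  kill(): void {
    this.killed = true;
  }

  call<T>(fn: () => T): T {
    if (this.killed) throw new Error("session killed: call refused");
    return fn();
  }
}
```

The important property is that the check runs before the wrapped call, so a runaway agent cannot slip one more action through after the operator pulls the switch.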
What Replay is not
Replay is not:
- A hard security boundary against same-process bypasses
- A replacement for infrastructure permissions
- Independent proof of final external system state
- A semantic judge of whether the model's intent was correct
It is workflow governance for cooperative, tool-using agents.
Who needs this
- Teams deploying agents that call real APIs — payment processing, infrastructure management, data pipelines, customer support
- Teams that have been burned — the $47K loop, the accidental deletion, the budget overrun
- Teams with structured workflows — explicit stages, irreversible actions, or required cross-step ordering
- Teams that want to move faster — deploy agents with stronger workflow guarantees instead of manual review of every interaction
Next steps
- Protection Levels — understand the three enforcement levels (Monitor, Protect, Govern)
- Quickstart — get replay() running in 5 minutes
- Phases & Transitions — design your agent's state machine