Workflow Governance
When your system uses multiple agents that hand off work to each other, single-session governance isn't enough. Workflow governance coordinates multiple sessions under one durable workflow_id with explicit handoffs, shared resource protection, and cross-session budget limits.
When you need workflow governance
Single-session replay() is sufficient when:
- One agent handles the entire task
- No delegation to other agents
- No shared resources between concurrent processes
Workflow governance is needed when:
- An orchestrator delegates tasks to specialist agents
- Multiple agents might act on the same resource (same order, same deployment)
- You need to kill an entire agent tree at once
- Cross-agent budgets matter (total cost across all agents)
How it works
The model
workflow.yaml                         ← defines roles, handoffs, limits, shared resources
├── orchestrator/
│   └── session.yaml + contracts/     ← root session (v1-v3 contracts)
├── code-scanner/
│   └── session.yaml + contracts/     ← child session
├── risk-analyst/
│   └── session.yaml + contracts/     ← child session
└── release-manager/
    └── session.yaml + contracts/     ← child session
Each role has its own session.yaml and per-tool contracts — standard v1-v3 enforcement. The workflow.yaml adds coordination above sessions.
Key principle: The session remains the unit of authority. Each session is individually correct. The workflow governs coordination between them — it does not merge mutable state across agents.
workflow.yaml
schema_version: "1.0"
workflow: code-review-pipeline
roles:
  - name: orchestrator
    session_contract: packs/orchestrator/session.yaml
  - name: code-scanner
    session_contract: packs/code-scanner/session.yaml
  - name: risk-analyst
    session_contract: packs/risk-analyst/session.yaml
  - name: release-manager
    session_contract: packs/release-manager/session.yaml
handoffs:
  - from: orchestrator
    to: code-scanner
  - from: orchestrator
    to: risk-analyst
  - from: code-scanner
    to: release-manager
  - from: risk-analyst
    to: release-manager
workflow_limits:
  max_sessions: 8
  max_active_sessions: 4
  max_total_steps: 100
  max_total_cost: 25.00
  max_open_handoffs: 6
shared_resources:
  - alias: change_request
    mode: single_writer
  - alias: service_env
    mode: exclusive_pending
  - alias: tenant_migration
    mode: serial_only
cancellation:
  subtree_kill: true
  workflow_kill: operator_only
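The roles and handoffs sections can be read as a small directed graph. The sketch below is illustrative and not part of the library: it models that graph in TypeScript and checks that every handoff names a declared role, and whether a given edge is listed. The `WorkflowConfig` type and both helper functions are hypothetical names, and treating unlisted edges as disallowed is an assumption rather than documented behavior.

```typescript
// Hypothetical types mirroring the roles and handoffs sections of workflow.yaml.
interface RoleDef { name: string; session_contract: string }
interface HandoffEdge { from: string; to: string }
interface WorkflowConfig { roles: RoleDef[]; handoffs: HandoffEdge[] }

// Returns role names that appear in handoffs but are missing from the roles list.
function undeclaredHandoffRoles(config: WorkflowConfig): string[] {
  const declared = config.roles.map((r) => r.name);
  const missing: string[] = [];
  for (const edge of config.handoffs) {
    for (const roleName of [edge.from, edge.to]) {
      if (declared.indexOf(roleName) === -1 && missing.indexOf(roleName) === -1) {
        missing.push(roleName);
      }
    }
  }
  return missing;
}

// Assumption: a handoff is permitted only if its edge is listed explicitly.
function isHandoffListed(config: WorkflowConfig, from: string, to: string): boolean {
  return config.handoffs.some((e) => e.from === from && e.to === to);
}

// The code-review-pipeline example from this page, expressed as data.
const pipeline: WorkflowConfig = {
  roles: [
    { name: "orchestrator", session_contract: "packs/orchestrator/session.yaml" },
    { name: "code-scanner", session_contract: "packs/code-scanner/session.yaml" },
    { name: "risk-analyst", session_contract: "packs/risk-analyst/session.yaml" },
    { name: "release-manager", session_contract: "packs/release-manager/session.yaml" },
  ],
  handoffs: [
    { from: "orchestrator", to: "code-scanner" },
    { from: "orchestrator", to: "risk-analyst" },
    { from: "code-scanner", to: "release-manager" },
    { from: "risk-analyst", to: "release-manager" },
  ],
};
```

A check like this can run in CI before a workflow file is deployed, so a typo in a role name surfaces early instead of at handoff time.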
Roles and handoffs
Creating a workflow (root session)
The orchestrator creates the workflow by starting a root session:
const session = replay(client, {
  contractsDir: "packs/orchestrator/contracts",
  agent: "orchestrator",
  mode: "enforce",
  apiKey: process.env.VESANOR_API_KEY,
  workflow: {
    type: "root",
    role: "orchestrator",
    // workflowId auto-generated if omitted
  },
});
Offering a handoff
After completing some work, the orchestrator offers a handoff to a child role:
const ticket = await session.handoff({
  toRole: "code-scanner",
  handoffId: "handoff-pr42-scan",
  summary: { task: "Review PR #42", priority: "high" },
});
// ticket.handoffId: use this to attach the child
Child session claims the handoff
A new agent process creates a child session that claims the handoff:
const childSession = replay(childClient, {
  contractsDir: "packs/code-scanner/contracts",
  agent: "code-scanner",
  mode: "enforce",
  apiKey: process.env.VESANOR_API_KEY,
  workflow: {
    type: "child",
    workflowId: ticket.workflowId,
    role: "code-scanner",
    parentSessionId: ticket.parentSessionId,
    handoffId: ticket.handoffId,
  },
});
Single-claim semantics: Once one child claims a handoff, competing claims fail, so two workers never execute the same handoff in parallel by accident.
Handoff lifecycle
offered → claimed → in_progress → completed
| Status | Meaning |
|---|---|
| offered | Parent offered the handoff, waiting for a child to claim it |
| claimed | A child session attached and took ownership |
| in_progress | Child produced its first authoritative committed step |
| completed | Child session finished and all conditions met |
Reclaim
If a child claims a handoff but doesn't make progress (idle or stuck), the handoff can be reclaimed:
offered → claimed → [no progress] → offered (generation bumped)
Reclaim fails after progress. Once the child has made authoritative progress (in_progress), the handoff can't be reclaimed — the child owns it.
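The lifecycle and reclaim rules above can be sketched as a small state machine. This is a minimal in-memory model written for illustration, not the server's implementation: the `Handoff` shape and the function names are hypothetical, and only the transitions described on this page are encoded.

```typescript
type HandoffStatus = "offered" | "claimed" | "in_progress" | "completed";

// Hypothetical record; the real handoff object may carry more fields.
interface Handoff {
  handoffId: string;
  status: HandoffStatus;
  generation: number;        // bumped each time the handoff is reclaimed
  claimedBy: string | null;  // session id of the current owner, if any
}

function offerHandoff(handoffId: string): Handoff {
  return { handoffId, status: "offered", generation: 1, claimedBy: null };
}

// Single-claim semantics: only an offered handoff can be claimed.
function claimHandoff(h: Handoff, sessionId: string): Handoff {
  if (h.status !== "offered") {
    throw new Error(`handoff ${h.handoffId} is already ${h.status}`);
  }
  return { ...h, status: "claimed", claimedBy: sessionId };
}

// The child's first authoritative committed step moves claimed -> in_progress.
function recordFirstStep(h: Handoff): Handoff {
  if (h.status !== "claimed") {
    throw new Error(`cannot record progress from ${h.status}`);
  }
  return { ...h, status: "in_progress" };
}

// Reclaim is only possible before progress; it re-offers with a new generation.
function reclaimHandoff(h: Handoff): Handoff {
  if (h.status !== "claimed") {
    throw new Error(`cannot reclaim a handoff that is ${h.status}`);
  }
  return { ...h, status: "offered", generation: h.generation + 1, claimedBy: null };
}
```

Returning a new object on every transition keeps each state immutable, which makes the generation bump easy to reason about when comparing an old worker's view with the current one.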
Shared resources
Shared resources prevent conflicts when multiple sessions act on the same entity (same order, same deployment environment, same database migration).
exclusive_pending
While one session has an unresolved pending step on a resource, no other session can open a conflicting step on the same resource.
shared_resources:
  - alias: service_env
    mode: exclusive_pending
Example: Release manager starts deploying to staging. While that deployment is pending, no other session can start a deployment to the same staging environment. Once the deployment resolves (succeeds, fails, or is discarded), the lock is released automatically.
single_writer
At most one session can be the mutating owner of a resource value at a time.
shared_resources:
  - alias: change_request
    mode: single_writer
Example: Two release managers try to stage the same change request. The first proposal succeeds. The second is rejected with WORKFLOW_RESOURCE_CONFLICT — only one session can write to that change request.
serial_only
Multiple sessions may act on a resource, but only one authoritative step can commit at a time. The constraint is re-checked at commit time under concurrency control.
shared_resources:
  - alias: tenant_migration
    mode: serial_only
Example: Two database migrators work on the same migration. Both can plan and prepare. But only one can commit at a time — the second must wait until the first's step is fully resolved.
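To compare the three modes side by side, the sketch below reduces each one to a single predicate over a hypothetical per-resource state. The `ResourceState` fields, the `StepPhase` type, and `canProceed` are names invented for this illustration; the real pipeline tracks more than this and performs the check under concurrency control.

```typescript
type ResourceMode = "exclusive_pending" | "single_writer" | "serial_only";
type StepPhase = "propose" | "commit";

// Hypothetical per-resource bookkeeping, one field per mode.
interface ResourceState {
  owner: string | null;         // single_writer: the mutating owner
  pendingBy: string | null;     // exclusive_pending: session with an unresolved step
  committingBy: string | null;  // serial_only: session whose commit is not yet resolved
}

// Returns true if the session may proceed with the given phase on this resource.
function canProceed(
  mode: ResourceMode,
  state: ResourceState,
  sessionId: string,
  phase: StepPhase
): boolean {
  if (mode === "single_writer") {
    return state.owner === null || state.owner === sessionId;
  }
  if (mode === "exclusive_pending") {
    return state.pendingBy === null || state.pendingBy === sessionId;
  }
  // serial_only: planning is unrestricted, only commits are serialized.
  if (phase === "propose") {
    return true;
  }
  return state.committingBy === null || state.committingBy === sessionId;
}
```

The practical difference is where the conflict surfaces: `single_writer` and `exclusive_pending` reject a second session when it proposes, while `serial_only` lets it prepare and only holds it back at commit.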
How resources are matched
Resources are matched by alias + normalized value. The value is extracted from tool call arguments using resource definitions in the session contract:
# In session.yaml for release-manager role
resources:
  change_request:
    type: change_request
    extract_from:
      path: "$.change_id"
  service_env:
    type: environment
    extract_from:
      path: "$.environment"
When a tool call proposes { change_id: "CHG-123", environment: "staging" }, the pipeline extracts change_request=CHG-123 and service_env=staging, then checks for conflicts with other sessions in the workflow.
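The extraction step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: it supports only simple dotted paths such as `$.change_id`, and it treats "normalized" as whitespace-trimmed, which is a guess since this page does not define the normalization. `readPath`, `extractResources`, and `resourceKey` are hypothetical helper names.

```typescript
// Mirrors the resources section of session.yaml.
interface ResourceDef { type: string; extract_from: { path: string } }

// Reads a simple "$.a.b" path from tool call arguments (a tiny JSONPath subset).
function readPath(args: unknown, path: string): unknown {
  if (path.indexOf("$.") !== 0) {
    throw new Error(`unsupported path: ${path}`);
  }
  let current: unknown = args;
  for (const key of path.slice(2).split(".")) {
    if (typeof current !== "object" || current === null) {
      return undefined;
    }
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Returns alias -> normalized value for every resource found in the arguments.
function extractResources(
  defs: Record<string, ResourceDef>,
  args: unknown
): Record<string, string> {
  const found: Record<string, string> = {};
  for (const alias of Object.keys(defs)) {
    const value = readPath(args, defs[alias].extract_from.path);
    // Assumption: normalization here is just trimming whitespace.
    if (typeof value === "string" && value.trim() !== "") {
      found[alias] = value.trim();
    }
  }
  return found;
}

// Conflict checks compare alias + normalized value.
function resourceKey(alias: string, value: string): string {
  return `${alias}=${value}`;
}

const releaseManagerResources: Record<string, ResourceDef> = {
  change_request: { type: "change_request", extract_from: { path: "$.change_id" } },
  service_env: { type: "environment", extract_from: { path: "$.environment" } },
};
```

Calling `extractResources(releaseManagerResources, { change_id: "CHG-123", environment: "staging" })` yields the two bindings described above, and arguments that lack a field simply produce no binding for that alias.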
Workflow limits
Budgets that apply across all sessions in the workflow:
workflow_limits:
  max_sessions: 8            # Total sessions ever created
  max_active_sessions: 4     # Concurrent active sessions
  max_total_steps: 100       # Steps across all sessions
  max_total_cost: 25.00      # Cost across all sessions
  max_open_handoffs: 6       # Unresolved handoffs at any time
When a limit is exceeded, preflight is blocked with WORKFLOW_BUDGET_EXCEEDED — before the LLM call.
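A budget check of this kind might look like the sketch below. The `WorkflowUsage` counters and the `checkWorkflowBudget` function are hypothetical, and the sketch assumes a limit counts as exceeded when usage is strictly greater than the configured value; the exact comparison the server uses is not stated on this page.

```typescript
interface WorkflowLimits {
  max_sessions: number;
  max_active_sessions: number;
  max_total_steps: number;
  max_total_cost: number;
  max_open_handoffs: number;
}

// Hypothetical running totals aggregated across every session in the workflow.
interface WorkflowUsage {
  sessions: number;
  activeSessions: number;
  totalSteps: number;
  totalCost: number;
  openHandoffs: number;
}

type PreflightResult =
  | { ok: true }
  | { ok: false; code: "WORKFLOW_BUDGET_EXCEEDED"; limit: keyof WorkflowLimits };

// Runs before the LLM call; reports the first limit that usage has exceeded.
function checkWorkflowBudget(limits: WorkflowLimits, usage: WorkflowUsage): PreflightResult {
  const checks: Array<[keyof WorkflowLimits, number]> = [
    ["max_sessions", usage.sessions],
    ["max_active_sessions", usage.activeSessions],
    ["max_total_steps", usage.totalSteps],
    ["max_total_cost", usage.totalCost],
    ["max_open_handoffs", usage.openHandoffs],
  ];
  for (const [limitName, used] of checks) {
    // Assumption: "exceeded" means strictly greater than the limit.
    if (used > limits[limitName]) {
      return { ok: false, code: "WORKFLOW_BUDGET_EXCEEDED", limit: limitName };
    }
  }
  return { ok: true };
}
```

Reporting which limit tripped, not just the error code, makes it easier for an operator to decide whether to raise the budget or stop the workflow.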
Kill cascade
Session kill
Standard v1-v3 behavior — kills one session only.
Subtree kill
Kills a session and all its descendants:
orchestrator → code-scanner → release-manager
↑ killed (and all below)
Workflow kill
Kills every active session in the workflow. Rejects all future handoff claims.
Kill is durable. It's a control-plane event, not a best-effort signal. Killed sessions can't advance authoritative state. Future preflight/proposal/receipt requests are rejected. Workflow resource claims in the killed scope are released.
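The difference between the kill scopes comes down to which sessions are marked. The sketch below models a subtree kill as a breadth-first walk over parent links; `SessionNode`, `subtreeKill`, and `canAdvance` are hypothetical names for this illustration, and the real control plane persists the kill instead of mutating an in-memory list.

```typescript
// Hypothetical session record; parentId is null for the root session.
interface SessionNode {
  id: string;
  parentId: string | null;
  killed: boolean;
}

// Marks the given session and all of its descendants as killed.
// Returns the ids that were newly killed, in breadth-first order.
function subtreeKill(sessions: SessionNode[], rootId: string): string[] {
  const killedIds: string[] = [];
  const visited: string[] = [];
  const queue: string[] = [rootId];
  while (queue.length > 0) {
    const id = queue.shift() as string;
    if (visited.indexOf(id) !== -1) {
      continue; // guards against malformed parent links
    }
    visited.push(id);
    for (const s of sessions) {
      if (s.id === id && !s.killed) {
        s.killed = true;
        killedIds.push(id);
      }
      if (s.parentId === id) {
        queue.push(s.id);
      }
    }
  }
  return killedIds;
}

// Killed sessions can no longer advance authoritative state.
function canAdvance(session: SessionNode): boolean {
  return !session.killed;
}
```

In this model a session kill is the same walk without following children, and a workflow kill is a subtree kill that starts at the root session.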
What child sessions inherit (and don't)
A child session starts fresh:
- Empty session state (no steps, no forbidden tools)
- Its own phase machine (from its own session.yaml)
- Its own contracts
A child session does not inherit:
- Parent's phase or step history
- Parent's forbidden tools
- Parent's loop counters or cost budget
- Parent's session limits
Dependencies between parent and child are explicit — through handoff summaries, artifact references, or workflow-scoped resource bindings. Nothing is inherited implicitly.
Stale worker detection
If a workflow branch is reassigned (handoff reclaimed and re-offered), the original worker's session becomes stale. Any attempt by the stale worker to prepare requests, register resource claims, or commit steps fails with a stale-worker error.
This is tracked through generation numbers that increment on reassignment. Stale workers have an older generation than the current workflow state.
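The generation comparison can be sketched in a few lines. The types and the `STALE_WORKER` code below are hypothetical, since this page names the failure only as a stale-worker error; the point is that a worker carries the generation it claimed under, and every guarded operation compares it with the current one.

```typescript
// Current state of a workflow branch, as the server would see it.
interface BranchState {
  handoffId: string;
  generation: number; // increments each time the handoff is reclaimed and re-offered
}

// What a worker remembers from the moment it claimed the handoff.
interface WorkerToken {
  handoffId: string;
  generation: number;
}

type WorkerCheck = { ok: true } | { ok: false; reason: "STALE_WORKER" };

// Applied before preparing requests, registering resource claims, or committing steps.
function checkWorker(branch: BranchState, token: WorkerToken): WorkerCheck {
  if (token.handoffId !== branch.handoffId) {
    throw new Error("token does not belong to this branch");
  }
  if (token.generation < branch.generation) {
    return { ok: false, reason: "STALE_WORKER" };
  }
  return { ok: true };
}
```

Because the check is a plain integer comparison, a worker that was paused or partitioned during reassignment is rejected the first time it tries to act, without any extra coordination.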
Next steps
- Govern Mode — server-backed single-session enforcement
- Kill Switch — emergency stop semantics
- Security & Evidence — what gets captured across workflows