Production guide

Operational recipes for shipping Shroud — retry policy, idempotency, rate-limit handling, and a code-by-code error map. Use this page alongside the Threat model — Production checklist, which covers the confidentiality posture you also need before declaring a deployment production-ready.

Retry & backoff

Shroud distinguishes transient failures (worth retrying with backoff) from permanent failures (signalling a bug in the caller or a state issue). Retry rules:

Status	When	Retry?
`200`, `202`	Success / accepted-with-required-followup	n/a
`4xx` (other than `429`)	Caller bug — request is wrong	No. Fix the caller, don't retry.
`429` (Rate limit)	RPS budget tripped	Yes, after `Retry-After` seconds.
`429` (CU limit)	Plan/key CU budget exhausted	Yes, after `Retry-After` seconds — but consider upgrading the plan first.
`500 SHROUD_INTERNAL_ERROR`	Server-side bug	Once with backoff. If persistent, file a bug.
`500 SHROUD_TOOL_EXECUTION_ERROR`	Upstream tool/model failure	Yes, with backoff.
`503 SHROUD_SERVICE_UNAVAILABLE`	Upstream briefly unavailable	Yes, after `Retry-After` (typically `5`) seconds.
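The table above can be turned into a small decision helper. This is a minimal sketch, not part of any Shroud SDK: the function name `retry_decision`, its return labels, and the status-only fallback for responses without a recognised code are all assumptions for illustration.

```python
from typing import Optional

# Codes the table marks as retryable. "once" means a single backoff retry.
RETRYABLE = {
    "SHROUD_RATE_LIMITED": "retry",
    "SHROUD_CU_LIMIT_EXCEEDED": "retry",
    "SHROUD_TOOL_EXECUTION_ERROR": "retry",
    "SHROUD_SERVICE_UNAVAILABLE": "retry",
    "SHROUD_INTERNAL_ERROR": "once",
}

def retry_decision(status: int, error_code: Optional[str] = None) -> str:
    """Return "done", "retry", "once", or "never" for a response."""
    if status in (200, 202):
        return "done"
    if error_code in RETRYABLE:
        return RETRYABLE[error_code]
    # Assumed fallback when the envelope carried no recognised code.
    if status in (429, 503):
        return "retry"
    if status == 500:
        return "once"   # conservative: treat an unlabelled 500 as internal
    return "never"      # remaining 4xx responses are caller bugs
```

Keeping the decision separate from the sleep logic makes it easy to unit-test the policy without touching the network.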

Recommended backoff schedule for retryable failures: exponential with jitter, capped. A reasonable default is

attempt_delay = min(max, base * 2^attempt) + random(0, jitter)
base    = 0.5 s
jitter  = 0.5 s
max     = 60 s
attempts = 5

When the response carries Retry-After, always honour it as the floor — your computed delay should be max(retry_after, computed).
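A minimal sketch of the schedule and the Retry-After floor, assuming attempts are numbered from 0. The function name `backoff_delay` and the optional `rng` parameter (useful for deterministic tests) are illustrative, not part of Shroud.

```python
import random
from typing import Optional

BASE_S = 0.5        # base delay in seconds
JITTER_S = 0.5      # upper bound of the random jitter
MAX_S = 60.0        # cap on the exponential term
MAX_ATTEMPTS = 5    # give up after this many attempts

def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    source = rng if rng is not None else random
    computed = min(MAX_S, BASE_S * 2 ** attempt) + source.uniform(0, JITTER_S)
    if retry_after is None:
        return computed
    # Retry-After is a floor: never wait less than the server asked for.
    return max(retry_after, computed)
```

With these defaults the exponential term runs 0.5, 1, 2, 4, 8 seconds over five attempts, so the 60 s cap only matters if you raise `MAX_ATTEMPTS`.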

Don't retry these

400 SHROUD_INVALID_PARAMS/SHROUD_SCHEMA_VIOLATION — caller request shape is wrong.
401 SHROUD_UNAUTHORIZED — key invalid, expired, or wrong environment. Surface to the user; don't loop.
403 SHROUD_PERMISSION_DENIED — the key lacks the capability for this tool. Provision a different key or change the request.
404 SHROUD_TOOL_NOT_FOUND/SHROUD_NOT_FOUND — the resource doesn't exist; retrying will not summon it.
409 SHROUD_CONFLICT — request conflicts with current state. Read state, then re-attempt with corrected request.
410 SHROUD_CONFIRMATION_EXPIRED — confirmation token expired. Get a fresh one (re-issue the original request without the expired token) instead of replaying.
422 SHROUD_IDEMPOTENCY_KEY_MISMATCH — same key reused with a different body. Generate a new idempotency key.

Idempotency

Side-effecting POST/PUT routes accept a Stripe-style Idempotency-Key header. The middleware caches the (api_key_id, key, body-hash) tuple for 24 hours and replays the exact (status, body, content-type) for any subsequent request that matches.

POST /v1/some-side-effect
Authorization: Bearer shroud_prod_...
Idempotency-Key: 8f5b8c2d-2c5a-4f1f-9c11-e6f0a82c2c64
Content-Type: application/json

{ ... }

Replays carry an Idempotent-Replayed: true response header so the client knows the call wasn't re-executed.
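On the client side, the two pieces that matter are reusing one key across retries of the same request and checking the replay header on the response. The helpers below are a hedged sketch: `idempotent_headers` and `was_replayed` are hypothetical names, and the case-insensitive comparison of the header value is an assumption.

```python
import uuid
from typing import Mapping

def idempotent_headers(api_key: str, key: str = "") -> dict:
    """Build request headers; pass the same `key` again when retrying."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Idempotency-Key": key or str(uuid.uuid4()),
        "Content-Type": "application/json",
    }

def was_replayed(response_headers: Mapping[str, str]) -> bool:
    """True when the gateway returned a cached response instead of re-executing."""
    lowered = {k.lower(): v for k, v in response_headers.items()}
    return lowered.get("idempotent-replayed", "").strip().lower() == "true"
```

Generate the key once per logical operation and store it with the pending request, so that a retry after a timeout sends the same key and body.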

Where idempotency applies

The header is honoured on a fixed allowlist of small-body side-effecting routes (sourced from internal/management/testdata/idempotent-routes.json):

Method	Route
POST	`/tma/workspaces/{id}/cancel-downgrade`
POST	`/tma/workspaces/{id}/cu/purchase`
POST	`/tma/workspaces/{id}/downgrade`
POST	`/tma/workspaces/{id}/upgrade`
PUT	`/v1/admin/cocoon/{network}/models`

Adding a route to this list is gated by a CI test that compares the runtime registry against the JSON allowlist, so changes here are deliberate and reviewable. The header is not honoured on:

Chat completions (/v1/chat/completions) — streaming response, buffer would block flush.
MCP/JSON-RPC tool calls (/mcp, /rpc) — same SSE-flush constraint.
Any inference path through the Cocoon SDK or its HTTP shim.

Production teams that need at-most-once semantics on inference typically wrap the call themselves at the application layer, keyed on a request hash plus their own dedupe table.
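One way to sketch that application-layer wrapper is shown below. Everything here is an assumption for illustration: the class name `InferenceDeduper`, the in-memory dict standing in for a durable dedupe table, and the `send` callable standing in for your inference client.

```python
import hashlib
import json
from typing import Any, Callable, Dict

class InferenceDeduper:
    """Skip re-execution of identical requests; the dict stands in for a real table."""

    def __init__(self) -> None:
        self._seen: Dict[str, Any] = {}

    @staticmethod
    def request_hash(payload: dict) -> str:
        # Canonical JSON so key order does not change the hash.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def call_once(self, payload: dict, send: Callable[[dict], Any]) -> Any:
        key = self.request_hash(payload)
        if key in self._seen:
            return self._seen[key]   # already executed: return the stored result
        result = send(payload)       # hypothetical inference call
        self._seen[key] = result     # recorded after success, so failures can retry
        return result
```

This version records the result only after the call returns. A stricter at-most-once variant would write an in-flight marker before sending and would need a policy for calls whose outcome is unknown.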

Caps and lifecycle

Limit	Value	Notes
Replay TTL	24 hours	After 24h, the (api_key_id, key) slot is reclaimed and a fresh request can run.
Maximum key length	255 characters	Keys with whitespace or non-printable characters are rejected.
Maximum request body	1 MiB	Anything over this is rejected with `SHROUD_INVALID_PARAMS`.
Maximum response capture	1 MiB	If the handler emits more, the slot is released rather than persisted; the bytes still reach the original caller, but a retry will re-execute.
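The key-length and body-size caps can be checked before a request leaves the client. This is a sketch under stated assumptions: `check_idempotent_request` is a hypothetical helper, and it interprets "whitespace or non-printable" through Python's `str.isspace` and `str.isprintable`, which may not match the gateway's exact rule.

```python
MAX_KEY_CHARS = 255
MAX_BODY_BYTES = 1024 * 1024  # 1 MiB

def check_idempotent_request(key: str, body: bytes) -> list:
    """Return a list of problems; an empty list means the caps are respected."""
    problems = []
    if len(key) > MAX_KEY_CHARS:
        problems.append("key exceeds 255 characters")
    if any(ch.isspace() or not ch.isprintable() for ch in key):
        problems.append("key contains whitespace or non-printable characters")
    if len(body) > MAX_BODY_BYTES:
        problems.append("body exceeds 1 MiB")
    return problems
```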

Idempotency error codes

Code	HTTP	Meaning
`SHROUD_INVALID_IDEMPOTENCY_KEY`	400	Header missing required prefix, exceeds 255 chars, contains whitespace or non-printable chars, or sent without an authenticated API key.
`SHROUD_IDEMPOTENCY_KEY_MISMATCH`	422	Key replayed with a body hash that differs from the cached one. Reusing a key means replaying the exact same request; if you change the request, you need a new key.
`SHROUD_CONFLICT`	409	Same `(api_key_id, key)` slot is in flight on a concurrent request. Retry once the original completes.
`SHROUD_INVALID_PARAMS`	400	Request body exceeds the 1 MiB idempotency-cache cap.

Rate limits and CU budgets

The gateway enforces two independent layers of throttling:

Request rate — a token-bucket limiter per API key (or per source IP for anonymous calls). Trips on bursts.
Credit Unit budget — a workspace-wide plan cap plus optional per-key rolling 24h/30d ceilings. Trips on cumulative usage.

Both surfaces emit the canonical Shroud envelope (see Error reference) with HTTP 429 Too Many Requests.

Request rate (token bucket)

The limiter is a per-key token bucket with rate = plan_rps and burst = 2 × plan_rps. Tokens regenerate continuously at the rate limit, so a steady stream at the rate is fine; short bursts up to twice the rate also clear without an error.

Bucket	Rate	Burst	Retry-After
Free key	2 RPS	4	1
Developer key	10 RPS	20	1
Startup key	50 RPS	100	1
Enterprise key	200 RPS	400	1
Per-IP, no API key	5 RPS	5	1

Anonymous traffic is bucketed per source IP (/32 for IPv4, /64 by default for IPv6) so a missing or malformed Authorization header caps a single client at 5 RPS regardless of plan. The cap is enforced even on routes that do not strictly require an API key — it sits ahead of expensive auth and billing work to keep public surfaces from being abused. See Authentication for the key-format and Bearer-prefix rules that govern when the per-key bucket applies.
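To see which clients share an anonymous bucket, the prefix rule can be reproduced with the standard `ipaddress` module. The helper name `ip_bucket_key` is hypothetical, and this sketch does not treat IPv4-mapped IPv6 addresses specially.

```python
import ipaddress

def ip_bucket_key(source_ip: str, v6_prefix: int = 64) -> str:
    """Map a source address to its bucket: /32 for IPv4, /64 by default for IPv6."""
    addr = ipaddress.ip_address(source_ip)
    prefix = 32 if addr.version == 4 else v6_prefix
    # strict=False masks off the host bits instead of raising.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network)
```

Two IPv6 clients inside the same /64 therefore draw from one 5 RPS budget, while every IPv4 address gets its own.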

A per-key override (api_keys.rps_limit) can lower the effective rate further: the limiter uses min(key_rps, plan_rps) so a Startup plan key with rps_limit = 5 is capped at 5 RPS even though the plan allows 50.
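A client-side model of the limiter helps when pacing a job to stay under the bucket. The sketch below is illustrative, not the gateway's implementation: the class name `TokenBucket` is hypothetical, time is passed in explicitly so the behaviour is deterministic, and it assumes the burst is twice the effective rate when a per-key override applies.

```python
from typing import Optional

class TokenBucket:
    """Token bucket with rate = min(key_rps, plan_rps) and burst = 2 x rate."""

    def __init__(self, plan_rps: float, key_rps: Optional[float] = None,
                 now: float = 0.0) -> None:
        self.rate = plan_rps if key_rps is None else min(key_rps, plan_rps)
        self.burst = 2 * self.rate   # assumption: burst tracks the effective rate
        self.tokens = self.burst     # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For example, `TokenBucket(plan_rps=50, key_rps=5)` admits a burst of 10 requests at once and then 5 per second, which matches the Startup-plan override described above under the stated burst assumption.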

Response headers

On rejection the gateway sets:

Header	Value	Meaning
`Content-Type`	`application/json`	Both 429 paths return JSON envelopes; nothing falls back to plain text.
`Retry-After`	`1` (RPS) / `60` (CU)	Seconds. Treat as a floor and add jittered backoff on top.
`X-Request-Id`	echoed	Pass `X-Request-Id` on the request and the gateway echoes it on the response. Logging this id alongside any 429 makes server-side correlation a one-query lookup.

The OpenAI-compatible chat-completions path additionally returns Retry-After: 5 on 503 Service Unavailable when the upstream Cocoon worker is briefly unreachable; that envelope follows OpenAI shape rather than the SHROUD envelope. See OpenAI-compatible API for the special case.

CU budget rejection

Limit	`error_code`	`details.window`
Per-key 24h	`SHROUD_CU_LIMIT_EXCEEDED`	`"24h"`
Per-key 30d	`SHROUD_CU_LIMIT_EXCEEDED`	`"30d"`
Workspace plan + purchased balance	`SHROUD_CU_LIMIT_EXCEEDED`	absent

{
  "error": "CU limit exceeded",
  "error_code": "SHROUD_CU_LIMIT_EXCEEDED",
  "details": {
    "window": "24h",
    "used_cu_milli": 100000,
    "limit_cu_milli": 100000
  }
}

The same code covers both per-key and workspace scopes. Distinguish them by the presence of details.window — there is no separate plan_cu_limit_exceeded code.

Distinguishing rate-limit and CU rejections in client code:

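# Sketch for a requests-style response object; adapt to your HTTP client.
import time

def handle_429(resp, wait_or_upgrade):
    """Handle a 429; wait_or_upgrade(seconds, scope) is your own CU-budget hook."""
    if resp.status_code != 429:
        return
    try:
        retry_after = int(resp.headers.get("Retry-After", "1"))
    except ValueError:
        retry_after = 1  # header was not an integer number of seconds
    body = resp.json() if resp.content else {}
    if body.get("error_code") == "SHROUD_CU_LIMIT_EXCEEDED":
        # CU budget: slow signal. Pace, alert, or upgrade.
        details = body.get("details") or {}
        scope = "per-key" if "window" in details else "workspace"
        wait_or_upgrade(retry_after, scope)
    else:
        # SHROUD_RATE_LIMITED: back off briefly.
        time.sleep(retry_after)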

Worked example: CU consumption against a per-tier limit

A team on the Developer plan (29,000,000 milli-CU/month included, 10 RPS) calls the Cocoon Qwen/Qwen3-32B model from a long-running batch job. Each call uses 50,000 prompt + 15,000 completion tokens (see Billing — Worked example). At a price_per_token_nano of 80 and a TON/USD rate of $5.50, each call consumes about 28,600 milli-CU.

How many such calls fit in the included monthly allowance:

calls_per_month = 29_000_000 / 28_600 ≈ 1,014

The 10 RPS bucket lets the job sustain 10 calls per second, but every call charges 28,600 milli-CU. At a sustained 10 RPS the workspace exhausts its monthly included budget in roughly 100 seconds:

seconds_to_drain = 29_000_000 / (10 × 28_600) ≈ 101 s

After that the workspace either begins consuming purchased CU balance or, if there is none, rejects every subsequent call with SHROUD_CU_LIMIT_EXCEEDED and Retry-After: 60. Retrying against a workspace-level cap does not help: the cap will not clear until the next billing period. Provision more purchased CU, upgrade the plan, or pace the job.
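The arithmetic in this example can be reproduced in a few lines. One conversion is an assumption: the sketch takes 1 CU to be worth $0.001 (so 1 milli-CU is 1,000 nano-USD), which is the rate that reproduces the 28,600 milli-CU figure. Check the Billing page for the authoritative definition.

```python
PROMPT_TOKENS = 50_000
COMPLETION_TOKENS = 15_000
PRICE_PER_TOKEN_NANO = 80          # nanoTON per token
TON_USD = 5.50                     # USD per TON
NANO_USD_PER_MILLI_CU = 1_000      # assumption: 1 CU = $0.001
INCLUDED_MILLI_CU = 29_000_000     # Developer plan, per month
PLAN_RPS = 10

tokens = PROMPT_TOKENS + COMPLETION_TOKENS
cost_nano_usd = tokens * PRICE_PER_TOKEN_NANO * TON_USD
milli_cu_per_call = cost_nano_usd / NANO_USD_PER_MILLI_CU

calls_per_month = INCLUDED_MILLI_CU / milli_cu_per_call
seconds_to_drain = INCLUDED_MILLI_CU / (PLAN_RPS * milli_cu_per_call)

print(f"{milli_cu_per_call:,.0f} milli-CU per call")
print(f"about {calls_per_month:,.0f} calls per month")
print(f"about {seconds_to_drain:.0f} s to drain at {PLAN_RPS} RPS")
```

Swapping in your own token counts and plan allowance gives a quick estimate of how far below the RPS ceiling a batch job has to run to last the month.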

CU budget hygiene

Read _meta.usedCUMilli (MCP), result.usedCUMilli (JSON-RPC), or usage.usedCUMilli (Cocoon SDK) on every successful tool call to track real cost.
For inference, watch usage.totalTokens in the streaming done event; billing converts tokens to CU via the per-network pricing config.
Set a workspace-level alert at, e.g., 70% of monthly CU so you hear about it before the overage cliff.
On a sustained streak of usedCUMilli == 0 for successful calls, treat it as a billing-feed incident. The gateway falls back to zero on missing pricing config or missing TON/USD rate; the caller sees free service but the workspace is silently uncharged.
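The alert threshold and the zero-usage check above can share one small tracker. This is a sketch only: the class name `CuMonitor`, the alert strings, and the default streak length of 20 calls are hypothetical choices, since the guide does not define how long a "sustained streak" is.

```python
class CuMonitor:
    """Track cumulative usedCUMilli and flag budget or billing-feed anomalies."""

    def __init__(self, monthly_limit_milli: int, alert_fraction: float = 0.70,
                 zero_streak_threshold: int = 20) -> None:
        self.limit = monthly_limit_milli
        self.alert_fraction = alert_fraction
        self.zero_streak_threshold = zero_streak_threshold  # hypothetical default
        self.total = 0
        self.zero_streak = 0
        self.budget_alerted = False

    def record(self, used_cu_milli: int) -> list:
        """Record one successful call and return any alerts it triggers."""
        alerts = []
        self.total += used_cu_milli
        self.zero_streak = self.zero_streak + 1 if used_cu_milli == 0 else 0
        if self.zero_streak == self.zero_streak_threshold:
            alerts.append("billing-feed: sustained usedCUMilli == 0")
        if not self.budget_alerted and self.total >= self.alert_fraction * self.limit:
            self.budget_alerted = True   # fire the budget alert only once
            alerts.append("budget: usage crossed alert threshold")
        return alerts
```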

Error handling map

The full structured-code list is auto-generated and lives at Error codes. Use the table below as the caller-side action map for each code:

Code	HTTP	Action
`SHROUD_UNAUTHORIZED`	401	Prompt the user to re-authenticate. Don't retry.
`SHROUD_PERMISSION_DENIED`	403	The key lacks the capability for this tool. Surface to the user; switch keys.
`SHROUD_NOT_FOUND`	404	Resource missing. Don't retry.
`SHROUD_TOOL_NOT_FOUND`	404	Tool name is wrong or unsupported on this deployment. Call `tools.list` (JSON-RPC) or `GET /.well-known/mcp` to discover supported tools.
`SHROUD_INVALID_PARAMS`	400	Request schema violated. Validate before retry.
`SHROUD_SCHEMA_VIOLATION`	400	InputSchema validation failed. Check the tool definition; compare to your payload.
`SHROUD_BATCH_TOO_LARGE`	400	`shroud_batch` over 20 inner calls. Split.
`SHROUD_WRITE_IN_BATCH`	400	`shroud_batch` cannot include write tools. Pull the write call out and run it directly.
`SHROUD_INVALID_IDEMPOTENCY_KEY`	400	Idempotency-Key invalid for this route. Drop the header or fix it.
`SHROUD_IDEMPOTENCY_KEY_MISMATCH`	422	Same key, different body. Use a fresh key.
`SHROUD_CONFIRMATION_REQUIRED`	202	Special: this is a write tool's first call. Resubmit with the issued `confirmation_token`.
`SHROUD_CONFIRMATION_EXPIRED`	410	Confirmation token expired. Re-issue the original request to get a fresh token.
`SHROUD_CONFLICT`	409	Resource state conflicts with request. Re-read, reconcile, retry.
`SHROUD_RATE_LIMITED`	429	RPS limit. Honour `Retry-After: 1`.
`SHROUD_CU_LIMIT_EXCEEDED`	429	CU limit. Pace, alert, or upgrade. Honour `Retry-After`.
`SHROUD_TOOL_EXECUTION_ERROR`	500	Upstream tool/model error. Retry with backoff.
`SHROUD_INTERNAL_ERROR`	500	Unexpected server error. One backoff retry; persistent → file a bug.
`SHROUD_SERVICE_UNAVAILABLE`	503	Upstream briefly down. Retry after `Retry-After`.

The envelope shape (where errorCode lives) differs by transport:

MCP — _meta.errorCode on a tool result with isError: true.
JSON-RPC — error.data.errorCode, alongside the JSON-RPC numeric code.
OpenAI HTTP path — OpenAI-shape {error: {message, type, code}} for OpenAI-compatible endpoints; SHROUD codes appear on the Shroud-native surface.
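A single extraction helper keeps transport differences out of the retry logic. The sketch below is an assumption-laden illustration: the function name and the transport labels (`"mcp"`, `"jsonrpc"`, `"http"`, `"openai"`) are hypothetical, and the `"http"` branch reads the top-level `error_code` field as it appears in the CU rejection example earlier on this page.

```python
from typing import Optional

def extract_error_code(transport: str, payload: dict) -> Optional[str]:
    """Pull the error code out of a parsed response body, per transport."""
    if transport == "mcp":
        # Only tool results flagged isError carry a meaningful code.
        if payload.get("isError"):
            return (payload.get("_meta") or {}).get("errorCode")
        return None
    if transport == "jsonrpc":
        error = payload.get("error") or {}
        return (error.get("data") or {}).get("errorCode")
    if transport == "http":
        return payload.get("error_code")
    if transport == "openai":
        # OpenAI-shape envelope: the code is not a SHROUD_* code.
        return (payload.get("error") or {}).get("code")
    raise ValueError(f"unknown transport: {transport}")
```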

Full envelope examples in Error reference.

Observability

What to log per request:

HTTP status, response time, retry count.
_meta.usedCUMilli (MCP) / result.usedCUMilli (JSON-RPC) / usage.totalTokens (Cocoon streaming) — for cost attribution.
_meta.errorCode on failures (MCP) — never log the raw key.
Idempotent-Replayed header presence — distinguishes "this was a fresh side effect" from "we replayed a cached response".
Mcp-Session-Id (MCP) — useful for correlating an agent's tool call sequence on the server side.

What not to log:

API keys, full Authorization headers.
Raw request/response bodies for inference (they can contain user content). Hash a request id locally if you need to correlate.
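One way to honour both lists is to build log records from an allowlist of fields rather than copying headers wholesale. The helper below is a hypothetical sketch: the name `safe_log_record`, the field names, and the truncated SHA-256 of a locally generated request id are illustrative choices.

```python
import hashlib
from typing import Mapping

def safe_log_record(status: int, elapsed_ms: float, retries: int,
                    request_id: str, response_headers: Mapping[str, str]) -> dict:
    """Build a log record from allowlisted fields; nothing else is copied."""
    headers = {k.lower(): v for k, v in response_headers.items()}
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return {
        "status": status,
        "elapsed_ms": elapsed_ms,
        "retries": retries,
        "request_id_hash": digest[:16],   # correlate without storing content
        "idempotent_replayed": headers.get("idempotent-replayed") == "true",
        "mcp_session_id": headers.get("mcp-session-id"),
    }
```

Because the record is assembled field by field, an Authorization header or a response body can never end up in the log by accident.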

For Cocoon SDK callers, also surface stream.usage.verified === false — it indicates the TEE-signed usage report failed verification, which is a security signal.

Confidentiality posture

This guide covers operational resilience. The orthogonal question — "is this deployment actually confidential?" — is covered by the Threat model production checklist. Both checklists should pass before claiming the privacy guarantee in production:

DCAP attestation mode on (TDX_VERIFICATION_MODE=dcap).
Non-empty TDX_ALLOWED_IMAGE_HASHES on the bridge deployment.
SDK uses the default AttestationPolicy or an explicit allowlist (never null).
Caller surfaces usage.verified failures.
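The two environment-variable items lend themselves to a deploy-time preflight. This sketch checks only what the list states: the function name is hypothetical, the comparison against `dcap` is assumed to be exact and case-sensitive, and the format of the hash list is not validated because it is not specified here.

```python
from typing import Mapping

def confidentiality_preflight(env: Mapping[str, str]) -> list:
    """Return a list of failed checks; an empty list means both settings are present."""
    problems = []
    if env.get("TDX_VERIFICATION_MODE", "") != "dcap":
        problems.append("TDX_VERIFICATION_MODE is not dcap")
    if not env.get("TDX_ALLOWED_IMAGE_HASHES", "").strip():
        problems.append("TDX_ALLOWED_IMAGE_HASHES is empty")
    return problems
```

Calling it with `os.environ` in a startup script and refusing to boot on a non-empty result turns the checklist into an enforced gate. The SDK policy and `usage.verified` items still need their own checks in application code.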

See also

Migrate from OpenAI — the drop-in path this guide assumes you've already taken.
Authentication — keys, plans, CU.
Billing — Credit Unit definitions, pricing formula, worked examples.
Error reference — canonical error envelope, HTTP status table, transport-specific notes.
Error codes — auto-generated source-of-truth table.
Threat model — confidentiality production checklist.

Last modified: 08 May 2026