Production guide
Operational recipes for shipping Shroud — retry policy, idempotency, rate-limit handling, and a code-by-code error map. Use this page alongside the Threat model — Production checklist, which covers the confidentiality posture you also need before declaring a deployment production-ready.
Retry & backoff
Shroud distinguishes transient failures (worth retrying with backoff) from permanent failures (signalling a bug in the caller or a state issue). Retry rules:
| Status | When | Retry? |
|---|---|---|
| | Success / accepted-with-required-followup | n/a |
| | Caller bug: request is wrong | No. Fix the caller, don't retry. |
| | RPS budget tripped | Yes, after |
| | Plan/key CU budget exhausted | Yes, after |
| | Server-side bug | Once with backoff. If persistent, file a bug. |
| | Upstream tool/model failure | Yes, with backoff. |
| | Upstream briefly unavailable | Yes, after |
Recommended backoff schedule for retryable failures: exponential with jitter, capped. A reasonable default is
When the response carries Retry-After, always honour it as the floor — your computed delay should be max(retry_after, computed).
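A minimal sketch of that rule, assuming full jitter and illustrative values for `base` and `cap` (both are placeholders chosen for this example, not documented Shroud defaults):

```python
import random


def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    `base` and `cap` are assumed values for illustration only.
    """
    # Exponential growth, capped so the delay cannot grow without bound.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a point uniformly between 0 and the ceiling.
    computed = rng() * ceiling
    # Retry-After, when present, is the floor: max(retry_after, computed).
    if retry_after is not None:
        return max(float(retry_after), computed)
    return computed
```

Passing `rng` explicitly makes the schedule deterministic in tests; production callers can leave the default.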
Don't retry these
400 SHROUD_INVALID_PARAMS / SHROUD_SCHEMA_VIOLATION: caller request shape is wrong.
401 SHROUD_UNAUTHORIZED: key invalid, expired, or wrong environment. Surface to the user; don't loop.
403 SHROUD_PERMISSION_DENIED: the key lacks the capability for this tool. Provision a different key or change the request.
404 SHROUD_TOOL_NOT_FOUND / SHROUD_NOT_FOUND: the resource doesn't exist; retrying will not summon it.
409 SHROUD_CONFLICT: request conflicts with current state. Read state, then re-attempt with a corrected request.
410 SHROUD_CONFIRMATION_EXPIRED: confirmation token expired. Get a fresh one (re-issue the original request without the expired token) instead of replaying.
422 SHROUD_IDEMPOTENCY_KEY_MISMATCH: same key reused with a different body. Generate a new idempotency key.
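The list above can be encoded as a simple guard in a retry loop. This is a sketch: the set holds only the codes named above, and `must_not_retry` is a hypothetical helper name.

```python
# Codes listed above as permanent failures; retrying them cannot succeed
# until the caller changes the request, the key, or the state.
NON_RETRYABLE_CODES = frozenset({
    "SHROUD_INVALID_PARAMS",
    "SHROUD_SCHEMA_VIOLATION",
    "SHROUD_UNAUTHORIZED",
    "SHROUD_PERMISSION_DENIED",
    "SHROUD_TOOL_NOT_FOUND",
    "SHROUD_NOT_FOUND",
    "SHROUD_CONFLICT",
    "SHROUD_CONFIRMATION_EXPIRED",
    "SHROUD_IDEMPOTENCY_KEY_MISMATCH",
})


def must_not_retry(error_code):
    """True when the structured code is on the do-not-retry list."""
    return error_code in NON_RETRYABLE_CODES
```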
Idempotency
Side-effecting POST/PUT routes accept a Stripe-style Idempotency-Key header. The middleware caches the (api_key_id, key, body-hash) tuple for 24 hours and replays the exact (status, body, content-type) for any subsequent request that matches.
Replays carry an Idempotent-Replayed: true response header so the client knows the call wasn't re-executed.
Where idempotency applies
The header is honoured on a fixed allowlist of small-body side-effecting routes (sourced from internal/management/testdata/idempotent-routes.json):
| Method | Route |
|---|---|
| POST | |
| POST | |
| POST | |
| POST | |
| PUT | |
Adding a route to this list is gated by a CI test that compares the runtime registry against the JSON allowlist, so changes here are deliberate and reviewable. The header is not honoured on:
Chat completions (/v1/chat/completions): streaming response; buffering would block the flush.
MCP/JSON-RPC tool calls (/mcp, /rpc): same SSE-flush constraint.
Any inference path through the Cocoon SDK or its HTTP shim.
Production teams that need at-most-once semantics on inference typically wrap the call themselves at the application layer, keyed on a request hash plus their own dedupe table.
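A rough sketch of such a wrapper follows. The in-memory dict stands in for a real dedupe table, and `InferenceDeduper`, `call_once`, and the `send` callable are hypothetical names, not part of Shroud or the Cocoon SDK. The slot is claimed before the call is sent, so an attempt with an unknown outcome is never silently re-sent.

```python
import hashlib
import json

_PENDING = object()  # marks a claimed slot whose outcome is not yet known


class InferenceDeduper:
    def __init__(self):
        self._seen = {}  # request hash -> result (stand-in for a dedupe table)

    @staticmethod
    def request_hash(payload):
        # Canonical JSON so key order does not change the hash.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def call_once(self, payload, send):
        key = self.request_hash(payload)
        if key in self._seen:
            prior = self._seen[key]
            if prior is _PENDING:
                raise RuntimeError("earlier attempt has unknown outcome; not re-sending")
            return prior
        self._seen[key] = _PENDING  # claim the slot before sending
        result = send(payload)
        self._seen[key] = result
        return result
```

A production version would persist the table and expire entries; both are left out here.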
Caps and lifecycle
| Limit | Value | Notes |
|---|---|---|
| Replay TTL | 24 hours | After 24h, the (api_key_id, key) slot is reclaimed and a fresh request can run. |
| Maximum key length | 255 characters | Keys with whitespace or non-printable characters are rejected. |
| Maximum request body | 1 MiB | Anything over this is rejected with |
| Maximum response capture | 1 MiB | If the handler emits more, the slot is released rather than persisted; the bytes still reach the original caller, but a retry will re-execute. |
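Callers can check the key and body caps locally before sending. The sketch below covers only the limits in the table above; `idempotency_precheck` is a hypothetical helper, and any server-side rule not listed in that table is not checked.

```python
MAX_KEY_CHARS = 255
MAX_BODY_BYTES = 1024 * 1024  # 1 MiB


def idempotency_precheck(key, body):
    """Return a list of problems; an empty list means the caps are respected."""
    problems = []
    if len(key) > MAX_KEY_CHARS:
        problems.append("key length")
    # str.isprintable() treats a plain space as printable, so test whitespace too.
    if any(ch.isspace() or not ch.isprintable() for ch in key):
        problems.append("key characters")
    if len(body) > MAX_BODY_BYTES:
        problems.append("body size")
    return problems
```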
Idempotency error codes
| Code | HTTP | Meaning |
|---|---|---|
| | 400 | Header missing required prefix, exceeds 255 chars, contains whitespace or non-printable chars, or sent without an authenticated API key. |
| | 422 | Key replayed with a body hash that differs from the cached one. Reuse means exact re-replay; mutate the request and you need a new key. |
| | 409 | Same |
| | 400 | Request body exceeds the 1 MiB idempotency-cache cap. |
Rate limits and CU budgets
The gateway enforces two independent layers of throttling:
Request rate — a token-bucket limiter per API key (or per source IP for anonymous calls). Trips on bursts.
Credit Unit budget — a workspace-wide plan cap plus optional per-key rolling 24h/30d ceilings. Trips on cumulative usage.
Both surfaces emit the canonical Shroud envelope (see Error reference) with HTTP 429 Too Many Requests.
Request rate (token bucket)
The limiter is a per-key token bucket with rate = plan_rps and burst = 2 × plan_rps. Tokens regenerate continuously at the rate limit, so a steady stream at the rate is fine; short bursts up to twice the rate also clear without an error.
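To reason about pacing on the client side, the behaviour described above can be modelled with a small token bucket. This is a sketch, not the gateway's implementation: the class name is hypothetical, and starting with a full bucket is an assumption.

```python
class TokenBucket:
    def __init__(self, plan_rps):
        self.rate = float(plan_rps)    # tokens regenerated per second
        self.burst = 2.0 * plan_rps    # bucket capacity: burst = 2 x plan_rps
        self.tokens = self.burst       # assumption: the bucket starts full
        self.last = 0.0                # timestamp of the previous check

    def allow(self, now):
        """Return True if a request at time `now` (seconds) would clear."""
        elapsed = max(0.0, now - self.last)
        # Continuous refill at the plan rate, never above the burst capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `plan_rps = 2`, four back-to-back requests clear, the fifth is refused, and one more token is available half a second later.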
| Bucket | Rate | Burst | Retry-After |
|---|---|---|---|
| Free key | 2 RPS | 4 | 1 |
| Developer key | 10 RPS | 20 | 1 |
| Startup key | 50 RPS | 100 | 1 |
| Enterprise key | 200 RPS | 400 | 1 |
| Per-IP, no API key | 5 RPS | 5 | 1 |
Anonymous traffic is bucketed per source IP (/32 for IPv4, /64 by default for IPv6) so a missing or malformed Authorization header caps a single client at 5 RPS regardless of plan. The cap is enforced even on routes that do not strictly require an API key — it sits ahead of expensive auth and billing work to keep public surfaces from being abused. See Authentication for the key-format and Bearer-prefix rules that govern when the per-key bucket applies.
A per-key override (api_keys.rps_limit) can lower the effective rate further: the limiter uses min(key_rps, plan_rps) so a Startup plan key with rps_limit = 5 is capped at 5 RPS even though the plan allows 50.
Response headers
On rejection the gateway sets:
| Header | Value | Meaning |
|---|---|---|
| | | Both 429 paths return JSON envelopes; nothing falls back to plain text. |
| | | Seconds. Treat as a floor and add jittered backoff on top. |
| | echoed | Pass |
The OpenAI-compatible chat-completions path additionally returns Retry-After: 5 on 503 Service Unavailable when the upstream Cocoon worker is briefly unreachable; that envelope follows OpenAI shape rather than the SHROUD envelope. See OpenAI-compatible API for the special case.
CU budget rejection
| Limit | | |
|---|---|---|
| Per-key 24h | | |
| Per-key 30d | | |
| Workspace plan + purchased balance | | absent |
The same code covers both per-key and workspace scopes. Distinguish them by the presence of details.window — there is no separate plan_cu_limit_exceeded code.
Distinguishing rate-limit and CU rejections in client code:
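One way to branch on a 429 is sketched below. It assumes the caller has already pulled the structured code and the `details` object out of the envelope. It treats any 429 that is not SHROUD_CU_LIMIT_EXCEEDED as a request-rate rejection, and the returned labels are hypothetical.

```python
def classify_429(error_code, details=None):
    """Classify a 429 as 'rate_limit', 'cu_per_key', or 'cu_workspace'."""
    details = details or {}
    if error_code != "SHROUD_CU_LIMIT_EXCEEDED":
        # Assumption: the only other 429 path is the token-bucket limiter.
        return "rate_limit"
    # Per-key ceilings carry details.window; the workspace cap does not.
    if "window" in details:
        return "cu_per_key"
    return "cu_workspace"
```

A `rate_limit` result is worth a short backoff. A `cu_workspace` result is not, because the cap does not clear within a retry window.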
Worked example: CU consumption against a per-tier limit
A team on the Developer plan (29,000,000 milli-CU/month included, 10 RPS) calls the Cocoon Qwen/Qwen3-32B model from a long-running batch job. Each call uses 50,000 prompt + 15,000 completion tokens (see Billing — Worked example). At a price_per_token_nano of 80 and a TON/USD rate of $5.50, each call consumes about 28,600 milli-CU.
How many such calls fit in the included monthly allowance:
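Using only the figures quoted above, the count works out as follows (variable names are illustrative):

```python
included_milli_cu = 29_000_000   # Developer plan monthly allowance
milli_cu_per_call = 28_600       # cost of one call in this example

# Whole calls that fit before the allowance runs out.
calls_in_allowance = included_milli_cu // milli_cu_per_call
print(calls_in_allowance)  # 1013
```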
The 10 RPS bucket lets the job sustain 10 calls per second, but every call charges 28,600 milli-CU. At a sustained 10 RPS the workspace exhausts its monthly included budget in roughly 100 seconds:
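The same figures give the burn rate and time to exhaustion (variable names are illustrative):

```python
included_milli_cu = 29_000_000   # Developer plan monthly allowance
milli_cu_per_call = 28_600       # cost of one call in this example
calls_per_second = 10            # sustained rate allowed by the 10 RPS bucket

burn_per_second = milli_cu_per_call * calls_per_second   # 286,000 milli-CU/s
seconds_to_exhaust = included_milli_cu / burn_per_second
print(round(seconds_to_exhaust, 1))  # 101.4
```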
After that the workspace either begins consuming purchased CU balance or, if there is none, every subsequent call is rejected with SHROUD_CU_LIMIT_EXCEEDED and Retry-After: 60. Retrying against a workspace-level cap is not effective: the cap will not clear until the next billing period. Provision more purchased CU, upgrade the plan, or pace the job.
CU budget hygiene
Read _meta.usedCUMilli (MCP), result.usedCUMilli (JSON-RPC), or usage.usedCUMilli (Cocoon SDK) on every successful tool call to track real cost.
For inference, watch usage.totalTokens in the streaming done event; billing converts via the per-network pricing config.
Set a workspace-level alert at, e.g., 70% of monthly CU before the overage cliff.
On a sustained streak of usedCUMilli == 0 for successful calls, treat it as a billing-feed incident. The gateway falls back to zero on missing pricing config or missing TON/USD rate; the caller sees free service but the workspace is silently uncharged.
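The zero-cost streak check can be kept as a small counter next to the client. In this sketch the class name and the default threshold are assumptions; pick a threshold that matches your traffic.

```python
class ZeroCostStreak:
    def __init__(self, threshold=20):
        self.threshold = threshold  # assumed value; tune per workload
        self.streak = 0

    def observe(self, used_cu_milli):
        """Record one successful call; True once the streak hits the threshold."""
        if used_cu_milli == 0:
            self.streak += 1
        else:
            self.streak = 0  # any charged call resets the streak
        return self.streak >= self.threshold
```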
Error handling map
The full structured-code list is auto-generated and lives at Error codes. Use the table below as the caller-side action map for each code:
| Code | HTTP | Action |
|---|---|---|
| | 401 | Prompt the user to re-authenticate. Don't retry. |
| | 403 | The key lacks the capability for this tool. Surface to the user; switch keys. |
| | 404 | Resource missing. Don't retry. |
| | 404 | Tool name is wrong or unsupported on this deployment. Call |
| | 400 | Request schema violated. Validate before retry. |
| | 400 | InputSchema validation failed. Check the tool definition; compare to your payload. |
| | 400 | |
| | 400 | |
| | 400 | Idempotency-Key invalid for this route. Drop the header or fix it. |
| | 422 | Same key, different body. Use a fresh key. |
| | 202 | Special: this is a write tool's first call. Resubmit with the issued |
| | 410 | Confirmation token expired. Re-issue the original request to get a fresh token. |
| | 409 | Resource state conflicts with request. Re-read, reconcile, retry. |
| | 429 | RPS limit. Honour |
| | 429 | CU limit. Pace, alert, or upgrade. Honour |
| | 500 | Upstream tool/model error. Retry with backoff. |
| | 500 | Unexpected server error. One backoff retry; if persistent, file a bug. |
| | 503 | Upstream briefly down. Retry after |
The envelope shape (where errorCode lives) differs by transport:
MCP: _meta.errorCode on a tool result with isError: true.
JSON-RPC: error.data.errorCode, alongside the JSON-RPC numeric code.
OpenAI HTTP path: OpenAI-shape {error: {message, type, code}} for OpenAI-compatible endpoints; SHROUD codes appear on the Shroud-native surface.
Full envelope examples in Error reference.
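A sketch of pulling the code out per transport, assuming the payload has already been parsed into a dict. The transport labels (`"mcp"`, `"jsonrpc"`, `"openai"`) and the function name are hypothetical; the field paths are the ones listed above.

```python
def extract_error_code(transport, payload):
    """Return the error code for a parsed response, or None if there is none."""
    if transport == "mcp":
        # Tool result: the code only applies when isError is true.
        if payload.get("isError"):
            return (payload.get("_meta") or {}).get("errorCode")
        return None
    if transport == "jsonrpc":
        error = payload.get("error") or {}
        return (error.get("data") or {}).get("errorCode")
    if transport == "openai":
        # OpenAI-shape envelope: {error: {message, type, code}}.
        return (payload.get("error") or {}).get("code")
    raise ValueError(f"unknown transport: {transport}")
```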
Observability
What to log per request:
HTTP status, response time, retry count.
_meta.usedCUMilli (MCP) / result.usedCUMilli (JSON-RPC) / usage.totalTokens (Cocoon streaming), for cost attribution.
_meta.errorCode on failures (MCP); never log the raw key.
Idempotent-Replayed header presence: distinguishes "this was a fresh side effect" from "we replayed a cached response".
Mcp-Session-Id (MCP): useful for correlating an agent's tool call sequence on the server side.
What not to log:
API keys, full Authorization headers.
Raw request/response bodies for inference (they can contain user content). Hash a request id locally if you need to correlate.
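The two lists above can be folded into one record builder that takes only the safe fields. This is a sketch: the function name and record field names are hypothetical, and the request id is hashed locally so it can be correlated without storing bodies or credentials.

```python
import hashlib


def build_log_record(status, elapsed_ms, retry_count, response_headers,
                     request_id, used_cu_milli=None, error_code=None):
    """Build a log record from safe fields only; no keys, no bodies."""
    # Header names are case-insensitive, so normalise before lookup.
    headers = {k.lower(): v for k, v in response_headers.items()}
    return {
        "status": status,
        "elapsed_ms": elapsed_ms,
        "retry_count": retry_count,
        "used_cu_milli": used_cu_milli,
        "error_code": error_code,
        "idempotent_replayed": headers.get("idempotent-replayed") == "true",
        "mcp_session_id": headers.get("mcp-session-id"),
        # Truncated SHA-256 of the local request id, for correlation only.
        "request_id_hash": hashlib.sha256(request_id.encode("utf-8")).hexdigest()[:16],
    }
```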
For Cocoon SDK callers, also surface stream.usage.verified === false — it indicates the TEE-signed usage report failed verification, which is a security signal.
Confidentiality posture
This guide covers operational resilience. The orthogonal question — "is this deployment actually confidential?" — is covered by the Threat model production checklist. Both checklists should pass before claiming the privacy guarantee in production:
DCAP attestation mode on (TDX_VERIFICATION_MODE=dcap).
Non-empty TDX_ALLOWED_IMAGE_HASHES on the bridge deployment.
SDK uses the default AttestationPolicy or an explicit allowlist (never null).
Caller surfaces usage.verified failures.
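The two environment-level items can be checked in a deploy-time preflight. In this sketch, `confidentiality_preflight` is a hypothetical helper that inspects a mapping of the bridge deployment's environment. It does not cover the SDK policy or the usage.verified items, which need checks in application code.

```python
def confidentiality_preflight(env):
    """Return a list of failed checks; an empty list means both items pass."""
    failures = []
    if env.get("TDX_VERIFICATION_MODE") != "dcap":
        failures.append("TDX_VERIFICATION_MODE is not dcap")
    # Only emptiness is checked; the value format is not assumed here.
    if not env.get("TDX_ALLOWED_IMAGE_HASHES", "").strip():
        failures.append("TDX_ALLOWED_IMAGE_HASHES is empty")
    return failures
```

In a deploy script this could be called as `confidentiality_preflight(os.environ)`, failing the rollout when the list is non-empty.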
Related
Migrate from OpenAI — the drop-in path this guide assumes you've already taken.
Authentication — keys, plans, CU.
Billing — Credit Unit definitions, pricing formula, worked examples.
Error reference — canonical error envelope, HTTP status table, transport-specific notes.
Error codes — auto-generated source-of-truth table.
Threat model — confidentiality production checklist.