Production guide

Operational recipes for shipping Shroud — retry policy, idempotency, rate-limit handling, and a code-by-code error map. Use this page alongside the Threat model — Production checklist, which covers the confidentiality posture you also need before declaring a deployment production-ready.

Retry & backoff

Shroud distinguishes transient failures (worth retrying with backoff) from permanent failures (signalling a bug in the caller or a state issue). Retry rules:

Status | When | Retry?
200, 202 | Success / accepted-with-required-followup | n/a
4xx (other than 429) | Caller bug: the request is wrong | No. Fix the caller, don't retry.
429 (Rate limit) | RPS budget tripped | Yes, after Retry-After seconds.
429 (CU limit) | Plan/key CU budget exhausted | Yes, after Retry-After seconds, but consider upgrading the plan first.
500 SHROUD_INTERNAL_ERROR | Server-side bug | Once with backoff. If persistent, file a bug.
500 SHROUD_TOOL_EXECUTION_ERROR | Upstream tool/model failure | Yes, with backoff.
503 SHROUD_SERVICE_UNAVAILABLE | Upstream briefly unavailable | Yes, after Retry-After (typically 5) seconds.

Recommended backoff schedule for retryable failures: exponential with jitter, capped. A reasonable default is

attempt_delay = min(60, base * 2^attempt) + random(0, jitter)

base     = 0.5 s
jitter   = 0.5 s
max      = 60 s
attempts = 5

When the response carries Retry-After, always honour it as the floor — your computed delay should be max(retry_after, computed).
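The schedule and the Retry-After floor can be sketched in a few lines of Python. This is a minimal illustration, not a client shipped with Shroud: the function name backoff_delay and the optional rng parameter are assumptions made for the example, and the constants are the defaults suggested above.

```python
import random
from typing import Optional

BASE_S = 0.5      # base delay from the suggested schedule
JITTER_S = 0.5    # upper bound of the random jitter
MAX_S = 60.0      # cap on the exponential term
MAX_ATTEMPTS = 5  # stop retrying after this many attempts


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    source = rng or random
    computed = min(MAX_S, BASE_S * 2 ** attempt) + source.uniform(0, JITTER_S)
    if retry_after is None:
        return computed
    # Retry-After is a floor: never wait less than the server asked for.
    return max(retry_after, computed)
```

A caller would loop over range(MAX_ATTEMPTS), sleep for backoff_delay(attempt, retry_after) after each retryable failure, and give up once the attempts are used up.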

Don't retry these

  • 400 SHROUD_INVALID_PARAMS/SHROUD_SCHEMA_VIOLATION — caller request shape is wrong.

  • 401 SHROUD_UNAUTHORIZED — key invalid, expired, or wrong environment. Surface to the user; don't loop.

  • 403 SHROUD_PERMISSION_DENIED — the key lacks the capability for this tool. Provision a different key or change the request.

  • 404 SHROUD_TOOL_NOT_FOUND/SHROUD_NOT_FOUND — the resource doesn't exist; retrying will not summon it.

  • 409 SHROUD_CONFLICT — request conflicts with current state. Read state, then re-attempt with corrected request.

  • 410 SHROUD_CONFIRMATION_EXPIRED — confirmation token expired. Get a fresh one (re-issue the original request without the expired token) instead of replaying.

  • 422 SHROUD_IDEMPOTENCY_KEY_MISMATCH — same key reused with a different body. Generate a new idempotency key.
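One way to encode the retry rules and the exclusions above is a small function that returns how many retries a response is allowed. The name retry_budget and the max_attempts default are assumptions for this sketch, and treating an unrecognised 500 like SHROUD_INTERNAL_ERROR (one retry) is a conservative choice made here, not documented behaviour.

```python
from typing import Optional


def retry_budget(status: int, error_code: Optional[str],
                 max_attempts: int = 5) -> int:
    """Number of retries the rules above allow for this response."""
    if status in (429, 503):
        # Rate limit, CU limit, or upstream briefly unavailable.
        return max_attempts
    if status == 500:
        if error_code == "SHROUD_TOOL_EXECUTION_ERROR":
            return max_attempts
        # SHROUD_INTERNAL_ERROR: once with backoff, then file a bug.
        # Unknown 500s are treated the same way (assumption).
        return 1
    # 2xx needs no retry; other 4xx codes are caller bugs.
    return 0
```

Note that a 429 with SHROUD_CU_LIMIT_EXCEEDED is retryable by status, but a workspace-level cap may not clear for a long time, so callers usually pair this with an alert rather than a tight loop.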

Idempotency

Side-effecting POST/PUT routes accept a Stripe-style Idempotency-Key header. The middleware caches the (api_key_id, key, body-hash) tuple for 24 hours and replays the exact (status, body, content-type) for any subsequent request that matches.

POST /v1/some-side-effect
Authorization: Bearer shroud_prod_...
Idempotency-Key: 8f5b8c2d-2c5a-4f1f-9c11-e6f0a82c2c64
Content-Type: application/json

{ ... }

Replays carry an Idempotent-Replayed: true response header so the client knows the call wasn't re-executed.
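On the client side, the key point is to generate the key once per logical operation and reuse it on every retry. The sketch below assumes a hypothetical send callable that takes request headers and returns (status, response_headers); it omits the backoff sleep for brevity and assumes the response header name is spelled exactly as shown above.

```python
import uuid
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status, response headers)


def call_with_idempotency(send: Callable[[Mapping[str, str]], Response],
                          max_attempts: int = 3) -> Tuple[int, bool]:
    """Send one logical request, reusing a single key across retries.

    Returns (final status, whether the response was a cached replay).
    """
    headers = {"Idempotency-Key": str(uuid.uuid4())}  # generated once
    status: int = 0
    resp_headers: Mapping[str, str] = {}
    for _ in range(max_attempts):
        status, resp_headers = send(headers)
        if status < 500:
            break  # success or a non-retryable answer; stop here
    replayed = resp_headers.get("Idempotent-Replayed", "").lower() == "true"
    return status, replayed
```

If the first attempt actually succeeded server-side but the response was lost, the retry with the same key returns the cached response instead of repeating the side effect.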

Where idempotency applies

The header is honoured on a fixed allowlist of small-body side-effecting routes (sourced from internal/management/testdata/idempotent-routes.json):

Method | Route
POST | /tma/workspaces/{id}/cancel-downgrade
POST | /tma/workspaces/{id}/cu/purchase
POST | /tma/workspaces/{id}/downgrade
POST | /tma/workspaces/{id}/upgrade
PUT | /v1/admin/cocoon/{network}/models

Adding a route to this list is gated by a CI test that compares the runtime registry against the JSON allowlist, so changes here are deliberate and reviewable. The header is not honoured on:

  • Chat completions (/v1/chat/completions) — streaming response, buffer would block flush.

  • MCP/JSON-RPC tool calls (/mcp, /rpc) — same SSE-flush constraint.

  • Any inference path through the Cocoon SDK or its HTTP shim.

Production teams that need at-most-once semantics on inference typically wrap the call themselves at the application layer, keyed on a request hash plus their own dedupe table.
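A minimal sketch of such an application-layer wrapper follows. The class name InferenceDeduper is hypothetical, and the in-memory dict stands in for a real dedupe table. As written it only suppresses repeats of calls that completed; strict at-most-once semantics would also need a durable store and an in-flight marker written before the call.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class InferenceDeduper:
    """Skips re-execution of requests whose payload hash was already seen."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}  # stand-in for a dedupe table

    @staticmethod
    def request_hash(payload: Dict[str, Any]) -> str:
        # Canonical JSON so key order does not change the hash.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def run(self, payload: Dict[str, Any],
            call: Callable[[Dict[str, Any]], Any]) -> Any:
        key = self.request_hash(payload)
        if key not in self._results:
            self._results[key] = call(payload)  # executes only on first sight
        return self._results[key]
```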

Caps and lifecycle

Limit | Value | Notes
Replay TTL | 24 hours | After 24h, the (api_key_id, key) slot is reclaimed and a fresh request can run.
Maximum key length | 255 characters | Keys with whitespace or non-printable characters are rejected.
Maximum request body | 1 MiB | Anything over this is rejected with SHROUD_INVALID_PARAMS.
Maximum response capture | 1 MiB | If the handler emits more, the slot is released rather than persisted; the bytes still reach the original caller, but a retry will re-execute.

Idempotency error codes

Code | HTTP | Meaning
SHROUD_INVALID_IDEMPOTENCY_KEY | 400 | Header missing required prefix, exceeds 255 chars, contains whitespace or non-printable chars, or sent without an authenticated API key.
SHROUD_IDEMPOTENCY_KEY_MISMATCH | 422 | Key replayed with a body hash that differs from the cached one. Reuse means exact re-replay; mutate the request and you need a new key.
SHROUD_CONFLICT | 409 | Same (api_key_id, key) slot is in flight on a concurrent request. Retry once the original completes.
SHROUD_INVALID_PARAMS | 400 | Request body exceeds the 1 MiB idempotency-cache cap.
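Several of these rejections can be caught before the request leaves the client. The following pre-flight check is a sketch under stated assumptions: the function name is hypothetical, Python's Unicode-aware str.isprintable may not match the server's exact definition of "non-printable", and the required-prefix rule is not checked because the prefix is not specified on this page.

```python
from typing import List

MAX_KEY_CHARS = 255
MAX_BODY_BYTES = 1024 * 1024  # 1 MiB


def idempotency_precheck(key: str, body: bytes) -> List[str]:
    """Return the documented limits this request would violate, if any."""
    problems: List[str] = []
    if len(key) > MAX_KEY_CHARS:
        problems.append("key exceeds 255 characters")
    if any(ch.isspace() or not ch.isprintable() for ch in key):
        problems.append("key contains whitespace or non-printable characters")
    if len(body) > MAX_BODY_BYTES:
        problems.append("body exceeds 1 MiB")
    return problems
```

An empty list means the request passes the checks modelled here; the server remains the source of truth.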

Rate limits and CU budgets

The gateway enforces two independent layers of throttling:

  1. Request rate — a token-bucket limiter per API key (or per source IP for anonymous calls). Trips on bursts.

  2. Credit Unit budget — a workspace-wide plan cap plus optional per-key rolling 24h/30d ceilings. Trips on cumulative usage.

Both surfaces emit the canonical Shroud envelope (see Error reference) with HTTP 429 Too Many Requests.

Request rate (token bucket)

The limiter is a per-key token bucket with rate = plan_rps and burst = 2 × plan_rps. Tokens regenerate continuously at the rate limit, so a steady stream at the rate is fine; short bursts up to twice the rate also clear without an error.

Bucket | Rate | Burst | Retry-After (seconds)
Free key | 2 RPS | 4 | 1
Developer key | 10 RPS | 20 | 1
Startup key | 50 RPS | 100 | 1
Enterprise key | 200 RPS | 400 | 1
Per-IP, no API key | 5 RPS | 5 | 1

Anonymous traffic is bucketed per source IP (/32 for IPv4, /64 by default for IPv6) so a missing or malformed Authorization header caps a single client at 5 RPS regardless of plan. The cap is enforced even on routes that do not strictly require an API key — it sits ahead of expensive auth and billing work to keep public surfaces from being abused. See Authentication for the key-format and Bearer-prefix rules that govern when the per-key bucket applies.

A per-key override (api_keys.rps_limit) can lower the effective rate further: the limiter uses min(key_rps, plan_rps) so a Startup plan key with rps_limit = 5 is capped at 5 RPS even though the plan allows 50.
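To reason about how a job will interact with the limiter, a client-side model of the bucket can help. The sketch below is not the gateway's implementation: the class name is hypothetical, the bucket is assumed to start full, and under a per-key override the burst is assumed to be twice the effective rate, which this page does not state.

```python
from typing import Optional


class TokenBucket:
    """Client-side model of the per-key limiter: rate r, burst 2r."""

    def __init__(self, plan_rps: float, key_rps: Optional[float] = None,
                 now: float = 0.0) -> None:
        # Effective rate is min(key_rps, plan_rps) when an override exists.
        self.rate = plan_rps if key_rps is None else min(key_rps, plan_rps)
        self.burst = 2 * self.rate   # assumption when an override is set
        self.tokens = self.burst     # assumption: bucket starts full
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Feeding the model a planned request schedule (timestamps in seconds) shows roughly where 429s would start, before the job runs against the real gateway.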

Response headers

On rejection the gateway sets:

Header | Value | Meaning
Content-Type | application/json | Both 429 paths return JSON envelopes; nothing falls back to plain text.
Retry-After | 1 (RPS) / 60 (CU) | Seconds. Treat as a floor and add jittered backoff on top.
X-Request-Id | echoed | Pass X-Request-Id on the request and the gateway echoes it on the response. Logging this id alongside any 429 makes server-side correlation a one-query lookup.

The OpenAI-compatible chat-completions path additionally returns Retry-After: 5 on 503 Service Unavailable when the upstream Cocoon worker is briefly unreachable; that envelope follows OpenAI shape rather than the SHROUD envelope. See OpenAI-compatible API for the special case.

CU budget rejection

Limit | error_code | details.window
Per-key 24h | SHROUD_CU_LIMIT_EXCEEDED | "24h"
Per-key 30d | SHROUD_CU_LIMIT_EXCEEDED | "30d"
Workspace plan + purchased balance | SHROUD_CU_LIMIT_EXCEEDED | absent

{
  "error": "CU limit exceeded",
  "error_code": "SHROUD_CU_LIMIT_EXCEEDED",
  "details": {
    "window": "24h",
    "used_cu_milli": 100000,
    "limit_cu_milli": 100000
  }
}

The same code covers both per-key and workspace scopes. Distinguish them by the presence of details.window — there is no separate plan_cu_limit_exceeded code.

Distinguishing rate-limit and CU rejections in client code:

# Pseudocode: adapt to your HTTP client.
if resp.status_code == 429:
    retry_after = int(resp.headers.get("Retry-After", "1"))
    body = resp.json() if resp.content else {}
    if body.get("error_code") == "SHROUD_CU_LIMIT_EXCEEDED":
        # CU budget: slow signal. Pace, alert, or upgrade.
        scope = "per-key" if "window" in body.get("details", {}) else "workspace"
        wait_or_upgrade(retry_after, scope)
    else:
        # SHROUD_RATE_LIMITED: back off briefly.
        sleep(retry_after)

Worked example: CU consumption against a per-tier limit

A team on the Developer plan (29,000,000 milli-CU/month included, 10 RPS) calls the Cocoon Qwen/Qwen3-32B model from a long-running batch job. Each call uses 50,000 prompt + 15,000 completion tokens (see Billing — Worked example). At a price_per_token_nano of 80 and a TON/USD rate of $5.50, each call consumes about 28,600 milli-CU.

How many such calls fit in the included monthly allowance:

calls_per_month = 29_000_000 / 28_600 ≈ 1,014

The 10 RPS bucket lets the job sustain 10 calls per second, but every call charges 28,600 milli-CU. At a sustained 10 RPS the workspace exhausts its monthly included budget in roughly 100 seconds:

seconds_to_drain = 29_000_000 / (10 × 28_600) ≈ 101 s

After that, the workspace either begins consuming purchased CU balance or, if there is none, every subsequent call is rejected with SHROUD_CU_LIMIT_EXCEEDED and Retry-After: 60. Retrying against a workspace-level cap does not help: the cap will not clear until the next billing period. Provision more purchased CU, upgrade the plan, or pace the job.
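The same arithmetic, in milli-CU throughout, can be scripted to size a batch job before it runs. The figures are the ones from this example; the 30-day month used for pacing is an assumption, as are the variable names.

```python
INCLUDED_MILLI_CU = 29_000_000    # Developer plan monthly allowance
COST_PER_CALL_MILLI_CU = 28_600   # per-call cost from the example above
RPS = 10                          # Developer plan request rate

# How many calls the included allowance covers.
calls_per_month = INCLUDED_MILLI_CU / COST_PER_CALL_MILLI_CU

# How quickly a job running flat out at the RPS limit drains it.
seconds_to_drain = INCLUDED_MILLI_CU / (RPS * COST_PER_CALL_MILLI_CU)

# Pacing: spacing between calls that spreads the allowance over the month.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day billing period
seconds_between_calls = SECONDS_PER_MONTH / calls_per_month

print(f"{calls_per_month:.0f} calls, drained in {seconds_to_drain:.0f} s at {RPS} RPS")
print(f"pace: one call every {seconds_between_calls:.0f} s")
```

The pacing figure makes the gap concrete: the RPS limit allows ten calls per second, while the CU allowance supports roughly one call every 43 minutes.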

CU budget hygiene

  • Read _meta.usedCUMilli (MCP) or result.usedCUMilli (JSON-RPC) or usage.usedCUMilli (Cocoon SDK) on every successful tool call to track real cost.

  • For inference, watch usage.totalTokens in the streaming done event; billing converts tokens to CU via the per-network pricing config.

  • Set a workspace-level alert at, e.g., 70 % of monthly CU before the overage cliff.

  • Treat a sustained streak of usedCUMilli == 0 on successful calls as a billing-feed incident. The gateway falls back to zero when the pricing config or the TON/USD rate is missing; the caller sees free service while the workspace goes silently uncharged.
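The zero-cost streak check in the last bullet can be sketched as a small counter fed from each successful call. The class name and the default threshold of 20 are assumptions for illustration; pick a threshold that matches your traffic and the tools you call.

```python
class ZeroCostStreak:
    """Flags a run of successful calls that all reported usedCUMilli == 0."""

    def __init__(self, threshold: int = 20) -> None:
        self.threshold = threshold
        self.streak = 0

    def observe(self, used_cu_milli: int) -> bool:
        """Record one successful call; True once the streak hits the threshold."""
        if used_cu_milli == 0:
            self.streak += 1
        else:
            self.streak = 0  # any charged call resets the streak
        return self.streak >= self.threshold
```

When observe returns True, raise an alert rather than stopping traffic; the signal points at the billing feed, not at the caller.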

Error handling map

The full structured-code list is auto-generated and lives at Error codes. Use the table below as the caller-side action map for each code:

Code | HTTP | Action
SHROUD_UNAUTHORIZED | 401 | Prompt the user to re-authenticate. Don't retry.
SHROUD_PERMISSION_DENIED | 403 | The key lacks the capability for this tool. Surface to the user; switch keys.
SHROUD_NOT_FOUND | 404 | Resource missing. Don't retry.
SHROUD_TOOL_NOT_FOUND | 404 | Tool name is wrong or unsupported on this deployment. Call tools.list (JSON-RPC) or GET /.well-known/mcp to discover supported tools.
SHROUD_INVALID_PARAMS | 400 | Request schema violated. Validate before retry.
SHROUD_SCHEMA_VIOLATION | 400 | InputSchema validation failed. Check the tool definition; compare to your payload.
SHROUD_BATCH_TOO_LARGE | 400 | shroud_batch over 20 inner calls. Split.
SHROUD_WRITE_IN_BATCH | 400 | shroud_batch cannot mix write tools. Pull the write call out and run it directly.
SHROUD_INVALID_IDEMPOTENCY_KEY | 400 | Idempotency-Key invalid for this route. Drop the header or fix it.
SHROUD_IDEMPOTENCY_KEY_MISMATCH | 422 | Same key, different body. Use a fresh key.
SHROUD_CONFIRMATION_REQUIRED | 202 | Special: this is a write tool's first call. Resubmit with the issued confirmation_token.
SHROUD_CONFIRMATION_EXPIRED | 410 | Confirmation token expired. Re-issue the original request to get a fresh token.
SHROUD_CONFLICT | 409 | Resource state conflicts with request. Re-read, reconcile, retry.
SHROUD_RATE_LIMITED | 429 | RPS limit. Honour Retry-After: 1.
SHROUD_CU_LIMIT_EXCEEDED | 429 | CU limit. Pace, alert, or upgrade. Honour Retry-After.
SHROUD_TOOL_EXECUTION_ERROR | 500 | Upstream tool/model error. Retry with backoff.
SHROUD_INTERNAL_ERROR | 500 | Unexpected server error. One backoff retry; persistent → file a bug.
SHROUD_SERVICE_UNAVAILABLE | 503 | Upstream briefly down. Retry after Retry-After.

The envelope shape (where errorCode lives) differs by transport:

  • MCP: _meta.errorCode on a tool result with isError: true.

  • JSON-RPC: error.data.errorCode, alongside the JSON-RPC numeric code.

  • OpenAI HTTP path: OpenAI-shape {error: {message, type, code}} for OpenAI-compatible endpoints; SHROUD codes appear on the Shroud-native surface.

Full envelope examples in Error reference.
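A client that talks to more than one transport can normalise these shapes behind one helper. The sketch below assumes the payload is already parsed into a dict; the transport labels ("mcp", "jsonrpc", "openai", "http") and the function name are hypothetical, and the "http" branch reads error_code as shown in the CU rejection envelope earlier on this page.

```python
from typing import Any, Mapping, Optional


def extract_error_code(transport: str,
                       payload: Mapping[str, Any]) -> Optional[str]:
    """Return the error code from a parsed payload, or None if absent."""
    if transport == "mcp":
        # Tool result: the code sits in _meta only when isError is true.
        if not payload.get("isError"):
            return None
        return (payload.get("_meta") or {}).get("errorCode")
    if transport == "jsonrpc":
        error = payload.get("error") or {}
        return (error.get("data") or {}).get("errorCode")
    if transport == "openai":
        # OpenAI-shape envelope carries its own code, not a SHROUD code.
        return (payload.get("error") or {}).get("code")
    if transport == "http":
        return payload.get("error_code")
    raise ValueError(f"unknown transport: {transport}")
```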

Observability

What to log per request:

  • HTTP status, response time, retry count.

  • _meta.usedCUMilli (MCP) / result.usedCUMilli (JSON-RPC) / usage.totalTokens (Cocoon streaming) — for cost attribution.

  • _meta.errorCode on failures (MCP) — never log the raw key.

  • Idempotent-Replayed header presence — distinguishes "this was a fresh side effect" from "we replayed a cached response".

  • Mcp-Session-Id (MCP) — useful for correlating an agent's tool call sequence on the server side.

What not to log:

  • API keys, full Authorization headers.

  • Raw request/response bodies for inference (they can contain user content). Hash a request id locally if you need to correlate.
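The two lists above translate into a log record builder plus a header redactor. This is a sketch with hypothetical function and field names; it keeps status, timing, retry count, cost, and error code, records only the presence of Idempotent-Replayed, and stores a hash of the request id rather than any body content.

```python
import hashlib
from typing import Any, Dict, Mapping, Optional


def redact_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Copy headers, masking Authorization so keys never reach the log."""
    return {
        name: "[REDACTED]" if name.lower() == "authorization" else value
        for name, value in headers.items()
    }


def build_log_record(status: int, elapsed_ms: float, retry_count: int,
                     request_id: str, response_headers: Mapping[str, str],
                     used_cu_milli: Optional[int] = None,
                     error_code: Optional[str] = None) -> Dict[str, Any]:
    """Assemble the per-request fields recommended above."""
    lowered = {name.lower() for name in response_headers}
    return {
        "status": status,
        "elapsed_ms": elapsed_ms,
        "retry_count": retry_count,
        "used_cu_milli": used_cu_milli,
        "error_code": error_code,
        "idempotent_replayed": "idempotent-replayed" in lowered,
        "request_id_sha256": hashlib.sha256(
            request_id.encode("utf-8")).hexdigest(),
    }
```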

For Cocoon SDK callers, also surface stream.usage.verified === false — it indicates the TEE-signed usage report failed verification, which is a security signal.

Confidentiality posture

This guide covers operational resilience. The orthogonal question — "is this deployment actually confidential?" — is covered by the Threat model production checklist. Both checklists should pass before claiming the privacy guarantee in production:

  • DCAP attestation mode on (TDX_VERIFICATION_MODE=dcap).

  • Non-empty TDX_ALLOWED_IMAGE_HASHES on the bridge deployment.

  • SDK uses the default AttestationPolicy or an explicit allowlist (never null).

  • Caller surfaces usage.verified failures.
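The two environment-level items can be checked automatically at deploy time. The sketch below is an assumption-laden illustration: the function name is hypothetical, TDX_ALLOWED_IMAGE_HASHES is assumed to be a comma-separated list, and the SDK policy and usage.verified items still need a code review because they are not visible in the environment.

```python
from typing import List, Mapping


def confidentiality_preflight(env: Mapping[str, str]) -> List[str]:
    """Return the environment-level checklist items that fail."""
    failures: List[str] = []
    if env.get("TDX_VERIFICATION_MODE", "").strip() != "dcap":
        failures.append("TDX_VERIFICATION_MODE is not dcap")
    # Assumption: the allowlist is a comma-separated list of hashes.
    raw = env.get("TDX_ALLOWED_IMAGE_HASHES", "")
    hashes = [h.strip() for h in raw.split(",") if h.strip()]
    if not hashes:
        failures.append("TDX_ALLOWED_IMAGE_HASHES is empty")
    return failures
```

Calling confidentiality_preflight(os.environ) in a startup hook and refusing to start on a non-empty result keeps a misconfigured bridge from serving traffic.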

Last modified: 08 May 2026