Production checklist

A pre-launch checklist for taking a Shroud integration from a prototype to production traffic. Each section is a set of yes/no items you can walk through before you point real users at the gateway. Pair this page with the Production guide, which covers the operational runbook in narrative form.

Keys and authentication

  • [ ] Each environment (dev, staging, prod) has its own API key. A key is pinned to the environment its prefix encodes (shroud_dev_…, shroud_stage_…, shroud_prod_…), and the gateway rejects keys for any other environment before doing the database lookup. A key whose prefix does not match the deployment is always rejected with a 401. See Authentication.

  • [ ] Keys live only in your secret store (cloud secrets, Vault, encrypted env file). They are not in the repository, the CI config, the build logs, or the container image.

  • [ ] The Authorization header is sent in canonical form on every request: Authorization: Bearer <key>. Bare keys, lowercase bearer, or Token <key> are rejected.

  • [ ] Production keys have been rotated at least once after their initial issuance, and you have a documented rotation cadence (quarterly is a reasonable default).

  • [ ] If your deployment exposes IP allowlisting, the production key is restricted to the egress addresses of the calling service. Verify against the live Authentication reference for support status before relying on this.

  • [ ] You have a revocation runbook: how to invalidate a leaked key, who is paged, and how the calling services pick up the replacement without downtime.
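
The prefix and header items above can be checked on the client before any request leaves the process. The following is a minimal sketch, not part of any Shroud SDK: the function name auth_header, the environment labels used as dictionary keys, and the SHROUD_API_KEY variable in the usage note are assumptions made for illustration. Only the three key prefixes and the Bearer form come from this page.

```python
# Prefixes as listed in this checklist. The dictionary keys ("dev",
# "staging", "prod") are this sketch's own labels, not gateway values.
KEY_PREFIXES = {
    "dev": "shroud_dev_",
    "staging": "shroud_stage_",
    "prod": "shroud_prod_",
}


def auth_header(api_key: str, environment: str) -> dict[str, str]:
    """Build the canonical Authorization header for one environment.

    Raises ValueError when the key's prefix does not match the
    environment, so a misconfigured deployment fails at startup
    instead of collecting 401s at runtime.
    """
    prefix = KEY_PREFIXES[environment]
    if not api_key.startswith(prefix):
        # Deliberately do not echo the key itself into the error text.
        raise ValueError(
            f"API key does not carry the {prefix!r} prefix "
            f"expected for environment {environment!r}"
        )
    # Canonical form: capitalised "Bearer", one space, then the key.
    return {"Authorization": f"Bearer {api_key}"}
```

A caller would typically read the key from the secret store at startup, for example auth_header(os.environ["SHROUD_API_KEY"], "prod"), where SHROUD_API_KEY is a hypothetical variable name.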

Reliability

  • [ ] Every outbound call is wrapped in retry-with-backoff. Use exponential backoff with jitter; cap total retry duration well below the end-user-facing timeout you advertise.

  • [ ] 429 Too Many Requests and 503 Service Unavailable are treated as transient. The retry layer respects Retry-After when present (typical values: 1 second for rate-limit, 5 seconds for upstream unavailability) and waits at least that long before retrying.

  • [ ] Idempotency keys are sent on every request to billing and workspace-management endpoints (CU purchase, plan upgrade/downgrade, cancel-downgrade, admin model sync). Use UUID v4 for the key value. See the idempotency section of the Production guide.

  • [ ] Inference requests have a wall-clock timeout configured at the client. Streaming responses additionally have an idle timeout so that a stalled worker drops the connection cleanly. Both SDKs make exactly one connection attempt per call and do not retry, so your code owns the backoff.

  • [ ] Connection pooling and keep-alive are enabled in the HTTP client, and pool sizes have been sized against your peak concurrency rather than defaulted.
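
The retry items above can be sketched as a small wrapper around whatever function performs the HTTP call. This is an illustrative sketch under stated assumptions, not the SDKs' behaviour: retry_delay, call_with_retry, the send callable returning (status, headers, body), and the default base, cap, and budget values are all hypothetical. It handles only the numeric-seconds form of Retry-After and looks the header up by its exact name, so a case-insensitive header mapping would need adapting.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str], bytes]

# The two statuses this checklist treats as transient.
TRANSIENT_STATUSES = {429, 503}


def retry_delay(attempt: int, retry_after: Optional[str],
                base: float = 0.5, cap: float = 30.0,
                rng: Callable[[], float] = random.random) -> float:
    """Full-jitter exponential backoff, never shorter than Retry-After."""
    backoff = rng() * min(cap, base * (2 ** attempt))
    floor = 0.0
    if retry_after is not None:
        try:
            floor = max(0.0, float(retry_after))
        except ValueError:
            floor = 0.0  # HTTP-date form is not handled in this sketch
    return max(backoff, floor)


def call_with_retry(send: Callable[[], Response],
                    max_total_seconds: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> Response:
    """Call send() and retry 429/503 until the time budget runs out."""
    deadline = clock() + max_total_seconds
    attempt = 0
    while True:
        status, headers, body = send()
        if status not in TRANSIENT_STATUSES:
            return status, headers, body
        delay = retry_delay(attempt, headers.get("Retry-After"))
        if clock() + delay > deadline:
            # Waiting again would exceed the budget: surface the last
            # transient response instead of blocking the user further.
            return status, headers, body
        sleep(delay)
        attempt += 1
```

Set max_total_seconds well below the timeout you advertise to end users. For billing and workspace-management calls, generate the idempotency key once, outside send, so that every retry of the same logical operation carries the same value.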

Observability

  • [ ] Every request log line carries the request_id returned by the gateway (X-Request-Id on REST responses, _meta.requestId on MCP calls). When you open a support ticket, this is the field that lets us find the failure in our logs.

  • [ ] MCP-protocol callers also log Mcp-Session-Id. The session id ties together the initialize call and every subsequent tools/call from the same client. A 404 on a previously-valid session id is the signal to re-initialize.

  • [ ] You alert on the server-side error rate (5xx responses, plus any other non-2xx response that is not a 4xx with a caller-correctable cause) and on sustained 429/503 traffic. A spike in 429 against a single API key is the canonical signal of a runaway client.

  • [ ] CU consumption is metered from your side, not just trusted from the gateway response. Emit a counter for every usedCUMilli field you see in successful responses, dimensioned by route and model. This is how you reconcile the gateway invoice against your own view.

  • [ ] If you use the Cocoon SDK, you log the usage.verified flag on every response. A response with verified == false means the attestation policy could not confirm the TEE. Treat it as a reportable security event, not a routine warning.
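
The client-side CU metering item above can be sketched as a counter keyed by route and model. This sketch assumes that usedCUMilli appears as a top-level integer in the decoded JSON body; this page does not say where the field sits, so adjust the lookup to the real response shape. The names record_usage and cu_milli_by_route_model are hypothetical, and a production service would more likely emit to its metrics library than to an in-process Counter.

```python
from collections import Counter
from typing import Any, Mapping

# In-process tally of CU consumption, keyed by (route, model).
cu_milli_by_route_model: Counter = Counter()


def record_usage(route: str, model: str, status: int,
                 body: Mapping[str, Any]) -> None:
    """Add usedCUMilli from a successful response to the tally."""
    if not 200 <= status < 300:
        return  # only successful responses are counted
    used = body.get("usedCUMilli")  # assumed top-level field
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(used, int) and not isinstance(used, bool):
        cu_milli_by_route_model[(route, model)] += used
```

Summing the tally per billing period gives a figure you can compare against the gateway invoice.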

CU and budgets

  • [ ] Each production API key has a per-key CU cap that matches its expected workload, not the workspace-plan ceiling. The gateway enforces both per-key and plan-level CU caps and returns SHROUD_CU_LIMIT_EXCEEDED once either is hit; see Billing for the response envelope.

  • [ ] You alert at 80% of the monthly CU plan. Catching the slope at 80% gives you time to top up or split traffic to a second key before the plan-level cap fires.

  • [ ] You have budget headroom for traffic spikes. A doubling of normal traffic should not put you at the cap on the day it happens. Either set the cap above the spike envelope, or have a documented emergency top-up path so on-call can keep the system serving.

  • [ ] You have a CU-exhaustion runbook: what fallback your application offers users when SHROUD_CU_LIMIT_EXCEEDED is returned (degrade gracefully, queue, or fail the user-facing operation explicitly).
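
The 80% alert and the spike-headroom item above reduce to two small checks. The sketch below is one possible reading, with hypothetical function names and units (milli-CU, matching the usedCUMilli field). It interprets "a doubling of normal traffic" as twice a normal day's consumption landing on top of what has already been used in the period.

```python
def should_alert(used_cu_milli: int, plan_cu_milli: int,
                 threshold_percent: int = 80) -> bool:
    """True once consumption reaches the alert threshold of the plan.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return used_cu_milli * 100 >= plan_cu_milli * threshold_percent


def survives_spike(used_cu_milli: int, cap_cu_milli: int,
                   normal_daily_cu_milli: int,
                   spike_factor: int = 2) -> bool:
    """True if a day at spike_factor times normal traffic stays under the cap."""
    projected = used_cu_milli + spike_factor * normal_daily_cu_milli
    return projected < cap_cu_milli
```

For example, with a cap of 1,000,000 milli-CU, 600,000 already used, and a normal day of 100,000, a doubled day projects to 800,000 and stays under the cap. With 900,000 already used it projects to 1,100,000 and does not.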

Confidential inference (Cocoon SDK)

  • [ ] Attestation verification runs in CI. The CI job calls the SDK against a known-good TEE image and asserts that usage.verified == true on a successful response. This catches accidental policy regressions (an empty allowlist, a stale image hash) before they ship.

  • [ ] The attestation allowlist is pinned explicitly in production code. Do not rely on the SDK's default policy as a permanent answer; review the image hashes you accept and decide which set you trust for the deployment you are talking to. See Verification paths.

  • [ ] Production code checks usage.verified == true before treating a Cocoon SDK response as trusted. Fail closed if the field is absent or false: return an error to the caller rather than silently logging a warning. The default Cocoon SDK behaviour is permissive on policy misconfiguration; your app should not be.

  • [ ] The signed usage report is logged or persisted alongside the billing record so a future audit can correlate the on-platform invoice with a tamper-evident statement of token counts.
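
The fail-closed check on usage.verified can be sketched as a guard that every Cocoon SDK response passes through before the application trusts it. The response shape here is an assumption: the sketch treats the response as a mapping with a nested "usage" mapping holding "verified", which may not match the SDK's real objects. The names require_verified and UnverifiedResponseError are hypothetical.

```python
from typing import Any, Mapping


class UnverifiedResponseError(Exception):
    """Raised when a response cannot be shown to come from a verified TEE."""


def require_verified(response: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return the response only if usage.verified is exactly True.

    A missing usage block, a missing flag, False, or a non-boolean
    value all fail closed.
    """
    usage = response.get("usage")
    verified = usage.get("verified") if isinstance(usage, Mapping) else None
    if verified is not True:
        raise UnverifiedResponseError(
            "usage.verified is absent or not true; refusing to trust response"
        )
    return response
```

The identity comparison with True is deliberate: a truthy string such as "true" is rejected rather than accepted by accident. The same guard can back the CI assertion described above.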

OpenAI-path callers

  • [ ] You are aware that the gateway sees plaintext on POST /v1/chat/completions. The blind-relay guarantee documented in A primer on confidential AI applies to the Cocoon SDK path, not the plain HTTP path.

  • [ ] If confidentiality matters for the workload, you are using the Cocoon SDK and not the OpenAI HTTP path. If you cannot adopt the SDK in this release, this gap is on your roadmap with a date.

  • [ ] You have read Migrate from OpenAI and your code does not depend on OpenAI fields the gateway accepts but ignores (temperature, top_p, tools, and similar). What is and is not honoured is enumerated on the OpenAI-compatible API page; verify that every parameter your code sends is in the "honoured" list before you cut over.
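
The parameter audit in the last item can be automated in a test that runs against the request bodies your code actually builds. This is a sketch with a hypothetical helper name, unhonoured_params. The honoured set is an input you must copy from the OpenAI-compatible API page, and the set used in the usage note is a placeholder, not the gateway's real list.

```python
from typing import Any, Iterable, List, Mapping


def unhonoured_params(payload: Mapping[str, Any],
                      honoured: Iterable[str]) -> List[str]:
    """Return the top-level request fields that are not in the honoured set."""
    allowed = set(honoured)
    return sorted(name for name in payload if name not in allowed)
```

With a placeholder honoured set of {"model", "messages"}, a payload that also sends temperature and tools yields ["temperature", "tools"], which a pre-cutover test can assert is empty.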

Last modified: 08 May 2026