Cocoon — Confidential Inference
Cocoon provides end-to-end encrypted AI inference through a Trusted Execution Environment (TEE). All data is encrypted between the client SDK and the TEE — the Shroud gateway never sees plaintext.
How It Works
Connection Flow
Client opens a WebSocket to /v1/cocoon/stream and sends its Ed25519 public key
TEE responds with its Ed25519 public key and a TDX attestation quote
ECDH key exchange: both sides convert Ed25519 → X25519 and compute the shared secret
AES-256-GCM session key is derived: key = SHA-256(shared_secret)
Client encrypts the inference request and sends the ciphertext
TEE decrypts, runs the model, encrypts response chunks
Client decrypts each chunk as it arrives (streaming)
TEE sends a signed usage attestation at the end
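The key-derivation step in the flow above can be sketched with the Python standard library alone. This is a minimal illustration, assuming the X25519 shared secret is already available as raw bytes. The ECDH exchange and the AES-256-GCM encryption need a cryptography library and are not shown. The function names are hypothetical and not part of any Cocoon SDK.

```python
import hashlib
import os

def derive_session_key(shared_secret: bytes) -> bytes:
    """key = SHA-256(shared_secret): the 32 bytes that AES-256-GCM expects."""
    return hashlib.sha256(shared_secret).digest()

def fresh_nonce() -> bytes:
    """Random 12-byte nonce, generated anew for every encrypted message."""
    return os.urandom(12)

# Placeholder secret for illustration only; a real one comes from X25519.
shared_secret = bytes(32)
session_key = derive_session_key(shared_secret)
```

Because SHA-256 always yields 32 bytes, the derived value can be used directly as an AES-256 key without truncation or padding.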
The Shroud gateway only sees:
Encrypted blobs (ciphertext + nonce)
Model name (from the init message)
Token counts (from the signed usage report)
Endpoints
List Worker Types — Cocoon-native
A near-direct mirror of the upstream Cocoon client.workerTypesV2 reply. For each model that the gateway has registered in its catalog, returns the list of currently live workers with their per-worker coefficient (price multiplier) and load (active_requests/max_active_requests). The shape is intentionally raw so SDKs can pick a worker, compute capacity, or render a load graph without the gateway pre-aggregating.
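As a rough sketch of what an SDK might do with this raw data, the snippet below computes load and picks a worker. It assumes each worker has already been extracted into a dict carrying coefficient, active_requests, and max_active_requests; the surrounding response shape and the selection rule (cheapest first, then least loaded) are assumptions made for illustration.

```python
from typing import Optional

def load(worker: dict) -> float:
    """Fraction of the worker's capacity in use (active / max)."""
    max_active = worker["max_active_requests"]
    if max_active <= 0:
        return 1.0  # treat a worker with no capacity as full
    return worker["active_requests"] / max_active

def pick_worker(workers: list) -> Optional[dict]:
    """Cheapest worker with spare capacity; ties broken by lower load."""
    available = [w for w in workers if load(w) < 1.0]
    if not available:
        return None
    return min(available, key=lambda w: (w["coefficient"], load(w)))

# Made-up sample data.
workers = [
    {"coefficient": 1.0, "active_requests": 4, "max_active_requests": 4},
    {"coefficient": 1.5, "active_requests": 1, "max_active_requests": 8},
    {"coefficient": 1.2, "active_requests": 6, "max_active_requests": 8},
]
best = pick_worker(workers)
```

Here the cheapest worker is skipped because it is at capacity, so the worker with coefficient 1.2 is chosen.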
The top-level object discriminator is always the literal string "cocoon.workerTypes". When the upstream Cocoon proxy is unreachable the gateway responds with HTTP 503 and a Retry-After: 5 header; clients should back off rather than retry tightly.
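One possible way to honor the 503 response is to treat Retry-After as the base delay and double it on consecutive failures. The helper below is a hypothetical sketch, not SDK behavior; the doubling strategy and the 60-second cap are assumptions.

```python
from typing import Mapping, Optional

def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                cap: float = 60.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the response is not a 503."""
    if status != 503:
        return None
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        base = float(lowered.get("retry-after", "5"))
    except ValueError:
        base = 5.0  # Retry-After may also be an HTTP date; fall back to 5 s
    # Double the wait on each consecutive failure, up to the cap.
    return min(cap, base * (2 ** attempt))
```

A caller would pass attempt=0 for the first failure and increment it on each subsequent 503, resetting to zero after a success.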
List Models — OpenAI shim
A minimal OpenAI-compatible model catalog. Returns only id, object, and owned_by per entry — no pricing, no worker counts, no timestamps. Use this endpoint when an OpenAI SDK is enumerating models; use /v1/cocoon/workers when you need pricing or load.
The unprefixed /v1/models returns the union across every enabled Cocoon network, so the same model id can appear once per network with a distinct owned_by value. The network-prefixed forms (/cocoon-classic/v1/models, /cocoon-alpha/v1/models) return only the matching network. The created field that OpenAI's spec includes is intentionally omitted: Cocoon does not expose a meaningful creation timestamp for a registered model, so synthesizing one would mislead.
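Since the unprefixed listing can repeat a model id once per network, a client may want to group entries by id. The sketch below assumes the entries sit in a data array, as in the OpenAI list convention; the model ids and owned_by values are made up.

```python
from collections import defaultdict

def networks_by_model(listing: dict) -> dict:
    """Map each model id to the sorted owned_by values it appears under."""
    grouped = defaultdict(set)
    for entry in listing.get("data", []):
        grouped[entry["id"]].add(entry["owned_by"])
    return {model_id: sorted(owners) for model_id, owners in grouped.items()}

# Illustrative payload; ids and owners are placeholders.
listing = {
    "object": "list",
    "data": [
        {"id": "example-model", "object": "model", "owned_by": "network-a"},
        {"id": "example-model", "object": "model", "owned_by": "network-b"},
        {"id": "other-model", "object": "model", "owned_by": "network-a"},
    ],
}
grouped = networks_by_model(listing)
```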
Streaming Inference (WebSocket)
See Wire Protocol for the detailed message format.
Network-prefixed routes
Each of the routes above is also reachable under a network-prefixed form (/cocoon-classic/v1/... or /cocoon-alpha/v1/...) that pins the call to a specific Cocoon network instead of the deployment default. See Cocoon networks for the full grid and SDK pinning patterns.
Security Model
| Layer | Technology | Purpose |
|---|---|---|
| Key exchange | ECDH (Ed25519 → X25519) | Establish shared secret |
| Session encryption | AES-256-GCM | Encrypt all inference data |
| Nonce | Random 12 bytes per message | Prevent replay attacks |
| TEE attestation | Intel TDX quotes | Verify code integrity |
| Usage signing | Ed25519 signatures | Tamper-proof billing |
What the Gateway Can See
| Data | Visible to Gateway? |
|---|---|
| Prompts and responses | No (encrypted) |
| Model name | Yes (sent in init message) |
| Token counts | Yes (from signed usage report) |
| Session public keys | Yes (for routing) |
| Encrypted payloads | Yes (but cannot decrypt) |
Selective Disclosure
By default, the TEE reveals minimal metadata. You can request specific fields to be disclosed in cleartext via the disclose parameter:
| Field | Description |
|---|---|
| | Number of prompt tokens |
| | Number of cached tokens |
| | Number of completion tokens |
| | Number of reasoning tokens |
| total_tokens | Total token count (always included) |
| | Model ID used for inference |
| | Timestamp when proxy received request |
| | Timestamp when proxy finished |
| | Timestamp when inference worker started |
| | Timestamp when inference worker finished |
| | Worker debug information |
| | Proxy debug information |
The TEE responds with effective_disclose — the negotiated set of fields it will reveal. total_tokens is always included as a minimum for billing purposes.
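A client can compare what it asked for with what the TEE agreed to reveal. The helper below is a hypothetical sketch that treats both sides as sets of field names. Apart from total_tokens, the names used are placeholders, not real disclose fields.

```python
def check_disclosure(requested: set, effective: set) -> set:
    """Return the requested fields that the TEE declined to disclose.

    Raises ValueError if total_tokens is missing, since it is described
    as always included.
    """
    if "total_tokens" not in effective:
        raise ValueError("effective_disclose must include total_tokens")
    return requested - effective

# "example_field_a" and "example_field_b" are placeholders only.
requested = {"total_tokens", "example_field_a", "example_field_b"}
effective = {"total_tokens", "example_field_a"}
denied = check_disclosure(requested, effective)
```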
TEE Attestation
Each session includes a TDX attestation quote that proves:
Code integrity: the exact code running inside the TEE matches a known image hash
Key binding: the TEE's public key is bound to the quote via report_data = SHA-512(ed25519_pubkey)
Platform: the quote is generated by Intel TDX hardware
Attestation Verification
The SDKs verify attestation automatically when an AttestationPolicy is configured:
Parse the TDX quote (versions 3, 4, or 5)
Verify report_data == SHA-512(tee_public_key), which ensures the key belongs to this TEE
Compute image hash: SHA-256(MRTD || MR_CONFIG_ID || MR_OWNER || MR_OWNER_CONFIG || RTMR[0..3] || zeros[64])
Check image hash against the allowed list in the policy
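The hashing steps above can be sketched as follows, assuming the quote has already been parsed and its signature chain checked elsewhere. Several details here are assumptions rather than documented behavior: that RTMR[0..3] means all four runtime registers, that each measurement field is 48 bytes, and that the policy stores allowed hashes as lowercase hex strings. The function names are illustrative.

```python
import hashlib
import hmac

MEASUREMENT_LEN = 48  # assumed size of each TDX measurement field, in bytes

def key_is_bound(report_data: bytes, tee_public_key: bytes) -> bool:
    """Check report_data == SHA-512(tee_public_key) in constant time."""
    expected = hashlib.sha512(tee_public_key).digest()
    return hmac.compare_digest(report_data, expected)

def image_hash(mrtd, mr_config_id, mr_owner, mr_owner_config, rtmrs) -> bytes:
    """SHA-256 over the concatenated measurements followed by 64 zero bytes."""
    fields = [mrtd, mr_config_id, mr_owner, mr_owner_config, *rtmrs]
    if len(rtmrs) != 4 or any(len(f) != MEASUREMENT_LEN for f in fields):
        raise ValueError("expected eight 48-byte measurement fields")
    return hashlib.sha256(b"".join(fields) + bytes(64)).digest()

def passes_policy(digest: bytes, allowed_hex: set) -> bool:
    """True if the image hash appears in the policy's allowlist."""
    return digest.hex() in allowed_hex
```

A session would be accepted only if both key_is_bound and passes_policy return True for the same quote.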
For the end-to-end trust chain (open-source TEE proxy, reproducible build, image-hash allowlist, Intel DCAP signature path) see How attestation works. For the three audience-shaped verification workflows — default policy, custom allowlist, reproduce-the-build — see Verification paths. For what attestation does and does not protect against, see Threat model.
Usage Attestation
After inference completes, the TEE signs a usage report:
TEE serializes usage data to JSON
Computes hash = SHA-256(usage_json)
Signs with Ed25519: signature = Ed25519.sign(hash, tee_private_key)
Client verifies the signature using the TEE's public key from the session
This ensures token counts cannot be tampered with by the gateway.
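The client-side check can be sketched as below. The Python standard library has no Ed25519 implementation, so the verifier is passed in as a callable; its argument order is an assumption for this example, and any Ed25519 library could be wrapped to fit it.

```python
import hashlib
from typing import Callable

def verify_usage(usage_json: bytes, signature: bytes, tee_public_key: bytes,
                 ed25519_verify: Callable[[bytes, bytes, bytes], bool]) -> bool:
    """Recompute hash = SHA-256(usage_json) and check the signature over it.

    ed25519_verify(public_key, message, signature) is caller-supplied.
    """
    digest = hashlib.sha256(usage_json).digest()
    return ed25519_verify(tee_public_key, digest, signature)
```

The hash should be computed over the exact bytes received. Parsing the JSON and serializing it again can change key order or whitespace, which would change the hash and make a valid signature fail.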
TEE proxy configuration
These environment variables configure the TEE proxy (cocoon-bridge) at startup. They affect runtime behavior on a per-deployment basis and are managed by the operator, not the API caller.
Default-disable-thinking allowlist
| Variable | Default | Format |
|---|---|---|
| | | comma-separated list of model-name prefixes |
When a /v1/chat/completions request targets a model whose name starts with any prefix in this list and the caller did not set chat_template_kwargs.enable_thinking, the TEE proxy injects chat_template_kwargs.enable_thinking=false into the decrypted request before forwarding to the worker. This is the platform-wide default that produces the clean-content behavior described in Reasoning content and <think> tags.
Matching is case-sensitive prefix matching. Each comma-separated entry is trimmed; empty entries are dropped.
Examples:
The empty-string value is the documented escape hatch: it leaves all models at their native worker default, which for current Qwen3 weights is "thinking on". Use this if a deployment needs to expose chain-of-thought behavior to all callers without each caller setting chat_template_kwargs.enable_thinking=true.
Caller-set values always win over the platform default. If a request explicitly sets chat_template_kwargs.enable_thinking to either true or false, the proxy passes the value through unchanged regardless of the allowlist.
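The parsing, matching, and precedence rules described in this section can be sketched as follows. This illustrates the documented behavior and is not the proxy's actual code. The function names are hypothetical, the request is assumed to carry the model name in a model field, and the prefix and model names in the usage example are made up.

```python
import copy

def parse_prefixes(value: str) -> list:
    """Split on commas, trim each entry, and drop empty entries."""
    return [p.strip() for p in value.split(",") if p.strip()]

def apply_thinking_default(request: dict, prefixes: list) -> dict:
    """Return a copy of request with enable_thinking=False injected when the
    model matches a prefix and the caller left the flag unset."""
    result = copy.deepcopy(request)
    kwargs = result.get("chat_template_kwargs") or {}
    if "enable_thinking" in kwargs:
        return result  # a caller-set value always wins
    model = result.get("model", "")
    if any(model.startswith(p) for p in prefixes):  # case-sensitive match
        kwargs["enable_thinking"] = False
        result["chat_template_kwargs"] = kwargs
    return result

prefixes = parse_prefixes(" Qwen3, ,example ,")
patched = apply_thinking_default({"model": "Qwen3-example"}, prefixes)
```

With an empty allowlist value, parse_prefixes returns an empty list and every request passes through unchanged, which matches the escape hatch described above.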
Next Steps
Go SDK — Go client with ECDH, AES-256-GCM, and attestation
TypeScript SDK — TypeScript client for Node.js and browsers
Wire Protocol — WebSocket message format details