Cocoon — Confidential Inference

Cocoon provides end-to-end encrypted AI inference through a Trusted Execution Environment (TEE). Prompts and responses are encrypted between the client SDK and the TEE, so the Shroud gateway relays ciphertext and never sees their plaintext.

How It Works

┌────────────┐          ┌───────────────┐          ┌─────────────┐
│ Client SDK │◄────────►│ Shroud Gateway│◄────────►│ Cocoon TEE  │
│ (Go / TS)  │   E2E    │ (blind proxy) │encrypted │ (Intel TDX) │
│            │encrypted │               │  blobs   │             │
│ Encrypts & │          │ Auth, billing,│          │ Decrypts,   │
│ decrypts   │          │ rate limiting │          │ runs model, │
│ locally    │          │ token counts  │          │ re-encrypts │
└────────────┘          └───────────────┘          └─────────────┘

Connection Flow

  1. Client opens a WebSocket to /v1/cocoon/stream and sends its Ed25519 public key

  2. TEE responds with its Ed25519 public key and a TDX attestation quote

  3. ECDH key exchange — both sides convert Ed25519 → X25519, compute shared secret

  4. AES-256-GCM session key is derived: key = SHA-256(shared_secret)

  5. Client encrypts the inference request and sends the ciphertext

  6. TEE decrypts, runs the model, encrypts response chunks

  7. Client decrypts each chunk as it arrives (streaming)

  8. TEE sends a signed usage attestation at the end
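Step 4 of the flow above can be sketched in a few lines. This is a minimal illustration in Python, not the SDK implementation: the function names and the placeholder secret are hypothetical, the 32-byte X25519 shared secret from step 3 is assumed to be available already, and the ECDH exchange and AES-256-GCM encryption themselves are not shown.

```python
import hashlib
import secrets


def derive_session_key(shared_secret: bytes) -> bytes:
    """Step 4: key = SHA-256(shared_secret), which yields a 32-byte AES-256 key."""
    return hashlib.sha256(shared_secret).digest()


def new_nonce() -> bytes:
    """Fresh random 12-byte nonce, generated once per encrypted message."""
    return secrets.token_bytes(12)


# Placeholder standing in for the X25519 output of step 3 (not a real secret).
shared_secret = bytes(range(32))
session_key = derive_session_key(shared_secret)
nonce = new_nonce()
```

Both peers compute the same shared secret, so both arrive at the same session key without it ever crossing the gateway.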

The Shroud gateway only sees:

  • Encrypted blobs (ciphertext + nonce)

  • Model name (from the init message)

  • Token counts (from the signed usage report)

  • Session public keys (used for routing)

Endpoints

List Worker Types — Cocoon-native

GET /v1/cocoon/workers
Authorization: Bearer shroud_prod_...

The response is a near-direct mirror of the upstream Cocoon client.workerTypesV2 reply. For each model registered in the gateway's catalog, it returns the currently live workers, each with its coefficient (a price multiplier) and its load (active_requests out of max_active_requests). The shape is intentionally raw so that SDKs can pick a worker, compute capacity, or render a load graph without the gateway pre-aggregating.

{
  "object": "cocoon.workerTypes",
  "data": [
    {
      "name": "Qwen/Qwen3-32B",
      "workers": [
        { "coefficient": 1000, "active_requests": 0, "max_active_requests": 4 },
        { "coefficient": 2000, "active_requests": 2, "max_active_requests": 4 }
      ]
    }
  ]
}
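As an illustration of what a client might do with this raw shape, the sketch below computes spare capacity for a model and picks a worker. The helper names and the selection policy (lowest coefficient among workers with a free slot) are assumptions made for the example, not documented SDK behavior.

```python
def free_slots(worker: dict) -> int:
    """Unused request slots on one worker, never negative."""
    return max(worker["max_active_requests"] - worker["active_requests"], 0)


def model_capacity(model: dict) -> int:
    """Total free slots across all live workers for one model."""
    return sum(free_slots(w) for w in model["workers"])


def pick_worker(model: dict):
    """Cheapest worker that still has a free slot, or None if all are full."""
    candidates = [w for w in model["workers"] if free_slots(w) > 0]
    if not candidates:
        return None
    # Prefer the lowest price multiplier; break ties by most free slots.
    return min(candidates, key=lambda w: (w["coefficient"], -free_slots(w)))


# The sample reply from this page, as a Python dict.
reply = {
    "object": "cocoon.workerTypes",
    "data": [
        {
            "name": "Qwen/Qwen3-32B",
            "workers": [
                {"coefficient": 1000, "active_requests": 0, "max_active_requests": 4},
                {"coefficient": 2000, "active_requests": 2, "max_active_requests": 4},
            ],
        }
    ],
}

model = reply["data"][0]
capacity = model_capacity(model)
chosen = pick_worker(model)
```

For the sample reply, the model has six free slots and the worker with coefficient 1000 is selected.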

The top-level object discriminator is always the literal string "cocoon.workerTypes". When the upstream Cocoon proxy is unreachable the gateway responds with HTTP 503 and a Retry-After: 5 header; clients should back off rather than retry tightly.
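One way to back off on a 503 is sketched below. The function name, the exponential schedule, and the 30-second cap are assumptions for illustration; only the 503 status and the Retry-After header come from the behavior described above. The sketch handles the delta-seconds form of Retry-After and falls back to plain exponential backoff for anything else.

```python
from typing import Mapping, Optional


def retry_delay_seconds(status: int, headers: Mapping[str, str], attempt: int,
                        base: float = 1.0, cap: float = 30.0) -> Optional[float]:
    """Seconds to wait before retrying, or None when the status is not 503."""
    if status != 503:
        return None
    backoff = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to the cap
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        hinted = float(int(lowered.get("retry-after", "")))
    except ValueError:
        hinted = 0.0  # header missing or not in delta-seconds form
    # Never wait less than the server asked for; grow on repeated failures.
    return max(backoff, hinted)
```

With Retry-After: 5, the first retry waits 5 seconds, and later attempts wait longer once the exponential term exceeds the server's hint.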

List Models — OpenAI shim

GET /v1/models
Authorization: Bearer shroud_prod_...

A minimal OpenAI-compatible model catalog. Returns only id, object, and owned_by per entry — no pricing, no worker counts, no timestamps. Use this endpoint when an OpenAI SDK is enumerating models; use /v1/cocoon/workers when you need pricing or load.

{
  "object": "list",
  "data": [
    { "id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "cocoon-classic" },
    { "id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "cocoon-alpha" }
  ]
}

The unprefixed /v1/models returns the union across every enabled Cocoon network, so the same model id can appear once per network with a distinct owned_by value. The network-prefixed forms (/cocoon-classic/v1/models, /cocoon-alpha/v1/models) return only the matching network. The created field that OpenAI's spec includes is intentionally omitted: Cocoon does not expose a meaningful creation timestamp for a registered model, so synthesizing one would mislead.
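Because the unprefixed listing can repeat a model id once per network, a client that wants one row per model may group entries by id. The sketch below is one way to do that; the helper name is hypothetical and the input is the sample response shown on this page.

```python
from collections import defaultdict


def networks_by_model(listing: dict) -> dict:
    """Map each model id to the networks (owned_by values) that serve it."""
    grouped = defaultdict(list)
    for entry in listing["data"]:
        networks = grouped[entry["id"]]
        if entry["owned_by"] not in networks:  # keep first-seen order, no repeats
            networks.append(entry["owned_by"])
    return dict(grouped)


listing = {
    "object": "list",
    "data": [
        {"id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "cocoon-classic"},
        {"id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "cocoon-alpha"},
    ],
}

by_model = networks_by_model(listing)
```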

Streaming Inference (WebSocket)

WebSocket /v1/cocoon/stream
Authorization: Bearer shroud_prod_...

See Wire Protocol for the detailed message format.

Network-prefixed routes

Each of the routes above is also reachable under a network-prefixed form (/cocoon-classic/v1/... or /cocoon-alpha/v1/...) that pins the call to a specific Cocoon network instead of the deployment default. See Cocoon networks for the full grid and SDK pinning patterns.
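The prefixing rule can be expressed as a small path helper. This is a sketch under assumptions: the helper name is hypothetical, and the network list contains only the two names mentioned on this page, which may not be the full set.

```python
from typing import Optional

# Only the networks named on this page; the full list may differ.
KNOWN_NETWORKS = ("cocoon-classic", "cocoon-alpha")


def networked_path(path: str, network: Optional[str] = None) -> str:
    """Return the route path, pinned to a network when one is given."""
    if not path.startswith("/v1/"):
        raise ValueError(f"expected a /v1/... route, got {path!r}")
    if network is None:
        return path  # deployment default network
    if network not in KNOWN_NETWORKS:
        raise ValueError(f"unknown network {network!r}")
    return f"/{network}{path}"
```

For example, networked_path("/v1/models", "cocoon-alpha") gives /cocoon-alpha/v1/models, while omitting the network leaves the path unchanged.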

Security Model

| Layer              | Technology                  | Purpose                    |
|--------------------|-----------------------------|----------------------------|
| Key exchange       | ECDH (Ed25519 → X25519)     | Establish shared secret    |
| Session encryption | AES-256-GCM                 | Encrypt all inference data |
| Nonce              | Random 12 bytes per message | Prevent replay attacks     |
| TEE attestation    | Intel TDX quotes            | Verify code integrity      |
| Usage signing      | Ed25519 signatures          | Tamper-proof billing       |

What the Gateway Can See

| Data                  | Visible to Gateway?            |
|-----------------------|--------------------------------|
| Prompts and responses | No (encrypted)                 |
| Model name            | Yes (sent in init message)     |
| Token counts          | Yes (from signed usage report) |
| Session public keys   | Yes (for routing)              |
| Encrypted payloads    | Yes (but cannot decrypt)       |

Selective Disclosure

By default, the TEE reveals minimal metadata. You can request specific fields to be disclosed in cleartext via the disclose parameter:

| Field             | Description                              |
|-------------------|------------------------------------------|
| prompt_tokens     | Number of prompt tokens                  |
| cached_tokens     | Number of cached tokens                  |
| completion_tokens | Number of completion tokens              |
| reasoning_tokens  | Number of reasoning tokens               |
| total_tokens      | Total token count (always included)      |
| model             | Model ID used for inference              |
| proxy_start_time  | Timestamp when proxy received request    |
| proxy_end_time    | Timestamp when proxy finished            |
| worker_start_time | Timestamp when inference worker started  |
| worker_end_time   | Timestamp when inference worker finished |
| worker_debug      | Worker debug information                 |
| proxy_debug       | Proxy debug information                  |

The TEE responds with effective_disclose — the negotiated set of fields it will reveal. total_tokens is always included as a minimum for billing purposes.
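On the client side, the negotiation can be checked by comparing what was requested with what came back in effective_disclose. The sketch below assumes both are plain lists of field names; the helper names are hypothetical and the actual message shape is defined by the wire protocol, not by this example.

```python
# Field names taken from the table on this page.
DISCLOSABLE = frozenset({
    "prompt_tokens", "cached_tokens", "completion_tokens", "reasoning_tokens",
    "total_tokens", "model", "proxy_start_time", "proxy_end_time",
    "worker_start_time", "worker_end_time", "worker_debug", "proxy_debug",
})


def validate_disclose(requested) -> list:
    """Reject unknown field names before sending; return a de-duplicated list."""
    unknown = set(requested) - DISCLOSABLE
    if unknown:
        raise ValueError(f"unknown disclose fields: {sorted(unknown)}")
    return sorted(set(requested))


def withheld_fields(requested, effective_disclose) -> set:
    """Fields the caller asked for that the TEE did not agree to reveal."""
    return set(requested) - set(effective_disclose)


requested = validate_disclose(["model", "worker_debug", "model"])
# Example reply: the TEE grants "model" and always includes "total_tokens".
effective = ["total_tokens", "model"]
missing = withheld_fields(requested, effective)
```

Here the caller can see that worker_debug was not granted and adjust its expectations before reading the usage report.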

TEE Attestation

Each session includes a TDX attestation quote that proves:

  1. Code integrity — the exact code running inside the TEE matches a known image hash

  2. Key binding — the TEE's public key is bound to the quote via report_data = SHA-512(ed25519_pubkey)

  3. Platform — the quote is generated by Intel TDX hardware

Attestation Verification

The SDKs verify attestation automatically when an AttestationPolicy is configured:

  1. Parse the TDX quote (versions 3, 4, or 5)

  2. Verify report_data == SHA-512(tee_public_key) — ensures the key belongs to this TEE

  3. Compute image hash: SHA-256(MRTD || MR_CONFIG_ID || MR_OWNER || MR_OWNER_CONFIG || RTMR[0..3] || zeros[64])

  4. Check image hash against the allowed list in the policy
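Steps 2 to 4 above can be sketched with the standard library alone. This is an illustration, not the SDK code: the function names are hypothetical, the quote is assumed to be parsed already into raw byte fields, zeros[64] is read as 64 zero bytes, and the Intel signature chain over the quote is not checked here.

```python
import hashlib
import hmac


def key_is_bound(report_data: bytes, tee_public_key: bytes) -> bool:
    """Step 2: report_data must equal SHA-512 of the TEE's Ed25519 public key."""
    expected = hashlib.sha512(tee_public_key).digest()
    return hmac.compare_digest(report_data, expected)


def image_hash(mrtd: bytes, mr_config_id: bytes, mr_owner: bytes,
               mr_owner_config: bytes, rtmrs) -> bytes:
    """Step 3: SHA-256 over the measurement registers followed by 64 zero bytes."""
    if len(rtmrs) != 4:
        raise ValueError("expected RTMR[0] through RTMR[3]")
    h = hashlib.sha256()
    for part in (mrtd, mr_config_id, mr_owner, mr_owner_config, *rtmrs, bytes(64)):
        h.update(part)
    return h.digest()


def image_allowed(digest: bytes, allowed_hex) -> bool:
    """Step 4: accept only digests that appear in the policy's allowlist."""
    return any(hmac.compare_digest(digest, bytes.fromhex(a)) for a in allowed_hex)
```

A session would be accepted only when key_is_bound and image_allowed both return True for the values extracted from the quote.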

For the end-to-end trust chain (open-source TEE proxy, reproducible build, image-hash allowlist, Intel DCAP signature path) see How attestation works. For the three audience-shaped verification workflows — default policy, custom allowlist, reproduce-the-build — see Verification paths. For what attestation does and does not protect against, see Threat model.

Usage Attestation

After inference completes, the TEE signs a usage report:

  1. TEE serializes usage data to JSON

  2. Computes hash = SHA-256(usage_json)

  3. Signs with Ed25519: signature = Ed25519.sign(hash, tee_private_key)

  4. Client verifies the signature using the TEE's public key from the session

Because the gateway does not hold the TEE's private key, it cannot alter the token counts without invalidating the signature.
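The client-side check in step 4 might look like the sketch below. Python's standard library has no Ed25519 implementation, so the verifier is passed in as a callable; that callable, its argument order, and the function names are hypothetical. The sketch also assumes the usage JSON arrives as the exact bytes the TEE hashed, so it hashes those bytes as received and does not re-serialize them.

```python
import hashlib
import json
from typing import Callable


def usage_digest(usage_json: bytes) -> bytes:
    """Step 2: hash = SHA-256(usage_json), computed over the bytes as received."""
    return hashlib.sha256(usage_json).digest()


def verify_usage(usage_json: bytes, signature: bytes, tee_public_key: bytes,
                 ed25519_verify: Callable[[bytes, bytes, bytes], bool]) -> dict:
    """Return the parsed usage report only if the signature over its hash is valid.

    ed25519_verify(public_key, signature, message) is a caller-supplied
    Ed25519 verifier (hypothetical signature, for illustration only).
    """
    if not ed25519_verify(tee_public_key, signature, usage_digest(usage_json)):
        raise ValueError("usage attestation signature is invalid")
    return json.loads(usage_json)
```

Parsing the JSON only after the signature check keeps unverified token counts out of billing logic.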

TEE proxy configuration

These environment variables configure the TEE proxy (cocoon-bridge) at startup. They affect runtime behavior on a per-deployment basis and are managed by the operator, not the API caller.

Default-disable-thinking allowlist

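| Variable                               | Default        | Format                                      |
|----------------------------------------|----------------|---------------------------------------------|
| COCOON_DEFAULT_DISABLE_THINKING_MODELS | Qwen/Qwen3-32B | comma-separated list of model-name prefixes |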

When a /v1/chat/completions request targets a model whose name starts with any prefix in this list and the caller did not set chat_template_kwargs.enable_thinking, the TEE proxy injects chat_template_kwargs.enable_thinking=false into the decrypted request before forwarding to the worker. This is the platform-wide default that produces the clean-content behavior described in Reasoning content and <think> tags.

Matching is case-sensitive prefix matching. Each comma-separated entry is trimmed; empty entries are dropped.

Examples:

# Default: covers Qwen/Qwen3-32B and any model name that starts with it
COCOON_DEFAULT_DISABLE_THINKING_MODELS="Qwen/Qwen3-32B"

# Broaden to all Qwen3 variants without redeploying when new sizes ship
COCOON_DEFAULT_DISABLE_THINKING_MODELS="Qwen/Qwen3-"

# Multiple prefixes
COCOON_DEFAULT_DISABLE_THINKING_MODELS="Qwen/Qwen3-,deepseek-ai/DeepSeek-R1-"

# Disable platform-wide auto-injection (operator escape hatch)
COCOON_DEFAULT_DISABLE_THINKING_MODELS=""

The empty-string value is the documented escape hatch: it leaves all models at their native worker default, which for current Qwen3 weights is "thinking on". Use this if a deployment needs to expose chain-of-thought behavior to all callers without each caller setting chat_template_kwargs.enable_thinking=true.

Caller-set values always win over the platform default. If a request explicitly sets chat_template_kwargs.enable_thinking to either true or false, the proxy passes the value through unchanged regardless of the allowlist.
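The matching and precedence rules in this section can be summarized in a short sketch. It models the described behavior in Python for illustration only; it is not the cocoon-bridge source, and the function names are hypothetical.

```python
import copy


def parse_prefixes(raw: str) -> list:
    """Split the variable on commas, trim each entry, and drop empty entries."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def apply_thinking_default(request: dict, prefixes: list) -> dict:
    """Return a copy of the request with enable_thinking=false injected if needed."""
    out = copy.deepcopy(request)
    kwargs = dict(out.get("chat_template_kwargs") or {})
    if "enable_thinking" in kwargs:
        return out  # a caller-set value always wins, whether true or false
    # Case-sensitive prefix match against the model name.
    if any(out.get("model", "").startswith(p) for p in prefixes):
        kwargs["enable_thinking"] = False
        out["chat_template_kwargs"] = kwargs
    return out


prefixes = parse_prefixes(" Qwen/Qwen3- , ,deepseek-ai/DeepSeek-R1-")
injected = apply_thinking_default({"model": "Qwen/Qwen3-32B"}, prefixes)
```

An empty variable parses to an empty prefix list, so no request is modified, which matches the escape hatch described above.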

Next Steps

  • Go SDK — Go client with ECDH, AES-256-GCM, and attestation

  • TypeScript SDK — TypeScript client for Node.js and browsers

  • Wire Protocol — WebSocket message format details

Last modified: 08 May 2026