Cocoon — Confidential Inference

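Cocoon provides end-to-end encrypted AI inference through a Trusted Execution Environment (TEE). Prompts and responses are encrypted between the client SDK and the TEE, so the Shroud gateway never sees plaintext inference data. It sees only the routing and billing metadata listed under What the Gateway Can See.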

How It Works

┌─────────────┐          ┌────────────────┐          ┌─────────────┐
│  Client SDK │◄────────►│ Shroud Gateway │◄────────►│ Cocoon TEE  │
│  (Go / TS)  │   E2E    │ (blind proxy)  │encrypted │ (Intel TDX) │
│             │encrypted │                │  blobs   │             │
│  Encrypts & │          │ Auth, billing, │          │ Decrypts,   │
│  decrypts   │          │ rate limiting, │          │ runs model, │
│  locally    │          │ token counts   │          │ re-encrypts │
└─────────────┘          └────────────────┘          └─────────────┘

Connection Flow

1. Client opens a WebSocket to /v1/cocoon/stream and sends its Ed25519 public key
2. TEE responds with its Ed25519 public key and a TDX attestation quote
3. ECDH key exchange: both sides convert their Ed25519 keys to X25519 and compute a shared secret
4. AES-256-GCM session key is derived: key = SHA-256(shared_secret)
5. Client encrypts the inference request and sends the ciphertext
6. TEE decrypts, runs the model, and encrypts the response chunks
7. Client decrypts each chunk as it arrives (streaming)
8. TEE sends a signed usage attestation at the end

The Shroud gateway only sees:

Encrypted blobs (ciphertext + nonce)
Model name (from the init message)
Token counts (from the signed usage report)
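
The key conversion and key derivation steps of the connection flow can be sketched with the Python standard library. This is a minimal illustration, not the SDK implementation: the function names are hypothetical, and the private-key conversion (SHA-512 of the seed, then clamping) is the conventional Ed25519-to-X25519 mapping, which is assumed here rather than confirmed for Cocoon. The X25519 scalar multiplication and the AES-256-GCM encryption itself are deliberately left to a vetted cryptography library.

```python
import hashlib

P = 2**255 - 19  # prime field shared by Ed25519 and X25519


def ed25519_pub_to_x25519(ed_pub: bytes) -> bytes:
    """Map an Ed25519 public key to an X25519 one: u = (1 + y) / (1 - y) mod p."""
    if len(ed_pub) != 32:
        raise ValueError("Ed25519 public keys are 32 bytes")
    # Little-endian y-coordinate; the top bit carries the sign of x and is dropped.
    y = int.from_bytes(ed_pub, "little") & ((1 << 255) - 1)
    if y >= P or y == 1:
        raise ValueError("not a convertible Ed25519 point")
    u = (1 + y) * pow((1 - y) % P, -1, P) % P
    return u.to_bytes(32, "little")


def ed25519_seed_to_x25519_private(seed: bytes) -> bytes:
    """Assumed conversion: hash the 32-byte seed with SHA-512, then clamp."""
    if len(seed) != 32:
        raise ValueError("Ed25519 seeds are 32 bytes")
    scalar = bytearray(hashlib.sha512(seed).digest()[:32])
    scalar[0] &= 248   # clear the three low bits
    scalar[31] &= 127  # clear the top bit
    scalar[31] |= 64   # set the second-highest bit
    return bytes(scalar)


def derive_session_key(shared_secret: bytes) -> bytes:
    """key = SHA-256(shared_secret): a 32-byte key, as AES-256-GCM needs."""
    return hashlib.sha256(shared_secret).digest()
```

As a sanity check, converting the Ed25519 base point with ed25519_pub_to_x25519 yields the X25519 base point u = 9. The modular inverse via pow(x, -1, p) needs Python 3.8 or later.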

Endpoints

List Worker Types — Cocoon-native

GET /v1/cocoon/workers
Authorization: Bearer shroud_prod_...

The response is a near-direct mirror of the upstream Cocoon client.workerTypesV2 reply. For each model registered in the gateway's catalog, it returns the currently live workers, each with its coefficient (price multiplier) and load (active_requests out of max_active_requests). The shape is intentionally raw so that SDKs can pick a worker, compute capacity, or render a load graph without the gateway pre-aggregating anything.

{
  "object": "cocoon.workerTypes",
  "data": [
    {
      "name": "Qwen/Qwen3-32B",
      "workers": [
        {
          "coefficient": 1000,
          "active_requests": 0,
          "max_active_requests": 4
        },
        {
          "coefficient": 2000,
          "active_requests": 2,
          "max_active_requests": 4
        }
      ]
    }
  ]
}

The top-level object discriminator is always the literal string "cocoon.workerTypes". When the upstream Cocoon proxy is unreachable the gateway responds with HTTP 503 and a Retry-After: 5 header; clients should back off rather than retry tightly.
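
A client might consume this reply along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, "cheapest worker with a free slot" is just one possible selection policy, and the headers argument is assumed to be a plain dict keyed by the exact header name.

```python
from typing import Optional


def free_slots(worker: dict) -> int:
    """Unused request slots on one worker (never negative)."""
    return max(0, worker["max_active_requests"] - worker["active_requests"])


def pick_worker(reply: dict, model: str) -> Optional[dict]:
    """Cheapest worker for `model` that still has a free slot, or None."""
    for entry in reply["data"]:
        if entry["name"] == model:
            open_workers = [w for w in entry["workers"] if free_slots(w) > 0]
            # Lowest coefficient wins; ties go to the worker with more free slots.
            return min(
                open_workers,
                key=lambda w: (w["coefficient"], -free_slots(w)),
                default=None,
            )
    return None


def retry_delay_seconds(status: int, headers: dict, fallback: float = 5.0) -> Optional[float]:
    """Seconds to wait before retrying a 503, or None when no retry is needed."""
    if status != 503:
        return None
    try:
        return max(0.0, float(headers.get("Retry-After", fallback)))
    except (TypeError, ValueError):
        # Retry-After may also be an HTTP date; fall back to a fixed delay.
        return fallback
```

With the sample reply above, pick_worker selects the worker with coefficient 1000, and the two workers together have six free slots.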

List Models — OpenAI shim

GET /v1/models
Authorization: Bearer shroud_prod_...

A minimal OpenAI-compatible model catalog. Returns only id, object, and owned_by per entry — no pricing, no worker counts, no timestamps. Use this endpoint when an OpenAI SDK is enumerating models; use /v1/cocoon/workers when you need pricing or load.

{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen3-32B",
      "object": "model",
      "owned_by": "cocoon-classic"
    },
    {
      "id": "Qwen/Qwen3-32B",
      "object": "model",
      "owned_by": "cocoon-alpha"
    }
  ]
}

The unprefixed /v1/models returns the union across every enabled Cocoon network, so the same model id can appear once per network with a distinct owned_by value. The network-prefixed forms (/cocoon-classic/v1/models, /cocoon-alpha/v1/models) return only the matching network. The created field that OpenAI's spec includes is intentionally omitted: Cocoon does not expose a meaningful creation timestamp for a registered model, so synthesizing one would mislead.
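
Because the same id can appear once per network, a client that wants one entry per model may need to group the catalog itself. A minimal sketch, with a hypothetical helper name:

```python
from collections import defaultdict


def networks_by_model(catalog: dict) -> dict:
    """Group a /v1/models reply as {model id: [owned_by, ...]} in reply order."""
    grouped = defaultdict(list)
    for entry in catalog["data"]:
        grouped[entry["id"]].append(entry["owned_by"])
    return dict(grouped)
```

For the sample reply above, this returns a single key, "Qwen/Qwen3-32B", mapped to ["cocoon-classic", "cocoon-alpha"].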

Streaming Inference (WebSocket)

WebSocket /v1/cocoon/stream
Authorization: Bearer shroud_prod_...

See Wire Protocol for the detailed message format.

Network-Prefixed Routes

Each of the routes above is also reachable under a network-prefixed form (/cocoon-classic/v1/... or /cocoon-alpha/v1/...) that pins the call to a specific Cocoon network instead of the deployment default. See Cocoon networks for the full grid and SDK pinning patterns.

Security Model

Layer	Technology	Purpose
Key exchange	ECDH (Ed25519 → X25519)	Establish shared secret
Session encryption	AES-256-GCM	Encrypt all inference data
Nonce	Random 12 bytes per message	Unique nonce per message under the session key, as AES-GCM requires
TEE attestation	Intel TDX quotes	Verify code integrity
Usage signing	Ed25519 signatures	Tamper-proof billing

What the Gateway Can See

Data	Visible to Gateway?
Prompts and responses	No — encrypted
Model name	Yes — sent in init message
Token counts	Yes — from signed usage report
Session public keys	Yes — for routing
Encrypted payloads	Yes — but cannot decrypt

Selective Disclosure

By default, the TEE reveals minimal metadata. You can request specific fields to be disclosed in cleartext via the disclose parameter:

Field	Description
`prompt_tokens`	Number of prompt tokens
`cached_tokens`	Number of cached tokens
`completion_tokens`	Number of completion tokens
`reasoning_tokens`	Number of reasoning tokens
`total_tokens`	Total token count (always included)
`model`	Model ID used for inference
`proxy_start_time`	Timestamp when proxy received request
`proxy_end_time`	Timestamp when proxy finished
`worker_start_time`	Timestamp when inference worker started
`worker_end_time`	Timestamp when inference worker finished
`worker_debug`	Worker debug information
`proxy_debug`	Proxy debug information

The TEE responds with effective_disclose — the negotiated set of fields it will reveal. total_tokens is always included as a minimum for billing purposes.
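
On the client side, the negotiation can be handled roughly as follows. The field names come from the table above; the helper names are hypothetical, and how disclose and effective_disclose are carried on the wire is described in Wire Protocol, not assumed here.

```python
DISCLOSABLE_FIELDS = frozenset({
    "prompt_tokens", "cached_tokens", "completion_tokens", "reasoning_tokens",
    "total_tokens", "model", "proxy_start_time", "proxy_end_time",
    "worker_start_time", "worker_end_time", "worker_debug", "proxy_debug",
})


def validate_disclose(requested):
    """Reject unknown field names; return the request de-duplicated and sorted."""
    unknown = sorted(set(requested) - DISCLOSABLE_FIELDS)
    if unknown:
        raise ValueError(f"unknown disclose fields: {unknown}")
    return sorted(set(requested))


def withheld(requested, effective_disclose):
    """Fields the caller asked for that the TEE did not agree to reveal."""
    return sorted(set(requested) - set(effective_disclose))
```

Comparing the request against effective_disclose lets a caller notice early that a field it depends on, such as worker_debug, will not be present in the reply.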

TEE Attestation

Each session includes a TDX attestation quote that proves:

Code integrity — the exact code running inside the TEE matches a known image hash
Key binding — the TEE's public key is bound to the quote via report_data = SHA-512(ed25519_pubkey)
Platform — the quote is generated by Intel TDX hardware

Attestation Verification

The SDKs verify attestation automatically when an AttestationPolicy is configured:

1. Parse the TDX quote (versions 3, 4, or 5)
2. Verify report_data == SHA-512(tee_public_key), which ensures the key belongs to this TEE
3. Compute the image hash: SHA-256(MRTD || MR_CONFIG_ID || MR_OWNER || MR_OWNER_CONFIG || RTMR[0..3] || zeros[64])
4. Check the image hash against the allowed list in the policy
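
The hashing steps of this procedure can be sketched as below. The sketch assumes the quote has already been parsed and its signature chain verified elsewhere, reads zeros[64] as 64 zero bytes, and uses hypothetical function names; it does not show the SDKs' AttestationPolicy API.

```python
import hashlib
import hmac


def key_is_bound(report_data: bytes, tee_public_key: bytes) -> bool:
    """True when report_data == SHA-512(tee_public_key)."""
    expected = hashlib.sha512(tee_public_key).digest()
    return hmac.compare_digest(report_data, expected)


def image_hash(mrtd: bytes, mr_config_id: bytes, mr_owner: bytes,
               mr_owner_config: bytes, rtmrs: list) -> bytes:
    """SHA-256 over the measurement registers, RTMR[0..3], and 64 zero bytes."""
    if len(rtmrs) != 4:
        raise ValueError("expected RTMR[0] through RTMR[3]")
    digest = hashlib.sha256()
    for part in (mrtd, mr_config_id, mr_owner, mr_owner_config, *rtmrs, bytes(64)):
        digest.update(part)
    return digest.digest()


def policy_allows(image: bytes, allowed_hashes) -> bool:
    """True when the computed image hash is on the policy's allowlist."""
    return any(hmac.compare_digest(image, allowed) for allowed in allowed_hashes)
```

A session should be rejected unless both key_is_bound and policy_allows return True.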

For the end-to-end trust chain (open-source TEE proxy, reproducible build, image-hash allowlist, Intel DCAP signature path), see How attestation works. For the three verification workflows, each aimed at a different audience (default policy, custom allowlist, reproduce-the-build), see Verification paths. For what attestation does and does not protect against, see Threat model.

Usage Attestation

After inference completes, the TEE signs a usage report:

1. TEE serializes usage data to JSON
2. TEE computes hash = SHA-256(usage_json)
3. TEE signs the hash with Ed25519: signature = Ed25519.sign(hash, tee_private_key)
4. Client verifies the signature using the TEE's public key from the session

Because the gateway does not hold the TEE's private key, it cannot alter the token counts without invalidating the signature.
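
The client-side check can be sketched as follows. The Python standard library has no Ed25519 implementation, so the verifier is passed in as ed25519_verify, a hypothetical callback taking (public_key, signature, message) and returning a bool; in practice it would wrap whichever cryptography library the client uses.

```python
import hashlib
from typing import Callable


def usage_digest(usage_json: bytes) -> bytes:
    """hash = SHA-256(usage_json), computed over the exact bytes received."""
    return hashlib.sha256(usage_json).digest()


def verify_usage(usage_json: bytes, signature: bytes, tee_public_key: bytes,
                 ed25519_verify: Callable[[bytes, bytes, bytes], bool]) -> bool:
    """Check the TEE's signature over the usage report digest."""
    if len(signature) != 64:  # Ed25519 signatures are always 64 bytes
        return False
    return ed25519_verify(tee_public_key, signature, usage_digest(usage_json))
```

The digest should be taken over the raw JSON bytes as they arrived. Parsing and re-serializing the report can change whitespace or key order, which changes the hash and makes a valid signature fail.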

TEE proxy configuration

These environment variables configure the TEE proxy (cocoon-bridge) at startup. They affect runtime behavior on a per-deployment basis and are managed by the operator, not the API caller.

Default-disable-thinking allowlist

Variable	Default	Format
`COCOON_DEFAULT_DISABLE_THINKING_MODELS`	`Qwen/Qwen3-32B`	comma-separated list of model-name prefixes

When a /v1/chat/completions request targets a model whose name starts with any prefix in this list and the caller did not set chat_template_kwargs.enable_thinking, the TEE proxy injects chat_template_kwargs.enable_thinking=false into the decrypted request before forwarding to the worker. This is the platform-wide default that produces the clean-content behavior described in Reasoning content and <think> tags.

Matching is case-sensitive prefix matching. Each comma-separated entry is trimmed; empty entries are dropped.

Examples:

# Default: matches Qwen/Qwen3-32B (and, since matching is by prefix, any longer name that starts with it)
COCOON_DEFAULT_DISABLE_THINKING_MODELS="Qwen/Qwen3-32B"

# Broaden to all Qwen3 variants without redeploying when new sizes ship
COCOON_DEFAULT_DISABLE_THINKING_MODELS="Qwen/Qwen3-"

# Multiple prefixes
COCOON_DEFAULT_DISABLE_THINKING_MODELS="Qwen/Qwen3-,deepseek-ai/DeepSeek-R1-"

# Disable platform-wide auto-injection — operator escape hatch
COCOON_DEFAULT_DISABLE_THINKING_MODELS=""

The empty-string value is the documented escape hatch: it leaves all models at their native worker default, which for current Qwen3 weights is "thinking on". Use this if a deployment needs to expose chain-of-thought behavior to all callers without each caller setting chat_template_kwargs.enable_thinking=true.

Caller-set values always win over the platform default. If a request explicitly sets chat_template_kwargs.enable_thinking to either true or false, the proxy passes the value through unchanged regardless of the allowlist.
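
The matching and precedence rules described in this section can be summarized in a few lines. This is an illustrative sketch with hypothetical function names, not the cocoon-bridge source.

```python
from typing import List, Optional


def parse_prefixes(raw: str) -> List[str]:
    """Split on commas, trim each entry, and drop empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def should_inject_disable_thinking(model: str,
                                   chat_template_kwargs: Optional[dict],
                                   prefixes: List[str]) -> bool:
    """True when the proxy would inject enable_thinking=false."""
    # A caller-set value, true or false, always wins over the platform default.
    if chat_template_kwargs and "enable_thinking" in chat_template_kwargs:
        return False
    # Case-sensitive prefix match against the allowlist.
    return any(model.startswith(prefix) for prefix in prefixes)
```

An empty variable parses to an empty list, so no model matches and nothing is injected, which is the escape-hatch behavior described above.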

Next Steps

Go SDK — Go client with ECDH, AES-256-GCM, and attestation
TypeScript SDK — TypeScript client for Node.js and browsers
Wire Protocol — WebSocket message format details

Last modified: 08 May 2026